1. 24 Apr, 2017 8 commits
  2. 23 Apr, 2017 6 commits
  3. 21 Apr, 2017 22 commits
    • lightnvm: don't print a warning for ADDR_EMPTY · 659226eb
      Dan Carpenter authored
      Reading from ADDR_EMPTY is out of bounds.  The current code generates a
      static checker warning because we check for out of bounds "lba" before
      we check for ADDR_EMPTY, so the second check is always false.  It looks
      like we intended ADDR_EMPTY to be a no-op without printing a warning.
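
      A minimal sketch of the intended ordering, with ADDR_EMPTY handled as a
      silent no-op before the bounds check (variable names illustrative, not
      the exact pblk code):

          if (lba == ADDR_EMPTY)          /* unmapped sector: skip quietly */
                  continue;
          if (lba >= nr_secs) {           /* genuinely out of bounds: warn */
                  WARN(1, "pblk: read lba out of bounds\n");
                  continue;
          }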
      
      Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      659226eb
    • lightnvm: potential underflow in pblk_read_rq() · 5bf1e1ee
      Dan Carpenter authored
      This is a static checker fix, and perhaps not a real bug.  The static
      checker thinks that nr_secs could be negative.  It would result in
      zeroing more memory than intended.  Anyway, even if it's not a bug,
      changing this variable to unsigned makes the code easier to audit.
      
      Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5bf1e1ee
    • block: get rid of blk_integrity_revalidate() · 19b7ccf8
      Ilya Dryomov authored
      Commit 25520d55 ("block: Inline blk_integrity in struct gendisk")
      introduced blk_integrity_revalidate(), which seems to assume ownership
      of the stable pages flag and unilaterally clears it if no blk_integrity
      profile is registered:
      
          if (bi->profile)
                  disk->queue->backing_dev_info->capabilities |=
                          BDI_CAP_STABLE_WRITES;
          else
                  disk->queue->backing_dev_info->capabilities &=
                          ~BDI_CAP_STABLE_WRITES;
      
      It's called from revalidate_disk() and rescan_partitions(), making it
      impossible to enable stable pages for drivers that support partitions
      and don't use blk_integrity: while the call in revalidate_disk() can be
      trivially worked around (see zram, which doesn't support partitions and
      hence gets away with zram_revalidate_disk()), rescan_partitions() can
      be triggered from userspace at any time.  This breaks rbd, where the
      ceph messenger is responsible for generating/verifying CRCs.
      
      Since blk_integrity_{un,}register() "must" be used for (un)registering
      the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
      setting there.  This way drivers that call blk_integrity_register() and
      use integrity infrastructure won't interfere with drivers that don't
      but still want stable pages.
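
      Roughly, the relocated flag handling looks like the sketch below (not the
      verbatim patch; the surrounding profile copy/clear is elided):

          void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
          {
                  /* ... copy the template into the disk's integrity profile ... */

                  /* stable pages are required while an integrity profile is active */
                  disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
          }

          void blk_integrity_unregister(struct gendisk *disk)
          {
                  /* ... clear the disk's integrity profile ... */

                  disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES;
          }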
      
      Fixes: 25520d55 ("block: Inline blk_integrity in struct gendisk")
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.4+, needs backporting
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      19b7ccf8
    • lightnvm: propagate pblk_init return to userspace · 8d77bb82
      Rakesh Pandit authored
      Calling ioctl(NVM_DEV_CREATE) from userspace returned ENOMEM for
      invalid arguments even though pblk (pblk_init) correctly returned
      -EINVAL to nvm_create_tgt inside the core. This patch propagates the
      correct return value to userspace.
      
      Because pblk was only recently introduced, this only needs to go into
      4.12.
      
      Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
      Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8d77bb82
    • blk-mq: Fix preempt count imbalance · abc25a69
      Bart Van Assche authored
      Avoid that the following kernel bug gets triggered:
      
      BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:349
      in_atomic(): 1, irqs_disabled(): 0, pid: 8019, name: find
      CPU: 10 PID: 8019 Comm: find Tainted: G        W I     4.11.0-rc4-dbg+ #2
      Call Trace:
       dump_stack+0x68/0x93
       ___might_sleep+0x16e/0x230
       __might_sleep+0x4a/0x80
       __ext4_get_inode_loc+0x1e0/0x4e0
       ext4_iget+0x70/0xbc0
       ext4_iget_normal+0x2f/0x40
       ext4_lookup+0xb6/0x1f0
       lookup_slow+0x104/0x1e0
       walk_component+0x19a/0x330
       path_lookupat+0x4b/0x100
       filename_lookup+0x9a/0x110
       user_path_at_empty+0x36/0x40
       vfs_statx+0x67/0xc0
       SYSC_newfstatat+0x20/0x40
       SyS_newfstatat+0xe/0x10
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      This happens because the big if/else in blk_mq_make_request() doesn't
      have a final else section that also drops the ctx. Add that.
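
      Sketched, the missing branch amounts to the following (illustrative; the
      insert and run-queue calls in that path are unchanged):

          } else {
                  /* every other branch already dropped the ctx; do it here too */
                  blk_mq_put_ctx(data.ctx);
                  /* ... queue the request and run the hw queue as before ... */
          }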
      
      Fixes: b00c53e8 ("blk-mq: fix schedule-while-atomic with scheduler attached")
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      
      Added a bit more to the commit log.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      abc25a69
    • Merge branch 'nvme-4.12' of git://git.infradead.org/nvme into for-4.12/block · f8a05a1d
      Jens Axboe authored
      Christoph writes:
      
      This is the current NVMe pile: virtualization extensions, lots of FC
      updates and various misc bits.  There are a few more FC bits that didn't
      make the cut, but we'd like to get this request out before the merge
      window for sure.
      f8a05a1d
    • mtip32xx: fix dereference of stack garbage · 95c55ff4
      Jens Axboe authored
      We need to get the command payload from the request before
      we attempt to dereference it.
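
      A sketch of the pattern (not the verbatim diff): derive the per-request
      payload first, then dereference it.

          struct mtip_cmd *cmd = blk_mq_rq_to_pdu(req);   /* fetch the payload first */

          cmd->status = -EIO;     /* only now does cmd point at real storage */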
      
      Fixes: 4dda4735 ("mtip32xx: add a status field to struct mtip_cmd")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      95c55ff4
    • nvme: let dm-mpath distinguish nvme error codes · e02ab023
      Junxiong Guan authored
      Currently, most IOs that fail with NVMe error codes are retried on
      the other path because the NVMe driver returns EIO for them. This
      patch lets multipath distinguish NVMe media error codes and some
      generic or command-specific NVMe error codes, so that multipath will
      not retry those kinds of IO and waste bandwidth.
      Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      e02ab023
    • nvme/pci: Poll CQ on timeout · 7776db1c
      Keith Busch authored
      If an IO timeout occurs, it's helpful to know if the controller did not
      post a completion or the driver missed an interrupt. While we never expect
      the latter, this patch will make it possible to tell the difference so
      we don't have to guess.
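
      The idea, roughly (a sketch; the poll helper's name and exact signature
      are assumptions, not the patch itself):

          /* in nvme_timeout(): check whether the controller posted a
           * completion that we simply never saw an interrupt for */
          if (nvme_poll_cq_for_tag(nvmeq, req->tag)) {
                  dev_warn(dev->ctrl.device,
                           "I/O %d QID %d timeout, completion polled\n",
                           req->tag, nvmeq->qid);
                  return BLK_EH_HANDLED;  /* missed interrupt, not a dead device */
          }
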
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      7776db1c
    • nvmet_fc: Change traddr field separator to a colon · 43631357
      James Smart authored
      The FC-NVME spec revised its syntax to avoid comma separators. Sync
      the parser for traddr on port attachments with that change.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      43631357
    • nvme_fc: Add ls aborts on remote port teardown · 8d64daf7
      James Smart authored
      Remoteport teardown never aborted outstanding LS operations. Add
      support for doing so.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      8d64daf7
    • nvme_fc: Move LS's to rport · c913a8b0
      James Smart authored
      Link LS's to the remoteport rather than the controller. LS's are
      exchanged between nports, so it makes more sense to track them on the
      remoteport, especially on async teardown where the controller is torn
      down regardless of the LS (the LS is more of a notifier of the teardown
      to the target).
      
      While revising the LS send/done routines, issues were seen with
      refcounting and cleanup, especially in the async path. Reworked these
      code paths.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      c913a8b0
    • nvmet_fc: add missing reference in add_port · 568ad51e
      James Smart authored
      Add the missing reference in add_port.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      568ad51e
    • nvmet_fc: Rework target side abort handling · a97ec51b
      James Smart authored
      target transport:
      ----------------------
      There are cases when there is a need to abort in-progress target
      operations (writedata) so that controller termination or errors can
      clean up. That can't happen currently as the abort is another target
      op type, so it can't be used till the running one finishes (and it may
      not).  Solve by removing the abort op type and creating a separate
      downcall from the transport to the lldd to request an io to be aborted.
      
      The transport will abort ios on queue teardown or io errors. In general
      the transport tries to call the lldd abort only when the io state is
      idle. Meaning: ops that transmit data (readdata or rsp) will always
      finish their transmit (or the lldd will see a state on the
      link or initiator port that fails the transmit) and the done call for
      the operation will occur. The transport will wait for the op done
      upcall before calling the abort function, and as the io is idle, the
      io can be cleaned up immediately after the abort call. Similarly, ios
      that are not waiting for data or transmitting data must be in the nvmet
      layer being processed. The transport will wait for the nvmet layer
      completion before calling the abort function, and as the io is idle,
      the io can be cleaned up immediately after the abort call. As for ops
      that are waiting for data (writedata), they may be outstanding
      indefinitely if the lldd doesn't see a condition where the initiator
      port or link is bad. In those cases, the transport will call the abort
      function and wait for the lldd's op done upcall for the operation, where
      it will then clean up the io.
      
      Additionally, a new transport upcall was created so that when an lldd
      receives an ABTS and matches it to an outstanding request, it can abort
      that request in the transport. The transport expects any outstanding op
      call (readdata or writedata) to be completed by the lldd and the
      operation upcall made. The transport doesn't act on the reported abort
      (e.g. clean up the io) until an op done upcall occurs, a new op is
      attempted, or the nvmet layer completes the io processing.
      
      fcloop:
      ----------------------
      Updated to support the new target apis.
      On fcp io aborts from the initiator, the loopback context is updated to
      NULL out the half that has completed. The initiator side is immediately
      called after the abort request with an io completion (abort status).
      On fcp io aborts from the target, the io is stopped and the initiator side
      sees it as an aborted io. Target side ops, perhaps in progress while the
      initiator side is done, continue but noop the data movement as there's no
      structure on the initiator side to reference.
      
      patch also contains:
      ----------------------
      Revised lpfc to support the new abort api
      
      Commonized rsp buffer syncing and nulling of private data based on
      calling paths.
      
      Errors in op done calls don't take action on the fod; they indicate bad
      operations, which implies the fod may be bad.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      a97ec51b
    • nvme_fcloop: split job struct from transport for req_release · ce79bfc2
      James Smart authored
      Current design has the fcloop job struct, used for both initiator and
      target processing, allocated as part of the initiator request structure.
      On aborts, the initiator side (based on the request) may terminate, yet
      the target side wants to continue processing. The target side can't do
      that if the initiator side goes away.
      Revise fcloop to allocate an independent target side structure when it
      starts an io from the initiator.
      
      Added a lock to the request struct as well to synchronize pointer updates
      on abort calls.
      
      Modified the target downcalls to recognize conditions where the
      initiator has aborted the io (and thus nulled the pointer between job
      structs), so that they avoid referencing sgl lists which are gone and
      no longer make upcalls to the initiator.
      
      In conditions where the targetport is no longer connected, have the
      initiator return an access failure rather than simulating a command
      completion.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      ce79bfc2
    • nvmet_fc: add req_release to lldd api · 19b58d94
      James Smart authored
      With the advent of the opdone calls changing context, the lldd can no
      longer assume that once the op->done call returns for RSP operations
      that the request struct is no longer being accessed.
      
      As such, revise the lldd api for a req_release callback that the
      transport will call when the job is complete. This will also be used
      with abort cases.
      
      Fixed text in api header for change in io complete semantics.
      
      Revised lpfc to support the new req_release api.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      19b58d94
    • nvmet_fc: add target feature flags for upcall isr contexts · 39498fae
      James Smart authored
      Two new feature flags were added to control whether upcalls to the
      transport result in context switches or stay in the calling context.
      
      NVMET_FCTGTFEAT_CMD_IN_ISR:
        By default, if the flag is not set, the transport assumes the
        lldd is in a non-isr context and in the cpu context it should be
        for the io queue. As such, the cmd handler is called directly in the
        calling context.
        If the flag is set, indicating the upcall is an isr context, the
        transport mandates a transition to a workqueue. The workqueue assigned
        to the queue is used for the context.
      NVMET_FCTGTFEAT_OPDONE_IN_ISR:
        By default, if the flag is not set, the transport assumes the
        lldd is in a non-isr context and in the cpu context it should be
        for the io queue. As such, the fcp operation done callback is called
        directly in the calling context.
        If the flag is set, indicating the upcall is an isr context, the
        transport mandates a transition to a workqueue. The workqueue assigned
        to the queue is used for the context.
      
      Updated lpfc for the new flags.
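
      For example, an lldd whose upcalls arrive from hard-irq context would
      advertise both flags in its target template, roughly (a sketch; field
      name assumed from the lldd api):

          static struct nvmet_fc_target_template lpfc_tgttemplate = {
                  /* ... ops ... */
                  .target_features = NVMET_FCTGTFEAT_CMD_IN_ISR |
                                     NVMET_FCTGTFEAT_OPDONE_IN_ISR,
          };
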
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      39498fae
    • nvmet: convert from kmap to nvmet_copy_from_sgl · 1c05cf90
      Logan Gunthorpe authored
      This is safer as it doesn't rely on the data being stored in
      a single page in an sgl.
      
      It also aids our effort to start phasing out users of sg_page. See [1].
      
      For this we kmalloc some memory, copy to it and free at the end. Note:
      we can't allocate this memory on the stack as the kbuild test robot
      reports some frame size overflows on i386.
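
      The resulting pattern, sketched (buffer and length names illustrative;
      nvmet_copy_from_sgl() returns an NVMe status code):

          buf = kmalloc(len, GFP_KERNEL);   /* heap, not stack: i386 frame size */
          if (!buf) {
                  nvmet_req_complete(req, NVME_SC_INTERNAL);
                  return;
          }

          status = nvmet_copy_from_sgl(req, 0, buf, len);
          if (!status) {
                  /* ... use buf, which may have spanned several sgl pages ... */
          }

          kfree(buf);
          nvmet_req_complete(req, status);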
      
      [1] https://lwn.net/Articles/720053/
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      1c05cf90
    • nvme: improve performance for virtual NVMe devices · f9f38e33
      Helen Koike authored
      This change provides a mechanism to reduce the number of MMIO doorbell
      writes for the NVMe driver. When running in a virtualized environment
      like QEMU, the cost of an MMIO write is quite hefty. The main idea of
      the patch is to provide the device with two memory locations:
       1) to store the doorbell values so they can be looked up without the
          doorbell MMIO write
       2) to store an event index.
      I believe the doorbell value is obvious; the event index not so much.
      Similar to the virtio specification, the virtual device can tell the
      driver (guest OS) not to write MMIO unless it is writing past this
      value.
      
      FYI: doorbell values are written by the nvme driver (guest OS) and the
      event index is written by the virtual device (host OS).
      
      The patch implements a new admin command that will communicate where
      these two memory locations reside. If the command fails, the nvme
      driver will work as before without any optimizations.
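
      The event-index test mirrors virtio's vring_need_event(): skip the MMIO
      write unless the new doorbell value has moved past the event index the
      device last advertised. A sketch of that check (shadow-buffer pointer
      names illustrative):

          /* has 'new' passed the device's event index since 'old'? */
          static inline bool nvme_dbbuf_need_event(u16 event_idx, u16 new, u16 old)
          {
                  return (u16)(new - event_idx - 1) < (u16)(new - old);
          }

          /* per doorbell update: record the value in the shadow buffer and
           * only do the MMIO write if the device asked to be notified
           * (the real driver also needs memory barriers around these accesses) */
          old = *dbbuf_db;
          *dbbuf_db = new;
          if (nvme_dbbuf_need_event(*dbbuf_ei, new, old))
                  writel(new, q_db);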
      
      Contributions:
        Eric Northup <digitaleric@google.com>
        Frank Swiderski <fes@google.com>
        Ted Tso <tytso@mit.edu>
        Keith Busch <keith.busch@intel.com>
      
      Just to give an idea of the performance boost with the vendor
      extension: running fio [1] with a stock NVMe driver I get about 200K
      read IOPS; with my vendor patch I get about 1000K read IOPS. This was
      run against a null device, i.e. the backing device simply returned
      success on every read IO request.
      
      [1] Running on a 4 core machine:
        fio --time_based --name=benchmark --runtime=30
        --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
        --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
        --rw=randread --blocksize=4k --randrepeat=false
      Signed-off-by: Rob Nelson <rlnelson@google.com>
      [mlin: port for upstream]
      Signed-off-by: Ming Lin <mlin@kernel.org>
      [koike: updated for upstream]
      Signed-off-by: Helen Koike <helen.koike@collabora.co.uk>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      f9f38e33
    • nvme/pci: Don't set reserved SQ create flags · 81c1cd98
      Keith Busch authored
      The QPRIO field is only valid if weighted round robin arbitration is used,
      and this driver doesn't enable that controller configuration option.
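
      In practice the SQ creation flags should carry only the
      physically-contiguous bit; a sketch of the change:

          /* QPRIO is reserved unless CC.AMS selects weighted round robin */
          int flags = NVME_QUEUE_PHYS_CONTIG;   /* was: ... | NVME_SQ_PRIO_MEDIUM */
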
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      81c1cd98
    • blk-stat: kill blk_stat_rq_ddir() · 99c749a4
      Jens Axboe authored
      No point in providing and exporting this helper. There's just
      one (real) user of it; just use rq_data_dir() there.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      99c749a4
    • nbd: set the max segments to USHRT_MAX · 1cc1f17a
      Josef Bacik authored
      Owing to a misunderstanding of what segments mean, we were being
      limited to 512KiB requests even with higher max_sectors sizes set.
      Setting the maximum number of segments to unlimited allows arbitrarily
      large IOs to actually go through NBD.
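
      The change boils down to a single queue-limit call when the nbd disk is
      set up (a sketch):

          /* allow effectively unlimited segment counts per request */
          blk_queue_max_segments(disk->queue, USHRT_MAX);
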
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1cc1f17a
  4. 20 Apr, 2017 4 commits
    • blk-mq: Remove blk_mq_sched_move_to_dispatch() · 246665db
      Bart Van Assche authored
      commit c13660a0 ("blk-mq-sched: change ->dispatch_requests()
      to ->dispatch_request()") removed the last user of this function.
      Hence also remove the function itself.
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      246665db
    • blk-mq: add might_sleep check to blk_mq_get_driver_tag() · 5feeacdd
      Jens Axboe authored
      If the caller passes in wait=true, it has to be able to block
      for a driver tag. We just had a bug where flush insertion
      would block on tag allocation, while we had preempt disabled.
      Ensure that we catch cases like that earlier next time.
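
      The check itself is a one-liner at the top of the helper, roughly:

          /* blk_mq_get_driver_tag(): a caller asking for a blocking tag
           * allocation must be in a context that can actually sleep */
          might_sleep_if(wait);
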
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5feeacdd
    • blk-mq: Fix poll_stat for new size-based bucketing. · 0206319f
      Stephen Bates authored
      Fixes an issue where the size of the poll_stat array in request_queue
      does not match the size expected by the new size-based bucketing for
      IO completion polling.
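
      Roughly, the array in struct request_queue just has to match the bucket
      count used by the stats code (macro name from the bucketing patch, cited
      from memory):

          struct blk_rq_stat poll_stat[BLK_MQ_POLL_STATS_BKTS];  /* was [2] */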
      
      Fixes: 720b8ccc ("blk-mq: Add a polling specific stats function")
      Signed-off-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0206319f
    • blk-mq: fix schedule-while-atomic with scheduler attached · b00c53e8
      Jens Axboe authored
      We must have dropped the ctx before we call
      blk_mq_sched_insert_request() with can_block=true, otherwise we risk
      that a flush request can block on insertion if we are currently out of
      tags.
      
      [   47.667190] BUG: scheduling while atomic: jbd2/sda2-8/2089/0x00000002
      [   47.674493] Modules linked in: x86_pkg_temp_thermal btrfs xor zlib_deflate raid6_pq sr_mod cdre
      [   47.690572] Preemption disabled at:
      [   47.690584] [<ffffffff81326c7c>] blk_mq_sched_get_request+0x6c/0x280
      [   47.701764] CPU: 1 PID: 2089 Comm: jbd2/sda2-8 Not tainted 4.11.0-rc7+ #271
      [   47.709630] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
      [   47.718081] Call Trace:
      [   47.720903]  dump_stack+0x4f/0x73
      [   47.724694]  ? blk_mq_sched_get_request+0x6c/0x280
      [   47.730137]  __schedule_bug+0x6c/0xc0
      [   47.734314]  __schedule+0x559/0x780
      [   47.738302]  schedule+0x3b/0x90
      [   47.741899]  io_schedule+0x11/0x40
      [   47.745788]  blk_mq_get_tag+0x167/0x2a0
      [   47.750162]  ? remove_wait_queue+0x70/0x70
      [   47.754901]  blk_mq_get_driver_tag+0x92/0xf0
      [   47.759758]  blk_mq_sched_insert_request+0x134/0x170
      [   47.765398]  ? blk_account_io_start+0xd0/0x270
      [   47.770679]  blk_mq_make_request+0x1b2/0x850
      [   47.775766]  generic_make_request+0xf7/0x2d0
      [   47.780860]  submit_bio+0x5f/0x120
      [   47.784979]  ? submit_bio+0x5f/0x120
      [   47.789631]  submit_bh_wbc.isra.46+0x10d/0x130
      [   47.794902]  submit_bh+0xb/0x10
      [   47.798719]  journal_submit_commit_record+0x190/0x210
      [   47.804686]  ? _raw_spin_unlock+0x13/0x30
      [   47.809480]  jbd2_journal_commit_transaction+0x180a/0x1d00
      [   47.815925]  kjournald2+0xb6/0x250
      [   47.820022]  ? kjournald2+0xb6/0x250
      [   47.824328]  ? remove_wait_queue+0x70/0x70
      [   47.829223]  kthread+0x10e/0x140
      [   47.833147]  ? commit_timeout+0x10/0x10
      [   47.837742]  ? kthread_create_on_node+0x40/0x40
      [   47.843122]  ret_from_fork+0x29/0x40
      
      Fixes: a4d907b6 ("blk-mq: streamline blk_mq_make_request")
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b00c53e8