1. 14 Feb, 2018 1 commit
  2. 12 Feb, 2018 1 commit
    • nvme: Don't use a stack buffer for keep-alive command · 0a34e466
      Roland Dreier authored
      In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
      the stack into blk_execute_rq_nowait().  However, the block layer doesn't
      guarantee that the request is fully queued before blk_execute_rq_nowait()
      returns.  If not, and the request is queued after nvme_keep_alive() returns,
      then we'll end up using stack memory that might have been overwritten to
      form the NVMe command we pass to hardware.
      
      Fix this by keeping a special command struct in the nvme_ctrl struct right
      next to the delayed work struct used for keep-alives.
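
      The lifetime issue can be illustrated in userspace. This is only a
      sketch, not the driver code: struct fake_ctrl, submit_nowait() and
      run_queue() are hypothetical stand-ins for nvme_ctrl,
      blk_execute_rq_nowait() and the deferred queueing. Only the idea (the
      command lives in the controller struct, not on the caller's stack) is
      taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_OP_KEEP_ALIVE 0x18 /* illustrative opcode value */

/* Hypothetical stand-in for an NVMe command; only the opcode matters here. */
struct fake_cmd {
    uint8_t opcode;
};

/* Hypothetical controller: the keep-alive command is embedded in it,
 * so it lives as long as the controller does. */
struct fake_ctrl {
    struct fake_cmd ka_cmd;         /* long-lived command storage */
    const struct fake_cmd *pending; /* what the "block layer" has queued */
};

/* Models an asynchronous submit: only the pointer is recorded, and the
 * command is read later, after the caller has returned. */
static void submit_nowait(struct fake_ctrl *ctrl, const struct fake_cmd *cmd)
{
    assert(ctrl != NULL && cmd != NULL);
    ctrl->pending = cmd;
}

/* Fixed pattern: fill in the embedded command, never a stack local. */
static void keep_alive(struct fake_ctrl *ctrl)
{
    ctrl->ka_cmd.opcode = FAKE_OP_KEEP_ALIVE;
    submit_nowait(ctrl, &ctrl->ka_cmd);
}

/* Models the deferred dispatch; returns the opcode sent to "hardware",
 * or -1 if nothing is queued. */
static int run_queue(struct fake_ctrl *ctrl)
{
    int op = ctrl->pending ? ctrl->pending->opcode : -1;

    ctrl->pending = NULL;
    return op;
}
```

      Because run_queue() dereferences the pointer after keep_alive() has
      returned, the pointed-to command must outlive the call, which the
      embedded ka_cmd member guarantees in this sketch.
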
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  3. 11 Feb, 2018 2 commits
    • nvme_fc: cleanup io completion · c3aedd22
      James Smart authored
      There was some old code that dealt with complete_rq being called
      prior to the lldd returning the io completion. This is garbage code.
      The complete_rq routine was being called after eh_timeouts were
      called, and that was due to eh_timeouts not being handled properly.
      The timeouts were fixed in prior patches so that, in general, a
      timeout initiates an abort and restarts the reset timer, as
      the abort operation takes care of completing things. With the
      reset timer restarted, the erroneous complete_rq calls were eliminated.
      
      So remove the work that was synchronizing complete_rq with io
      completion.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme_fc: correct abort race condition on resets · 3efd6e8e
      James Smart authored
      During reset handling, there is live io completing while the reset
      is taking place. The reset path attempts to abort all outstanding io,
      counting the number of ios that were reset. It then waits for those
      ios to be reclaimed from the lldd before continuing.
      
      The transport's logic on io state and flag setting was poor, allowing
      ios to complete simultaneously with the abort request. The completed ios
      were counted, but as the completion had already occurred, the
      completion never reduced the count. As the count never reaches zero, the
      reset/delete never completes.
      
      Tighten it up by unconditionally changing the op state to completed
      when the io done handler is called.  The reset/abort path now changes
      the op state to aborted, but the abort only continues if the op
      state was live previously. If complete, the abort is backed out.
      Thus proper counting of io aborts and their completions is working
      again.
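
      The state handling described above can be sketched with C11 atomics.
      This is an illustration under assumptions, not the transport code: the
      state names, struct fake_op, struct fake_ctrl and the counter are
      hypothetical, and the real driver may serialize these paths
      differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical op states, loosely modelled on the description above. */
enum op_state { OP_LIVE, OP_ABORTED, OP_COMPLETE };

struct fake_op {
    atomic_int state;
};

struct fake_ctrl {
    atomic_int aborts_outstanding; /* aborts still waiting for a completion */
};

/* Abort path: mark the op aborted, but only count it if it was live. */
static bool abort_op(struct fake_ctrl *ctrl, struct fake_op *op)
{
    int prev;

    assert(ctrl != NULL && op != NULL);
    prev = atomic_exchange(&op->state, OP_ABORTED);
    if (prev != OP_LIVE) {
        /* Already completed (or already aborted): back the change out. */
        atomic_store(&op->state, prev);
        return false;
    }
    atomic_fetch_add(&ctrl->aborts_outstanding, 1);
    return true;
}

/* Done handler: unconditionally move to complete; an op that had been
 * aborted gives its count back. */
static void op_done(struct fake_ctrl *ctrl, struct fake_op *op)
{
    int prev;

    assert(ctrl != NULL && op != NULL);
    prev = atomic_exchange(&op->state, OP_COMPLETE);
    if (prev == OP_ABORTED)
        atomic_fetch_sub(&ctrl->aborts_outstanding, 1);
}
```

      In this sketch an op that completes before the abort reaches it is
      never counted, so a waiter on aborts_outstanding is not left waiting
      for a completion that has already happened.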
      
      Also removed the TERMIO state on the op as it's redundant with the
      op's aborted state.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  4. 08 Feb, 2018 4 commits
  5. 07 Feb, 2018 9 commits
    • block, bfq: add requeue-request hook · a7877390
      Paolo Valente authored
      Commit 'a6a252e6 ("blk-mq-sched: decide how to handle flush rq via
      RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
      be re-inserted into the active I/O scheduler for that device. As a
      consequence, I/O schedulers may get the same request inserted again,
      even several times, without a finish_request invoked on that request
      before each re-insertion.
      
      This fact is the cause of the failure reported in [1]. For an I/O
      scheduler, every re-insertion of the same re-prepared request is
      equivalent to the insertion of a new request. For schedulers like
      mq-deadline or kyber, this fact causes no harm. In contrast, it
      confuses a stateful scheduler like BFQ, which keeps state for an I/O
      request, until the finish_request hook is invoked on the request. In
      particular, BFQ may get stuck, waiting forever for the number of
      request dispatches, of the same request, to be balanced by an equal
      number of request completions (while there will be one completion for
      that request). In this state, BFQ may refuse to serve I/O requests
      from other bfq_queues. The hang reported in [1] then follows.
      
      However, the above re-prepared requests undergo a requeue, thus the
      requeue_request hook of the active elevator is invoked for these
      requests, if set. This commit then addresses the above issue by
      properly implementing the hook requeue_request in BFQ.
      
      [1] https://marc.info/?l=linux-block&m=151211117608676
      Reported-by: Ivan Kozik <ivan@ludios.org>
      Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix for data collapse after re-attaching an attached device · 73ac105b
      Tang Junhui authored
      Back-end device sdm is already attached to a cache set with ID
      f67ebe1f-f8bc-4d73-bfe5-9dc88607f119. Trying to attach it to
      another cache set returns an error:
      [root]# cd /sys/block/sdm/bcache
      [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
      -bash: echo: write error: Invalid argument
      
      After that, run a command to modify the label of the bcache
      device:
      [root]# echo data_disk1 > label
      
      Then reboot the system. When the system powers on, the back-end
      device cannot attach to the cache set, and a message shows in the log:
      Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
      bch_cached_dev_attach() couldn't find uuid for sdm in set
      
      In sysfs_attach(), dc->sb.set_uuid was assigned the value
      input through sysfs, regardless of whether bch_cached_dev_attach()
      succeeded. For example, if the back-end device is already
      attached to a cache set, bch_cached_dev_attach() fails, but
      dc->sb.set_uuid has already been changed. Modifying the label of
      the bcache device then calls bch_write_bdev_super(), which writes
      dc->sb.set_uuid to the super block, so a wrong cache set ID is
      recorded in the super block. After the system reboots, the cache
      set cannot find the uuid of the back-end device, so the bcache
      device no longer exists and cannot be used any more.
      
      In this patch, we don't assign the cache set ID to dc->sb.set_uuid
      in sysfs_attach() directly, but pass it into bch_cached_dev_attach(),
      and assign it to dc->sb.set_uuid only after the back-end
      device has attached to the cache set successfully.
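
      A minimal userspace sketch of that ordering follows. The names
      struct fake_cached_dev and fake_dev_attach() are hypothetical, and the
      -EINVAL return is assumed from the "Invalid argument" error shown
      above; the point is only that the stored UUID is written after every
      check has passed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical, trimmed-down stand-in for the cached device. */
struct fake_cached_dev {
    unsigned char set_uuid[UUID_LEN]; /* models dc->sb.set_uuid */
    bool attached;
};

/* Models an attach routine that takes the target set UUID as a parameter
 * instead of relying on the caller having stored it in the device. */
static int fake_dev_attach(struct fake_cached_dev *dc,
                           const unsigned char *set_uuid)
{
    assert(dc != NULL && set_uuid != NULL);
    if (dc->attached)
        return -EINVAL; /* already attached: leave set_uuid untouched */

    memcpy(dc->set_uuid, set_uuid, UUID_LEN); /* commit only on success */
    dc->attached = true;
    return 0;
}
```

      With this shape, a failed second attach leaves the first UUID in
      place, so a later super block write cannot record the wrong cache set.
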
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: return attach error when no cache set exist · 7f4fc93d
      Tang Junhui authored
      When I attach a back-end device to a cache set that is not
      registered yet, the back-end device does not attach successfully, but no
      error is returned:
      [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
      [root]#
      
      In sysfs_attach(), the return value "v" is initialized to "size" at
      the beginning. If no cache set exists in bch_cache_sets, "v" is never
      changed and is returned to sysfs, which regards it as success
      since "size" is a positive number.
      
      This patch fixes this issue by initializing "v" to "-ENOENT".
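
      The effect of the initial value can be shown with a small sketch.
      struct fake_cache_set, fake_attach() and fake_sysfs_attach() are
      hypothetical stand-ins; the real function walks bch_cache_sets and
      calls bch_cached_dev_attach().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a registered cache set. */
struct fake_cache_set {
    char uuid[37];
};

/* Returns 0 when the device attaches to this set, -ENOENT otherwise. */
static int fake_attach(const struct fake_cache_set *c, const char *uuid)
{
    return strcmp(c->uuid, uuid) == 0 ? 0 : -ENOENT;
}

/* Models a sysfs store handler: return bytes consumed on success or a
 * negative errno on failure. */
static long fake_sysfs_attach(const struct fake_cache_set *sets, size_t nsets,
                              const char *uuid, size_t size)
{
    long v = -ENOENT; /* starting from "size" made an empty list look OK */
    size_t i;

    assert(uuid != NULL);
    for (i = 0; i < nsets; i++) {
        v = fake_attach(&sets[i], uuid);
        if (!v)
            return (long)size;
    }
    return v;
}
```

      With no registered sets the loop body never runs, so the caller sees
      the initial -ENOENT rather than a positive byte count.
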
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set writeback_rate_update_seconds in range [1, 60] seconds · 7a5e3ecb
      Coly Li authored
      dc->writeback_rate_update_seconds can be set via sysfs and its value can
      be set to [1, ULONG_MAX].  It does not make sense to set such a large
      value; 60 seconds is long enough, considering that the default of 5
      seconds has worked well for a long time.
      
      Because dc->writeback_rate_update is a special delayed work, it re-arms
      itself inside the delayed work routine update_writeback_rate(). When
      stopping it by cancel_delayed_work_sync(), there should be a timeout to
      wait and make sure the re-armed delayed work is stopped too. A small max
      value of dc->writeback_rate_update_seconds also helps in choosing a
      reasonably small timeout.
      
      This patch limits the sysfs interface to setting
      dc->writeback_rate_update_seconds in the range [1, 60] seconds, and
      replaces the hand-coded numbers with macros.
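
      A range limit of this kind amounts to a clamp. The sketch below uses
      assumed macro and function names (RATE_UPDATE_SECS_MIN,
      RATE_UPDATE_SECS_MAX, clamp_rate_update_seconds()); they are not quoted
      from the patch.

```c
#include <assert.h>
#include <limits.h>

/* Assumed names for the bounds; the values come from the text above. */
#define RATE_UPDATE_SECS_MIN 1UL
#define RATE_UPDATE_SECS_MAX 60UL

/* Clamp a user-supplied interval, in seconds, into [1, 60]. */
static unsigned long clamp_rate_update_seconds(unsigned long v)
{
    if (v < RATE_UPDATE_SECS_MIN)
        return RATE_UPDATE_SECS_MIN;
    if (v > RATE_UPDATE_SECS_MAX)
        return RATE_UPDATE_SECS_MAX;
    return v;
}
```

      Any value written from userspace, including ULONG_MAX, then maps to an
      interval that a bounded cancellation timeout can cover.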
      
      Changelog:
      v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
      v1: initial version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix for allocator and register thread race · 682811b3
      Tang Junhui authored
      After running random small IO writes for a long time,
      I rebooted the machine, and after the machine powered on,
      I found bcache got stuck. The stacks are:
      [root@ceph153 ~]# cat /proc/2510/task/*/stack
      [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
      [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
      [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
      [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
      [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
      [<ffffffff810a631f>] kthread+0xcf/0xe0
      [<ffffffff8164c318>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      [root@ceph153 ~]# cat /proc/2038/task/*/stack
      [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
      [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
      [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
      [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
      [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
      [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
      [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
      [<ffffffff811e069f>] SyS_write+0x7f/0xe0
      [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
      The stacks show that the register thread and the allocator thread
      got stuck when registering the cache device.
      
      I rebooted the machine several times, and the issue always
      exists on this machine.
      
      I debugged the code and found the call trace below:
      register_bcache()
         ==>run_cache_set()
            ==>bch_journal_replay()
               ==>bch_btree_insert()
                  ==>__bch_btree_map_nodes()
                     ==>btree_insert_fn()
                        ==>btree_split() //node need split
                           ==>btree_check_reserve()
      btree_check_reserve() checks whether there are enough buckets
      of RESERVE_BTREE type. Since the allocator thread has not run yet,
      no buckets of RESERVE_BTREE type are allocated, so the register thread
      waits on c->btree_cache_wait and goes to sleep.
      
      Then the allocator thread is initialized; the call trace is below:
      bch_allocator_thread()
      ==>bch_prio_write()
         ==>bch_journal_meta()
            ==>bch_journal()
               ==>journal_wait_for_write()
      journal_wait_for_write() checks whether the journal is full via
      journal_full(), but the long-running random small IO writes
      have exhausted the journal buckets (journal.blocks_free=0).
      In order to release journal buckets,
      the allocator calls btree_flush_write() to flush keys to
      btree nodes, and waits on c->journal.wait until the btree node writes
      finish or some journal bucket space becomes available, so the
      allocator thread goes to sleep. But in btree_flush_write(), since
      bch_journal_replay() has not finished, no btree nodes have a journal
      (the condition "if (btree_current_write(b)->journal)" is never satisfied),
      so there is no btree node to flush, no journal bucket is released,
      and the allocator sleeps forever.
      
      From the above analysis, we can see that:
      1) The register thread waits for the allocator thread to allocate
         buckets of RESERVE_BTREE type;
      2) The allocator thread waits for the register thread to replay the
         journal, so that it can flush btree nodes and get a journal bucket.
      So they both get stuck waiting for each other.
      
      Hua Rui provided a patch for me that allocates some buckets of
      RESERVE_BTREE type in advance, so the register thread can get a bucket
      when a btree node splits without waiting for the allocator
      thread. I tested it and it helps: the register thread runs a step
      forward, but finally still gets stuck. The reason is that only 8 buckets
      of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
      after 2 btree node splits, only 4 buckets of RESERVE_BTREE type were left.
      Then btree_check_reserve() is no longer satisfied, so it goes to sleep
      again, and at the same time the allocator thread has not flushed enough
      btree nodes to release a journal bucket, so they both get stuck again.
      
      So we need to allocate more buckets of RESERVE_BTREE type in advance,
      but how many are enough?  From experience and testing, I think it should
      be as many as there are journal buckets. I modified the code as in this
      patch, tested it on the machine, and it works.
      
      This patch is based on Hua Rui’s patch, and allocates more buckets
      of RESERVE_BTREE type in advance to keep the register thread and the
      allocator thread from waiting for each other.
      
      [patch v2] ca->sb.njournal_buckets is 0 the first time after
      cache creation, when no journal exists, so just 8 btree buckets is OK.
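
      The sizing rule stated above (as many reserve buckets as journal
      buckets, falling back to 8 when there is no journal yet) can be written
      as a one-line helper. The function and macro names here are
      hypothetical; this is a sketch of the rule, not the patch itself.

```c
#include <assert.h>

#define FAKE_MIN_BTREE_RESERVE 8u /* the "8 btree buckets" fallback */

/* Number of RESERVE_BTREE buckets to set aside in advance. */
static unsigned int btree_reserve_buckets(unsigned int njournal_buckets)
{
    return njournal_buckets ? njournal_buckets : FAKE_MIN_BTREE_RESERVE;
}
```
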
      Signed-off-by: Hua Rui <huarui.dev@gmail.com>
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set error_limit correctly · 7ba0d830
      Coly Li authored
      Struct cache uses io_errors for two purposes:
      - Error decay: when cache set error_decay is set, io_errors is used to
        generate a small delay when an I/O error happens.
      - I/O errors counter: in order to generate a big enough value for error
        decay, the I/O errors counter value is stored left-shifted by 20 bits
        (a.k.a. IO_ERROR_SHIFT).
      
      In function bch_count_io_errors(), if the I/O errors counter reaches the
      cache set error limit, bch_cache_set_error() will be called to retire the
      whole cache set. But the current code is problematic when checking the
      error limit; see the following code piece from bch_count_io_errors(),
      
       90     if (error) {
       91             char buf[BDEVNAME_SIZE];
       92             unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
       93                                                 &ca->io_errors);
       94             errors >>= IO_ERROR_SHIFT;
       95
       96             if (errors < ca->set->error_limit)
       97                     pr_err("%s: IO error on %s, recovering",
       98                            bdevname(ca->bdev, buf), m);
       99             else
      100                     bch_cache_set_error(ca->set,
      101                                         "%s: too many IO errors %s",
      102                                         bdevname(ca->bdev, buf), m);
      103     }
      
      At line 94, errors is right-shifted by IO_ERROR_SHIFT bits, so it is now
      the real errors counter to compare at line 96. But ca->set->error_limit
      is initialized with an amplified value in bch_cache_set_alloc(),
      1545         c->error_limit  = 8 << IO_ERROR_SHIFT;
      
      It means that by default, in bch_count_io_errors(), bch_cache_set_error()
      won't be called to retire the problematic cache device before 8<<20
      errors have happened. If the average request size is 64KB, bcache won't
      handle a failed device until 512GB of data has been requested. This is
      too large to be an I/O threshold. So I believe the correct error limit
      should be much less.
      
      This patch sets the default cache set error limit to 8, so in
      bch_count_io_errors(), when the errors counter reaches 8 (if it is the
      default value), function bch_cache_set_error() will be called to retire
      the whole cache set. This patch also removes the bit shifting when
      storing or showing the io_error_limit value via the sysfs interface.
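
      The mismatch and the 512GB figure can be checked with a few lines of
      arithmetic. IO_ERROR_SHIFT is taken from the text above; limit_reached()
      and bytes_before_retire() are hypothetical helpers written for this
      illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IO_ERROR_SHIFT 20

/* The counter is stored shifted left; the comparison uses the real count. */
static bool limit_reached(uint32_t stored_errors, uint32_t error_limit)
{
    uint32_t errors = stored_errors >> IO_ERROR_SHIFT;

    return errors >= error_limit;
}

/* Data requested before the limit trips if every request fails. */
static uint64_t bytes_before_retire(uint64_t error_limit, uint64_t req_bytes)
{
    assert(req_bytes > 0);
    return error_limit * req_bytes;
}
```

      After eight errors the stored counter is 8 << 20. Compared against a
      limit of 8 << 20 the check does not fire, while against a limit of 8
      it does; (8 << 20) requests of 64KB each add up to 512GB.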
      
      Nowadays most SSDs handle internal flash failure automatically by LBA
      address re-indirect mapping. If an I/O error can be observed by upper
      layer code, it is a notable error because the SSD cannot re-indirect
      map the problematic LBA address to an available flash block. This
      situation indicates the whole SSD will fail very soon. Therefore setting
      8 as the default io error limit value makes sense; it is enough for most
      cache devices.
      
      Changelog:
      v2: add reviewed-by from Hannes.
      v1: initial version for review.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: properly set task state in bch_writeback_thread() · 99361bbf
      Coly Li authored
      Kernel thread routine bch_writeback_thread() has the following code block,
      
      447         down_write(&dc->writeback_lock);
      448~450     if (check conditions) {
      451                 up_write(&dc->writeback_lock);
      452                 set_current_state(TASK_INTERRUPTIBLE);
      453
      454                 if (kthread_should_stop())
      455                         return 0;
      456
      457                 schedule();
      458                 continue;
      459         }
      
      If the condition check is true, its task state is set to
      TASK_INTERRUPTIBLE and schedule() is called to wait for others to wake
      it up.
      
      There are 2 issues in the current code:
      1, Task state is set to TASK_INTERRUPTIBLE after the condition checks. If
         another process changes the condition and calls
         wake_up_process(dc->writeback_thread), then at line 452 the task state
         is set back to TASK_INTERRUPTIBLE, and the writeback kernel thread
         will lose a chance to be woken up.
      2, At line 454, if kthread_should_stop() is true, the writeback kernel
         thread will return to kernel/kthread.c:kthread() with
         TASK_INTERRUPTIBLE and call do_exit(). It is not good to enter
         do_exit() with task state TASK_INTERRUPTIBLE: in the following code
         path might_sleep() is called and a warning message is reported by
         __might_sleep(): "WARNING: do not call blocking ops when
         !TASK_RUNNING; state=1 set at [xxxx]".
      
      For the first issue, task state should be set before the condition checks.
      Indeed, because dc->writeback_lock is required when modifying all the
      conditions, calling set_current_state() inside the code block where
      dc->writeback_lock is held is safe. But this is quite implicit, so I
      still move set_current_state() before all the condition checks.
      
      For the second issue, frankly speaking it does not hurt when a kernel
      thread exits with TASK_INTERRUPTIBLE state, but this warning message
      scares users and makes them feel there might be something risky with
      bcache that could hurt their data.  Setting task state to TASK_RUNNING
      before returning fixes this problem.
      
      In alloc.c:allocator_wait() there is a similar issue, which is also
      fixed in this patch.
      
      Changelog:
      v3: merge two similar fixes into one patch
      v2: fix the race issue in v1 patch.
      v1: initial buggy fix.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix high CPU occupancy during journal · c4dc2497
      Tang Junhui authored
      After running small write I/O for a long time, we found that CPU
      occupancy is very high and I/O performance has been reduced by about half:
      
      [root@ceph151 internal]# top
      top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
      Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
      %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
      KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
      KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem
      
        PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
       2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
      14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
       9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
       8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
      12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
      31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
       1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
      32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
      21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
       2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
      16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
      15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
       9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
      11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
       9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
      11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
       6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
      
      The kernel workers consumed a lot of CPU, and I found they were running
      journal work. The journal was reclaiming resources and flushing btree
      nodes with surprising frequency.
      
      Through further analysis, we found that in btree_flush_write(), we try to
      get the btree node with the smallest fifo index to flush by traversing
      all the btree nodes in c->bucket_hash. After we get it, since no lock
      protects it, this btree node may already have been written to the cache
      device by other workers, and if this occurs, we retry the traversal of
      c->bucket_hash to get another btree node. When the problem occurred, the
      number of retries was very high, and we consumed a lot of CPU looking
      for an appropriate btree node.
      
      In this patch, we record the 128 btree nodes with the smallest fifo index
      in a heap, and pop them one by one when we need to flush a btree node.
      This greatly reduces the time the loop takes to find an appropriate
      btree node, and also reduces CPU occupancy.
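
      One common way to keep the N smallest keys seen in a single pass is a
      bounded max-heap. The sketch below shows that general technique on
      plain integers; it is not the bcache heap code, and struct
      keep_smallest, offer() and finish() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KEEP_MAX 128 /* capacity, matching the 128 nodes mentioned above */

/* Bounded max-heap: keys[0] is the largest of the kept (smallest) keys. */
struct keep_smallest {
    unsigned int keys[KEEP_MAX];
    size_t used;
    size_t cap; /* how many keys to keep, at most KEEP_MAX */
};

static void swap_keys(unsigned int *a, unsigned int *b)
{
    unsigned int t = *a;

    *a = *b;
    *b = t;
}

static void sift_up(struct keep_smallest *h, size_t i)
{
    while (i > 0) {
        size_t parent = (i - 1) / 2;

        if (h->keys[parent] >= h->keys[i])
            break;
        swap_keys(&h->keys[parent], &h->keys[i]);
        i = parent;
    }
}

static void sift_down(struct keep_smallest *h, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, big = i;

        if (l < h->used && h->keys[l] > h->keys[big])
            big = l;
        if (r < h->used && h->keys[r] > h->keys[big])
            big = r;
        if (big == i)
            break;
        swap_keys(&h->keys[i], &h->keys[big]);
        i = big;
    }
}

/* Called once per candidate during a single traversal. */
static void offer(struct keep_smallest *h, unsigned int key)
{
    assert(h->cap > 0 && h->cap <= KEEP_MAX);
    if (h->used < h->cap) {
        size_t i = h->used++;

        h->keys[i] = key;
        sift_up(h, i);
    } else if (key < h->keys[0]) {
        h->keys[0] = key; /* evict the largest kept key */
        sift_down(h, 0);
    }
}

static int cmp_uint(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;

    return (x > y) - (x < y);
}

/* Order the kept keys ascending so they can be consumed one by one. */
static void finish(struct keep_smallest *h)
{
    qsort(h->keys, h->used, sizeof(h->keys[0]), cmp_uint);
}
```

      A single traversal then yields up to cap candidates, so a stale
      candidate only costs moving on to the next kept key instead of
      another full traversal.
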
      
      [note by mpl: this triggers a checkpatch error because of adjacent,
      pre-existing style violations]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add journal statistic · a728eacb
      Tang Junhui authored
      Sometimes the journal takes up a lot of CPU, and we need statistics
      to know what the journal is doing. So this patch provides
      some journal statistics:
      1) reclaim: how many times the journal tries to reclaim resources,
         usually because the journal buckets and/or the pins are exhausted.
      2) flush_write: how many times the journal tries to flush a btree node
         to the cache device, usually because the journal buckets are exhausted.
      3) retry_flush_write: how many times the journal retries to flush
         the next btree node, usually because the previous btree node has been
         flushed by another thread.
      We show these statistics through the sysfs interface. With these
      statistics we can see the status of the journal module when CPU usage
      is too high.
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 06 Feb, 2018 6 commits
    • block: Add should_fail_bio() for bpf error injection · 30abb3a6
      Howard McLauchlan authored
      The classic error injection mechanism, should_fail_request(), does not
      support use cases where more information is required (from the entire
      struct bio, for example).
      
      To that end, this patch introduces should_fail_bio(), which calls
      should_fail_request() under the hood but provides a convenient
      place for kprobes to hook into if they require the entire struct bio.
      This patch also replaces some existing calls to should_fail_request()
      with should_fail_bio() with no degradation in performance.
      Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-wbt: account flush requests correctly · 5235553d
      Jens Axboe authored
      Mikulas reported a workload that saw bad performance, and figured
      out that it was due to various other types of requests being
      accounted as reads. Flush requests, for instance. Due to the
      high latency of those, we heavily throttle the writes to keep
      the latencies in balance. But they really should be accounted
      as writes.
      
      Fix this by checking the exact type of the request. If it's a
      read, account as a read, if it's a write or a flush, account
      as a write. Any other request we disregard. Previously everything
      would have been mistakenly accounted as reads.
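
      The accounting rule in the paragraph above can be sketched as a small
      classifier. The enum values and classify_for_wbt() are hypothetical
      names for illustration and are not the blk-wbt API.

```c
#include <assert.h>

/* Hypothetical request types for this sketch. */
enum fake_req_op { FAKE_READ, FAKE_WRITE, FAKE_FLUSH, FAKE_DISCARD };

/* How a request is accounted for latency tracking. */
enum wbt_account { WBT_IGNORE, WBT_AS_READ, WBT_AS_WRITE };

/* Reads count as reads, writes and flushes as writes, the rest is ignored. */
static enum wbt_account classify_for_wbt(enum fake_req_op op)
{
    switch (op) {
    case FAKE_READ:
        return WBT_AS_READ;
    case FAKE_WRITE:
    case FAKE_FLUSH:
        return WBT_AS_WRITE;
    default:
        return WBT_IGNORE;
    }
}
```

      The default branch is the important part: anything that is not
      explicitly a read no longer falls through to read accounting.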
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 68c5735e
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
      
       - videobuf2 was moved to a media/common dir, as it is now used by the
         DVB subsystem too
      
       - Digital TV core memory mapped support interface
      
       - new sensor driver: ov7740
      
       - several improvements to the ddbridge driver
      
       - new V4L2 driver: IPU3 CIO2 CSI-2 receiver unit, found on some Intel
         SoCs
      
       - new tuner driver: tda18250
      
       - finally got rid of all LIRC staging drivers
      
       - as we don't have old lirc drivers anymore, restructure the lirc
         device code
      
       - add support for UVC metadata
      
       - add a new staging driver for NVIDIA Tegra Video Decoder Engine
      
       - DVB kAPI headers moved to include/media
      
       - synchronize the kAPI and uAPI for the DVB subsystem, removing the gap
         for non-legacy APIs
      
       - reduce the kAPI gap for V4L2
      
       - lots of other driver enhancements, cleanups, etc.
      
      * tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (407 commits)
        media: v4l2-compat-ioctl32.c: make ctrl_is_pointer work for subdevs
        media: v4l2-compat-ioctl32.c: refactor compat ioctl32 logic
        media: v4l2-compat-ioctl32.c: don't copy back the result for certain errors
        media: v4l2-compat-ioctl32.c: drop pr_info for unknown buffer type
        media: v4l2-compat-ioctl32.c: copy clip list in put_v4l2_window32
        media: v4l2-compat-ioctl32.c: fix ctrl_is_pointer
        media: v4l2-compat-ioctl32.c: copy m.userptr in put_v4l2_plane32
        media: v4l2-compat-ioctl32.c: avoid sizeof(type)
        media: v4l2-compat-ioctl32.c: move 'helper' functions to __get/put_v4l2_format32
        media: v4l2-compat-ioctl32.c: fix the indentation
        media: v4l2-compat-ioctl32.c: add missing VIDIOC_PREPARE_BUF
        media: v4l2-ioctl.c: don't copy back the result for -ENOTTY
        media: v4l2-ioctl.c: use check_fmt for enum/g/s/try_fmt
        media: vivid: fix module load error when enabling fb and no_error_inj=1
        media: dvb_demux: improve debug messages
        media: dvb_demux: Better handle discontinuity errors
        media: cxusb, dib0700: ignore XC2028_I2C_FLUSH
        media: ts2020: avoid integer overflows on 32 bit machines
        media: i2c: ov7740: use gpio/consumer.h instead of gpio.h
        media: entity: Add a nop variant of media_entity_cleanup
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 2246edfa
      Linus Torvalds authored
      Pull more rdma updates from Doug Ledford:
       "Items of note:
      
         - two patches fix a regression in the 4.15 kernel. The 4.14 kernel
           worked fine with NVMe over Fabrics and mlx5 adapters. That broke in
           4.15. The fix is here.
      
         - one of the patches (the endian notation patch from Lijun) looks
           like a lot of lines of change, but it's mostly mechanical in
           nature. It amounts to the biggest chunk of change in it (it's about
           2/3rds of the overall pull request).
      
        Summary:
      
         - Clean up some function signatures in rxe for clarity
      
         - Tidy the RDMA netlink header to remove unimplemented constants
      
         - bnxt_re driver fixes, one is a regression this window.
      
         - Minor hns driver fixes
      
         - Various fixes from Dan Carpenter and his tool
      
         - Fix IRQ cleanup race in HFI1
      
         - HFI1 performance optimizations and a fix to report counters in the right units
      
         - Fix for an IPoIB startup sequence race with the external manager
      
         - Oops fix for the new kabi path
      
         - Endian cleanups for hns
      
         - Fix for mlx5 related to the new automatic affinity support"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (38 commits)
        net/mlx5: increase async EQ to avoid EQ overrun
        mlx5: fix mlx5_get_vector_affinity to start from completion vector 0
        RDMA/hns: Fix the endian problem for hns
        IB/uverbs: Use the standard kConfig format for experimental
        IB: Update references to libibverbs
        IB/hfi1: Add 16B rcvhdr trace support
        IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
        IB/core: Avoid a potential OOPs for an unused optional parameter
        IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
        IB/ipoib: Fix for potential no-carrier state
        IB/hfi1: Show fault stats in both TX and RX directions
        IB/hfi1: Remove blind constants from 16B update
        IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
        IB/hfi1: Do not override given pcie_pset value
        IB/hfi1: Optimize process_receive_ib()
        IB/hfi1: Remove unnecessary fecn and becn fields
        IB/hfi1: Look up ibport using a pointer in receive path
        IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
        IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
        IB/hfi1: Remove dependence on qp->s_hdrwords
        ...
      2246edfa
    • Linus Torvalds's avatar
      Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 3ff1b28c
      Linus Torvalds authored
      Pull libnvdimm updates from Ross Zwisler:
      
       - Require struct page by default for filesystem DAX to remove a number
         of surprising failure cases. This includes failures with direct I/O,
         gdb and fork(2).
      
       - Add support for the new Platform Capabilities Structure added to the
         NFIT in ACPI 6.2a. This new table tells us whether the platform
         supports flushing of CPU and memory controller caches on unexpected
         power loss events.
      
       - Revamp vmem_altmap and dev_pagemap handling to clean up code and
         better support future PCI P2P uses.
      
       - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
         become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
         spec, and instead rely on the generic ND_CMD_CALL approach used by
         the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
      
       - Enhance nfit_test so we can test some of the new things added in
         version 1.6 of the DSM specification. This includes testing firmware
         download and simulating the Last Shutdown State (LSS) status.
      
      * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
        libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
        acpi, nfit: fix register dimm error handling
        libnvdimm, namespace: make min namespace size 4K
        tools/testing/nvdimm: force nfit_test to depend on instrumented modules
        libnvdimm/nfit_test: adding support for unit testing enable LSS status
        libnvdimm/nfit_test: add firmware download emulation
        nfit-test: Add platform cap support from ACPI 6.2a to test
        libnvdimm: expose platform persistence attribute for nd_region
        acpi: nfit: add persistent memory control flag for nd_region
        acpi: nfit: Add support for detect platform CPU cache flush on power loss
        device-dax: Fix trailing semicolon
        libnvdimm, btt: fix uninitialized err_lock
        dax: require 'struct page' by default for filesystem dax
        ext2: auto disable dax instead of failing mount
        ext4: auto disable dax instead of failing mount
        mm, dax: introduce pfn_t_special()
        mm: Fix devm_memremap_pages() collision handling
        mm: Fix memory size alignment in devm_memremap_pages_release()
        memremap: merge find_dev_pagemap into get_dev_pagemap
        memremap: change devm_memremap_pages interface to use struct dev_pagemap
        ...
      3ff1b28c
    • Linus Torvalds's avatar
      Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 105cf3c8
      Linus Torvalds authored
      Pull PCI updates from Bjorn Helgaas:
      
       - skip AER driver error recovery callbacks for correctable errors
         reported via ACPI APEI, as we already do for errors reported via the
         native path (Tyler Baicar)
      
       - fix DPC shared interrupt handling (Alex Williamson)
      
       - print full DPC interrupt number (Keith Busch)
      
       - enable DPC only if AER is available (Keith Busch)
      
       - simplify DPC code (Bjorn Helgaas)
      
       - calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
         Helgaas)
      
       - enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
         Helgaas)
      
       - move ASPM internal interfaces out of public header (Bjorn Helgaas)
      
       - allow hot-removal of VGA devices (Mika Westerberg)
      
       - speed up unplug and shutdown by assuming Thunderbolt controllers
         don't support Command Completed events (Lukas Wunner)
      
       - add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
         Jay Cornwall)
      
       - expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)
      
       - clean up PCI DMA interface usage (Christoph Hellwig)
      
       - remove PCI pool API (replaced with DMA pool) (Romain Perier)
      
       - deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
         Kaya)
      
       - move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)
      
       - add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)
      
       - remove warnings on sysfs mmap failure (Bjorn Helgaas)
      
       - quiet ROM validation messages (Alex Deucher)
      
       - remove redundant memory alloc failure messages (Markus Elfring)
      
       - fill in types for compile-time VGA and other I/O port resources
         (Bjorn Helgaas)
      
       - make "pci=pcie_scan_all" work for Root Ports as well as Downstream
         Ports to help AmigaOne X1000 (Bjorn Helgaas)
      
       - add SPDX tags to all PCI files (Bjorn Helgaas)
      
       - quirk Marvell 9128 DMA aliases (Alex Williamson)
      
       - quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)
      
       - fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
         Cassel)
      
       - use DMA API to get MSI address for DesignWare IP (Niklas Cassel)
      
       - fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)
      
       - fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)
      
       - add support for ARTPEC-7 SoC (Niklas Cassel)
      
       - add endpoint-mode support for ARTPEC (Niklas Cassel)
      
       - add Cadence PCIe host and endpoint controller driver (Cyrille
         Pitchen)
      
       - handle multiple INTx status bits being set in dra7xx (Vignesh R)
      
       - translate dra7xx hwirq range to fix INTD handling (Vignesh R)
      
       - remove deprecated Exynos PHY initialization code (Jaehoon Chung)
      
       - fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)
      
       - fix NULL pointer dereference in iProc BCMA driver (Ray Jui)
      
       - fix Keystone interrupt-controller-node lookup (Johan Hovold)
      
       - constify qcom driver structures (Julia Lawall)
      
       - rework Tegra config space mapping to increase space available for
         endpoints (Vidya Sagar)
      
       - simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)
      
       - remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)
      
       - add support for Global Fabric Manager Server (GFMS) event to
         Microsemi Switchtec switch driver (Logan Gunthorpe)
      
       - add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)
      
      * tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
        PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
        dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
        PCI: endpoint: Fix EPF device name to support multi-function devices
        PCI: endpoint: Add the function number as argument to EPC ops
        PCI: cadence: Add host driver for Cadence PCIe controller
        dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
        PCI: Add vendor ID for Cadence
        PCI: Add generic function to probe PCI host controllers
        PCI: generic: fix missing call of pci_free_resource_list()
        PCI: OF: Add generic function to parse and allocate PCI resources
        PCI: Regroup all PCI related entries into drivers/pci/Makefile
        PCI/DPC: Reformat DPC register definitions
        PCI/DPC: Add and use DPC Status register field definitions
        PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
        PCI/DPC: Remove unnecessary RP PIO register structs
        PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
        PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
        PCI/DPC: Make RP PIO log size check more generic
        PCI/DPC: Rename local "status" to "dpc_status"
        PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
        ...
      105cf3c8
  7. 05 Feb, 2018 15 commits
  8. 04 Feb, 2018 2 commits
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 35277995
      Linus Torvalds authored
      Pull spectre/meltdown updates from Thomas Gleixner:
       "The next round of updates related to melted spectrum:
      
         - The initial set of spectre V1 mitigations:
      
             - Array index speculation blocker and its usage for syscall,
               fdtable and the nl80211 driver.
      
             - Speculation barrier and its usage in user access functions
      
         - Make indirect calls in KVM speculation safe
      
         - Blacklisting of known to be broken microcodes so IBPB/IBRS are not
           touched.
      
         - The initial IBPB support and its usage in context switch
      
         - The exposure of the new speculation MSRs to KVM guests.
      
         - A fix for a regression in x86/32 related to the cpu entry area
      
         - Proper whitelisting for known to be safe CPUs from the mitigations.
      
         - objtool fixes to deal properly with retpolines and alternatives
      
         - Exclude __init functions from retpolines which speeds up the boot
           process.
      
         - Removal of the syscall64 fast path and related cleanups and
           simplifications
      
         - Removal of the unpatched paravirt mode which is yet another source
           of indirect unprotected calls.
      
         - A new and undisputed version of the module mismatch warning
      
         - A couple of cleanup and correctness fixes all over the place
      
        Yet another step towards full mitigation. There are a few things still
        missing like the RSB underflow mitigation for Skylake and other small
        details, but that's being worked on.
      
        That said, I'm taking a belated Christmas vacation for a week and hope
        that everything is magically solved when I'm back on Feb 12th"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
        KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
        KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
        KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
        KVM/x86: Add IBPB support
        KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
        x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
        x86/pti: Mark constant arrays as __initconst
        x86/spectre: Simplify spectre_v2 command line parsing
        x86/retpoline: Avoid retpolines for built-in __init functions
        x86/kvm: Update spectre-v1 mitigation
        KVM: VMX: make MSR bitmaps per-VCPU
        x86/paravirt: Remove 'noreplace-paravirt' cmdline option
        x86/speculation: Use Indirect Branch Prediction Barrier in context switch
        x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
        x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
        x86/spectre: Report get_user mitigation for spectre_v1
        nl80211: Sanitize array index in parse_txq_params
        vfs, fdtable: Prevent bounds-check bypass via speculative execution
        x86/syscall: Sanitize syscall table de-references under speculation
        x86/get_user: Use pointer masking to limit speculation
        ...
      35277995
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0a646e9c
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A small set of changes:
      
         - a fixup for kexec related to 5-level paging mode. That covers most
           of the cases except kexec from a 5-level kernel to a 4-level
           kernel. The latter needs more work and is going to come in 4.17
      
         - two trivial fixes for build warnings triggered by LTO and gcc-8"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/power: Fix swsusp_arch_resume prototype
        x86/dumpstack: Avoid uninitlized variable
        x86/kexec: Make kexec (mostly) work in 5-level paging mode
      0a646e9c