1. 13 Oct, 2017 12 commits
  2. 12 Oct, 2017 2 commits
  3. 10 Oct, 2017 5 commits
  4. 09 Oct, 2017 3 commits
    • block, bfq: fix unbalanced decrements of burst size · 99fead8d
      Paolo Valente authored
      The commit "block, bfq: decrease burst size when queues in burst
      exit" introduced the decrement of burst_size on the removal of a
      bfq_queue from the burst list. Unfortunately, this decrement can
      happen to be performed even when burst size is already equal to 0,
      because of unbalanced decrements. What follows describes the cause
      of these unbalanced decrements, namely a wrong assumption, and the
      way in which this wrong assumption leads to them.
      
      The wrong assumption is that a bfq_queue can exit only if the process
      associated with the bfq_queue has exited. This is false, because a
      bfq_queue, say Q, may exit also as a consequence of a merge with
      another bfq_queue. In this case, Q exits because the I/O of its
      associated process has been redirected to another bfq_queue.
      
      The decrement unbalance occurs because Q may then be re-created after
      a split, and added back to the current burst list, *without*
      incrementing burst_size. burst_size is not incremented because Q is
      not a new bfq_queue added to the burst list, but a bfq_queue only
      temporarily removed from the list, and, before the commit "bfq-sq,
      bfq-mq: decrease burst size when queues in burst exit", burst_size was
      not decremented when Q was removed.
      
      This commit addresses this issue by just checking whether the exiting
      bfq_queue is a merged bfq_queue, and, in that case, not decrementing
      burst_size. Unfortunately, this still leaves room for unbalanced
      decrements, in the following rarer case: on a split, the bfq_queue
      happens to be inserted into a different burst list from the one it
      was removed from when merged. If this happens, the number of elements in
      the new burst list becomes higher than burst_size (by one). When the
      bfq_queue then exits, it is of course not in a merged state any
      longer, thus burst_size is decremented, which results in an unbalanced
      decrement.  To handle this sporadic, unlucky case in a simple way,
      this commit also checks that burst_size is larger than 0 before
      decrementing it.
      
      Finally, this commit removes a useless, extra check: the check that
      the bfq_queue is sync, performed before checking whether the bfq_queue
      is in the burst list. This extra check is redundant, because only sync
      bfq_queues can be inserted into the burst list.
      
      Fixes: 7cb04004 ("block, bfq: decrease burst size when queues in burst exit")
      Reported-by: Philip Müller <philm@manjaro.org>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Philip Müller <philm@manjaro.org>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
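The two guards described in commit 99fead8d can be sketched as follows. This is a minimal, self-contained model and not the kernel code: the struct layouts, the field names and the function queue_exit_burst() are all hypothetical and only mirror the behaviour the message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the scheduler's per-device and
 * per-queue state; the field names are illustrative only. */
struct sched_data {
	int burst_size;        /* number of queues counted in the current burst */
};

struct queue {
	bool in_burst_list;    /* queue is linked into the burst list */
	bool merged;           /* queue is exiting because of a merge */
};

/* Remove an exiting queue from the burst list, decrementing burst_size
 * only when that cannot unbalance the counter. */
void queue_exit_burst(struct sched_data *sd, struct queue *q)
{
	if (!q->in_burst_list)
		return;

	q->in_burst_list = false;

	/* A merged queue may come back after a split without a matching
	 * increment, so its removal must not be counted. */
	if (q->merged)
		return;

	/* Guard for the rarer case in which the queue was re-inserted into
	 * a different burst list from the one it left. */
	if (sd->burst_size > 0)
		sd->burst_size--;

	assert(sd->burst_size >= 0);
}
```

With both guards in place, an exit can never push the counter below zero, whatever sequence of merges and splits preceded it.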
    • block,bfq: Disable writeback throttling · b5dc5d4d
      Luca Miccio authored
      Similarly to CFQ, BFQ has its write-throttling heuristics, and it
      is better not to combine them with further write-throttling
      heuristics of a different nature.
      So this commit disables write-back throttling for a device if BFQ
      is used as I/O scheduler for that device.
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • writeback: schedule periodic writeback with sysctl · 94af5846
      Yafang Shao authored
      After periodic writeback is disabled by writing 0 to
      dirty_writeback_centisecs, the handler wb_workfn() will not be
      entered again until the dirty background limit is reached, a
      sync syscall is executed, not enough free memory is available, or
      vmscan is triggered.
      
      So periodic writeback can't be re-enabled by writing a non-zero
      value to dirty_writeback_centisecs.
      As it can be disabled by sysctl, it should be possible to enable it
      by sysctl as well.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
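The behaviour that commit 94af5846 asks for can be modelled in a few lines. The sketch below is an assumption-laden toy, not the kernel's sysctl handler: struct periodic_wb, its fields and set_writeback_interval() are invented names, and "kicking the flusher" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: an interval of 0 means periodic writeback is off. */
struct periodic_wb {
	unsigned int interval_cs;  /* mirrors dirty_writeback_centisecs */
	bool timer_armed;          /* periodic work is currently scheduled */
};

/* Hypothetical sysctl write handler.
 * Returns true if the write had to kick the flusher to (re)arm the timer. */
bool set_writeback_interval(struct periodic_wb *wb, unsigned int new_cs)
{
	unsigned int old_cs;

	assert(wb);
	old_cs = wb->interval_cs;
	wb->interval_cs = new_cs;

	if (new_cs == 0) {
		/* Model simplification: the worker stops re-arming itself
		 * once it sees a zero interval. */
		wb->timer_armed = false;
		return false;
	}

	/* Any change to a non-zero value wakes the flusher, which then
	 * re-arms the periodic timer with the new interval. */
	if (new_cs != old_cs) {
		wb->timer_armed = true;
		return true;
	}
	return false;
}
```

Without the wake-up in the last branch, a write of 0 followed by a write of a non-zero value would leave the timer disarmed, which is the problem the message describes.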
  5. 06 Oct, 2017 3 commits
  6. 05 Oct, 2017 1 commit
  7. 04 Oct, 2017 4 commits
    • sysctl: remove /proc/sys/vm/nr_pdflush_threads · b35bd0d9
      Jens Axboe authored
      This tunable has been obsolete since 2.6.32, and writes to the
      file have been failing and complaining in dmesg since then:
      
      nr_pdflush_threads exported in /proc is scheduled for removal
      
      That was 8 years ago. Remove the file ABI obsolete notice, and
      the sysfs file.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • writeback: eliminate work item allocation in bd_start_writeback() · 85009b4f
      Jens Axboe authored
      Handle start-all writeback like we do periodic or kupdate
      style writeback - by marking the bdi_writeback as needing a full
      flush, and simply waking the thread. This eliminates the need to
      allocate and queue a specific work item just for this purpose.
      
      After this change, we truly only ever have one of them running at
      any point in time. We mark the need to start all flushes, and the
      writeback thread will clear it once it has processed the request.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: document the need to have STARTED and COMPLETED share a byte · fc13457f
      Jens Axboe authored
      For memory ordering guarantees on stores, we need to ensure that
      these two bits share the same byte of storage in the unsigned
      long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
      we don't violate this requirement.
      Suggested-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
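A compile-time check of the kind commit fc13457f describes can be sketched in standard C. The bit numbers and the names ATOM_COMPLETE, ATOM_STARTED and BYTE_OF_BIT below are assumptions made for illustration, and C11 static_assert stands in for the kernel's BUILD_BUG_ON().

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */

/* Hypothetical bit numbers inside an unsigned long flags word. */
enum {
	ATOM_COMPLETE = 0,
	ATOM_STARTED  = 1,
};

/* Index of the byte that holds a given bit number. */
#define BYTE_OF_BIT(nr) ((nr) / CHAR_BIT)

/* The build fails if the two flag bits ever end up in different bytes,
 * for example after someone inserts new flags between them. */
static_assert(BYTE_OF_BIT(ATOM_STARTED) == BYTE_OF_BIT(ATOM_COMPLETE),
	      "STARTED and COMPLETE must share a byte");
```

Putting the requirement in a compile-time assertion means a later reordering of the flag bits is caught at build time rather than showing up as a rare ordering bug.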
    • blk-mq: attempt to fix atomic flag memory ordering · a7af0af3
      Peter Zijlstra authored
      Attempt to untangle the ordering in blk-mq. The patch introducing the
      single smp_mb__before_atomic() is obviously broken in that it doesn't
      clearly specify a pairing barrier and an obtained guarantee.
      
      The comment is further misleading in that it hints that the
      deadline store and the COMPLETE store also need to be ordered, but
      AFAICT there is no such dependency. However what does appear to be
      important is the clear happening _after_ the store, and that worked by
      pure accident.
      
      This clarifies blk_mq_start_request() -- we should not get there with
      STARTING set -- this simplifies the code and makes the barrier usage
      sane (the old code could be read to allow not having _any_ atomic after
      the barrier, in which case the barrier hasn't got anything to order). We
      then also introduce the missing pairing barrier for it.
      
      Also downgrade the barrier to smp_wmb(); this is cheaper for
      PowerPC/ARM and doesn't cost anything extra on x86.
      
      And it documents the STARTING vs COMPLETE ordering. Although I've not
      been entirely successful in reverse engineering the blk-mq state
      machine so there might still be more funnies around timeout vs
      requeue.
      
      If I got anything wrong, feel free to educate me by adding comments to
      clarify things ;-)
      
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Fixes: 538b7534 ("blk-mq: request deadline must be visible before marking rq as started")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
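The idea of a write barrier with an explicit pairing barrier on the reader side, as discussed in commit a7af0af3, can be illustrated in portable C11. Note the substitution: this sketch uses C11 release and acquire fences, which are not identical to the kernel's smp_wmb() and smp_rmb() (a release fence also orders earlier loads). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request with a plain payload field and an atomic flag. */
struct request_model {
	unsigned long deadline;
	atomic_bool started;
};

/* Writer: publish the deadline, then mark the request as started. */
void start_request(struct request_model *rq, unsigned long deadline)
{
	assert(rq);
	rq->deadline = deadline;
	/* Plays the role of the write barrier: the deadline store is
	 * ordered before the flag store below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&rq->started, true, memory_order_relaxed);
}

/* Reader: only trust the deadline after observing the flag. */
bool read_deadline(struct request_model *rq, unsigned long *out)
{
	assert(rq && out);
	if (!atomic_load_explicit(&rq->started, memory_order_relaxed))
		return false;
	/* Pairing barrier: the flag load is ordered before the
	 * deadline load below. */
	atomic_thread_fence(memory_order_acquire);
	*out = rq->deadline;
	return true;
}
```

A barrier on one side alone guarantees nothing to the other side; the guarantee obtained here (a reader that sees the flag also sees the deadline) only exists because the two fences pair up.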
  8. 03 Oct, 2017 10 commits
    • block: move __elv_next_request to blk-core.c · 9c988374
      Christoph Hellwig authored
      No need to have this helper inline in a header.  Also drop the __ prefix.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: decrease burst size when queues in burst exit · 7cb04004
      Paolo Valente authored
      If many queues belonging to the same group happen to be created
      shortly after each other, then the concurrent processes associated
      with these queues typically have a common goal, and they get it done
      as soon as possible if not hampered by device idling.  Examples are
      processes spawned by git grep, or by systemd during boot. As for
      device idling, this mechanism is currently necessary for weight
      raising to succeed in its goal: privileging I/O.  In view of these
      facts, BFQ does not provide the above queues with either weight
      raising or device idling.
      
      On the other hand, a burst of queue creations may be caused also by
      the start-up of a complex application. In this case, these queues
      usually need to be served one after the other, and as quickly as possible,
      to maximise responsiveness. Therefore, in this case the best strategy
      is to weight-raise all the queues created during the burst, i.e., the
      exact opposite of the strategy for the above case.
      
      To distinguish between the two cases, BFQ uses an empirical burst-size
      threshold, found through extensive tests and monitoring of daily
      usage. Only large bursts, i.e., bursts with a size above this
      threshold, are considered as generated by a high number of parallel
      processes. In this respect, upstart-based boot proved to be rather
      hard to detect as generating a large burst of queue creations, because
      with upstart most of the queues created in a burst exit *before* the
      next queues in the same burst are created. To address this issue, I
      changed the burst-detection mechanism so as to not decrease the size
      of the current burst even if one of the queues in the burst is
      eliminated.
      
      Unfortunately, this missing decrease causes false positives on very
      fast systems: on the start-up of a complex application, such as
      libreoffice writer, so many queues are created, served and exited
      shortly after each other, that a large burst of queue creations is
      wrongly detected as occurring. These false positives just disappear if
      the size of a burst is decreased when one of the queues in the burst
      exits. This commit restores the missing burst-size decrease, relying
      on the fact that upstart is apparently unlikely to be used on systems
      running this and future versions of the kernel.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
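The effect of decrementing the burst size on queue exit (commit 7cb04004) can be seen in a toy model. Everything below is hypothetical: the threshold value, the struct and the function names are placeholders, since the message only says the real threshold was found empirically.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold; the real value is an empirical tunable. */
#define LARGE_BURST_THRESH 8

struct burst_model {
	int burst_size;     /* queues currently counted in the burst */
	bool large_burst;   /* burst classified as many parallel processes */
};

/* A new queue is created during the current burst. */
void burst_add_queue(struct burst_model *b)
{
	b->burst_size++;
	if (b->burst_size > LARGE_BURST_THRESH)
		b->large_burst = true;
}

/* A queue in the burst exits: give its slot back.
 * This sketch assumes every exit is matched by an earlier add. */
void burst_queue_exit(struct burst_model *b)
{
	assert(b->burst_size > 0);
	b->burst_size--;
}
```

On a fast system where each queue is created, served and exited before the next one appears, the counter in this model never climbs above 1, so the burst is not flagged as large. Without the decrement, the same workload would cross the threshold after a handful of queues.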
    • block, bfq: let early-merged queues be weight-raised on split too · 894df937
      Paolo Valente authored
      A just-created bfq_queue, say Q, may happen to be merged with another
      bfq_queue on the very first invocation of the function
      __bfq_insert_request. In such a case, even if Q would clearly deserve
      interactive weight raising (as it has just been created), the function
      bfq_add_request does not get invoked for Q, and thus does not
      activate weight raising for Q. As a consequence, when the state of Q
      is saved for a possible future restore, after a split of Q from the
      other bfq_queue(s), such a state happens to be (unjustly)
      non-weight-raised. Then the bfq_queue will not enjoy any weight
      raising on the split, even if it should still be in an interactive
      weight-raising period when the split occurs.
      
      This commit solves this problem as follows, for a just-created
      bfq_queue that is being early-merged: it stores directly, in the saved
      state of the bfq_queue, the weight-raising state that would have been
      assigned to the bfq_queue if not early-merged.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: check and switch back to interactive wr also on queue split · 3e2bdd6d
      Paolo Valente authored
      As already explained in the message of commit "block, bfq: fix
      wrong init of saved start time for weight raising", if a soft
      real-time weight-raising period happens to be nested in a larger
      interactive weight-raising period, then BFQ restores the interactive
      weight raising at the end of the soft real-time weight raising. In
      particular, BFQ checks whether the latter has ended only on request
      dispatches.
      
      Unfortunately, the above scheme fails to restore interactive weight
      raising in the following corner case: if a bfq_queue, say Q,
      1) Is merged with another bfq_queue while it is in a nested soft
      real-time weight-raising period. The weight-raising state of Q is
      then saved, and not considered any longer until a split occurs.
      2) Is split from the other bfq_queue(s) at a time instant when its
      soft real-time weight raising is already finished.
      On the split, while resuming the previous, soft real-time
      weight-raised state of the bfq_queue Q, BFQ checks whether the
      current soft real-time weight-raising period is actually over. If so,
      BFQ switches weight raising off for Q, *without* checking whether the
      soft real-time period was actually nested in a not-yet-finished
      interactive weight-raising period.
      
      This commit addresses this issue by adding the above missing check in
      bfq_queue splits, and restoring interactive weight raising if needed.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
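The decision taken on a split in commit 3e2bdd6d can be written down as a small state function. This is a sketch under simplifying assumptions: times are plain tick counts that never wrap, the enclosing interactive period is represented by its end time (0 if there is none), and all names are invented for illustration.

```c
#include <assert.h>

enum wr_state { WR_NONE, WR_INTERACTIVE, WR_SOFT_RT };

/* Hypothetical saved state of a queue that was merged while in a
 * soft real-time weight-raising period. */
struct saved_wr {
	enum wr_state state;
	unsigned long srt_end;          /* end of the soft real-time period */
	unsigned long interactive_end;  /* end of the enclosing interactive
	                                 * period, 0 if there is none */
};

/* Weight-raising state to resume with when the queue is split at 'now'. */
enum wr_state resume_on_split(const struct saved_wr *s, unsigned long now)
{
	assert(s);
	if (s->state != WR_SOFT_RT)
		return s->state;        /* other cases are out of scope here */
	if (now < s->srt_end)
		return WR_SOFT_RT;      /* soft real-time period still running */
	if (now < s->interactive_end)
		return WR_INTERACTIVE;  /* the check that used to be missing */
	return WR_NONE;
}
```

The third branch is the point of the commit: before it, an expired soft real-time period led straight to WR_NONE, even when the enclosing interactive period had not finished.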
    • block, bfq: fix wrong init of saved start time for weight raising · 4baa8bb1
      Paolo Valente authored
      This commit fixes a bug that causes bfq to fail to guarantee a high
      responsiveness on some drives, if there is heavy random read+write I/O
      in the background. More precisely, such a failure allowed this bug to
      be found [1], but the bug may well cause other yet unreported
      anomalies.
      
      BFQ raises the weight of the bfq_queues associated with soft real-time
      applications, to privilege the I/O, and thus reduce latency, for these
      applications. This mechanism is named soft-real-time weight raising in
      BFQ. A soft real-time period may happen to be nested into an
      interactive weight raising period, i.e., it may happen that, when a
      bfq_queue switches to a soft real-time weight-raised state, the
      bfq_queue is already being weight-raised because deemed interactive
      too. In this case, BFQ saves in a special variable
      wr_start_at_switch_to_srt, the time instant when the interactive
      weight-raising period started for the bfq_queue, i.e., the time
      instant when BFQ started to deem the bfq_queue interactive. This value
      is then used to check whether the interactive weight-raising period
      would still be in progress when the soft real-time weight-raising
      period ends.  If so, interactive weight raising is restored for the
      bfq_queue. This restore is useful, in particular, because it prevents
      bfq_queues from losing their interactive weight raising prematurely,
      as a consequence of spurious, short-lived soft real-time
      weight-raising periods caused by wrong detections as soft real-time.
      
      If, instead, a bfq_queue switches to soft-real-time weight raising
      while it *is not* already in an interactive weight-raising period,
      then the variable wr_start_at_switch_to_srt has no meaning during the
      following soft real-time weight-raising period. Unfortunately the
      handling of this case is wrong in BFQ: not only is the variable not
      flagged somehow as meaningless, but it is also set to the time when
      the switch to soft real-time weight-raising occurs. This may cause an
      interactive weight-raising period to be considered mistakenly as still
      in progress, and thus a spurious interactive weight-raising period to
      start for the bfq_queue, at the end of the soft-real-time
      weight-raising period. In particular the spurious interactive
      weight-raising period will be considered as still in progress, if the
      soft-real-time weight-raising period does not last very long. The
      bfq_queue will then be wrongly privileged and, if I/O bound, will
      unjustly steal bandwidth from truly interactive or soft real-time
      bfq_queues, harming responsiveness and low latency.
      
      This commit fixes this issue by just setting wr_start_at_switch_to_srt
      to minus infinity (farthest past time instant according to jiffies
      macros): when the soft-real-time weight-raising period ends, certainly
      no interactive weight-raising period will be considered as still in
      progress.
      
      [1] Background I/O Type: Random - Background I/O mix: Reads and writes
      - Application to start: LibreOffice Writer in
      http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop
      
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
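The "minus infinity" trick from commit 4baa8bb1 relies on wrap-safe tick comparisons. The sketch below shows one way this can work; it is not the BFQ code. The helper names are hypothetical, the comparison follows the style of the kernel's time_after(), and half of LONG_MAX is an assumed offset chosen to leave headroom for adding a duration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "a is after b" for free-running tick counters. Relies on
 * the usual two's-complement conversion from unsigned long to long. */
bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical "minus infinity": a start time so far in the past that
 * no realistic period beginning there can still be running. */
unsigned long farthest_past(unsigned long now)
{
	return now - (unsigned long)(LONG_MAX / 2);
}

/* Would an interactive period that started at 'start' and lasts
 * 'duration' ticks still be in progress at 'now'? */
bool interactive_still_running(unsigned long start, unsigned long duration,
			       unsigned long now)
{
	return tick_after(start + duration, now);
}
```

If the saved start time is instead set to the moment of the switch to soft real-time, a short soft real-time period ends well inside `start + duration`, and the check wrongly reports an interactive period in progress. With the sentinel, the check fails for any duration shorter than the offset.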
    • writeback: only allow one inflight and pending full flush · aac8d41c
      Jens Axboe authored
      When someone calls wakeup_flusher_threads() or
      wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
      pages in the system (or on that bdi). If we are tight on memory, we
      can get tons of these queued from kswapd/vmscan. This causes (at
      least) two problems:
      
      1) We consume a ton of memory just allocating writeback work items.
         We've seen as much as 600 million of these writeback work items
         pending. That's a lot of memory to pointlessly hold hostage,
         while the box is under memory pressure.
      
      2) We spend so much time processing these work items, that we
         introduce a softlockup in writeback processing. This is because
         each of the writeback work items doesn't end up doing any work (it's
         hard when you have millions of identical ones coming in to the
         flush machinery), so we just sit in a tight loop pulling work
         items and deleting/freeing them.
      
      Fix this by adding a 'start_all' bit to the writeback structure, and
      set that when someone attempts to flush all dirty pages. The bit is
      cleared when we start writeback on that work item. If the bit is
      already set when we attempt to queue !nr_pages writeback, then we
      simply ignore it.
      
      This provides us one full flush in flight, with one pending as well,
      and makes for more efficient handling of this type of writeback.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Chris Mason <clm@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
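The coalescing behaviour of the 'start_all' bit in commit aac8d41c can be sketched with a single atomic flag. This is an illustrative model only: struct flush_model and both functions are hypothetical, and a C11 atomic exchange stands in for whatever bit operations the kernel uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device writeback state. */
struct flush_model {
	atomic_bool start_all;       /* a full flush has been requested */
	unsigned long full_flushes;  /* full flushes actually started */
};

/* Caller side: ask for a full flush of all dirty pages.
 * Returns true if this call newly armed the request, false if an
 * identical request was already pending and this one was dropped. */
bool request_full_flush(struct flush_model *wb)
{
	assert(wb);
	return !atomic_exchange(&wb->start_all, true);
}

/* Worker side: clear the bit when starting on the request, so that a
 * new request can be queued while this flush is in flight. */
bool worker_start_flush(struct flush_model *wb)
{
	assert(wb);
	if (!atomic_exchange(&wb->start_all, false))
		return false;        /* nothing pending */
	wb->full_flushes++;
	return true;
}
```

However many callers ask for a full flush, at most one request is pending at any time, and no memory is allocated per request. Because the bit is cleared when the worker starts rather than when it finishes, one further request can be queued behind the flush in flight.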
    • writeback: move nr_pages == 0 logic to one location · e8e8a0c6
      Jens Axboe authored
      Now that we have no external callers of wb_start_writeback(), we
      can shuffle the passing in of 'nr_pages'. Everybody passes in 0
      at this point, so just kill the argument and move the dirty
      count retrieval to that function.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Chris Mason <clm@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • writeback: make wb_start_writeback() static · 9dfb176f
      Jens Axboe authored
      We don't have any callers outside of fs-writeback.c anymore,
      make it private.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Chris Mason <clm@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • writeback: pass in '0' for nr_pages writeback in laptop mode · 0ab29fd0
      Jens Axboe authored
      Laptop mode really wants to write back the number of dirty
      pages and inodes. Instead of calculating this in the caller,
      just pass in 0 and let wakeup_flusher_threads() handle it.
      
      Use the new wakeup_flusher_threads_bdi() instead of rolling
      our own.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Chris Mason <clm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • writeback: provide a wakeup_flusher_threads_bdi() · 595043e5
      Jens Axboe authored
      Similar to wakeup_flusher_threads(), except that we only wake
      up the flusher threads on the specified backing device.
      
      No functional changes in this patch.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Chris Mason <clm@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>