1. 01 Aug, 2012 3 commits
    • Vivek Goyal's avatar
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal authored
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into sequence counter because
      One might extend a partition while IO is happening to it and update of
      nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
      can lead to issues like reading inconsistent size of a partition. Sequence
      counter have been used so that readers don't have to take bdev mutex lock
      as we call sector_in_part() very frequently.
      
      Now all the access to hd_struct->nr_sects should happen using sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoritically race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarPhillip Susi <psusi@ubuntu.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c83f6bf9
    • Olof Johansson's avatar
      block: uninitialized ioc->nr_tasks triggers WARN_ON · 4638a83e
      Olof Johansson authored
      Hi,
      
      I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
      below warning as of 3.5-rc1 and later (3.4 is fine):
      
      [   10.886893] ------------[ cut here ]------------
      [   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
      [   10.886905] Hardware name: Bochs
      [   10.886906] Modules linked in:
      [   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
      [   10.886908] Call Trace:
      [   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
      [   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
      [   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
      [   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
      [   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
      [   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
      [   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
      [   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
      [   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
      [   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
      [   10.886928] ---[ end trace 32a14af7ee6a590b ]---
      
      Reproducing is easy, I can hit it on a KVM system with a very basic
      config (x86_64 make defconfig + enable the drivers needed). To hit it,
      just install dump (on debian/ubuntu, not sure what the package might be
      called on Fedora), and:
      
      dump -o -f /tmp/foo /
      
      You'll see the warning in dmesg once it forks off the I/O process and
      starts dumping filesystem contents.
      
      I bisected it down to the following commit:
      
      commit f6e8d01b
      Author: Tejun Heo <tj@kernel.org>
      Date:   Mon Mar 5 13:15:26 2012 -0800
      
          block: add io_context->active_ref
      
          Currently ioc->nr_tasks is used to decide two things - whether an ioc
          is done issuing IOs and whether it's shared by multiple tasks.  This
          patch separate out the first into ioc->active_ref, which is acquired
          and released using {get|put}_io_context_active() respectively.
      
          This will be used to associate bio's with a given task.  This patch
          doesn't introduce any visible behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
          Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      
      It seems like the init of ioc->nr_tasks was removed in that patch,
      so it starts out at 0 instead of 1.
      
      Tejun, is the right thing here to add back the init, or should something else
      be done?
      
      The below patch removes the warning, but I haven't done any more extensive
      testing on it.
      Signed-off-by: default avatarOlof Johansson <olof@lixom.net>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4638a83e
    • Mike Snitzer's avatar
      block: do not artificially constrain max_sectors for stacking drivers · fe86cdce
      Mike Snitzer authored
      blk_set_stacking_limits is intended to allow stacking drivers to build
      up the limits of the stacked device based on the underlying devices'
      limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
      doesn't allow the stacking driver to inherit a max_sectors larger than
      1024 -- due to blk_stack_limits' use of min_not_zero.
      
      It is now clear that this artificial limit is getting in the way so
      change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
      stacking drivers like dm-multipath to inherit 'max_sectors' from the
      underlying paths).
      Reported-by: default avatarVijay Chauhan <vijay.chauhan@netapp.com>
      Tested-by: default avatarVijay Chauhan <vijay.chauhan@netapp.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fe86cdce
  2. 26 Jun, 2012 1 commit
    • Tejun Heo's avatar
      blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo authored
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, put a program which can issue a lot of random direct IOs there
      and running a sequential IO from a different cgroup.  As soon as the
      request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit harier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description udpated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accomodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a051661c
  3. 25 Jun, 2012 9 commits
    • Tejun Heo's avatar
      block: prepare for multiple request_lists · 5b788ce3
      Tejun Heo authored
      Request allocation is about to be made per-blkg meaning that there'll
      be multiple request lists.
      
      * Make queue full state per request_list.  blk_*queue_full() functions
        are renamed to blk_*rl_full() and takes @rl instead of @q.
      
      * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
        instead of @q.  Also add @gfp_mask parameter.
      
      * Add blk_exit_rl() instead of destroying rl directly from
        blk_release_queue().
      
      * Add request_list->q and make request alloc/free functions -
        blk_free_request(), [__]freed_request(), __get_request() - take @rl
        instead of @q.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5b788ce3
    • Tejun Heo's avatar
      block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv · 8a5ecdd4
      Tejun Heo authored
      Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
      move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
      to use q->nr_rqs[] instead of q->rq.count[].
      
      These counters separates queue-wide request statistics from the
      request list and allow implementation of per-queue request allocation.
      
      While at it, properly indent fields of struct request_list.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8a5ecdd4
    • Tejun Heo's avatar
      blkcg: inline bio_blkcg() and friends · b1208b56
      Tejun Heo authored
      Make bio_blkcg() and friends inline.  They all are very simple and
      used only in few places.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b1208b56
    • Tejun Heo's avatar
      block: allocate io_context upfront · 7f4b35d1
      Tejun Heo authored
      Block layer very lazy allocation of ioc.  It waits until the moment
      ioc is absolutely necessary; unfortunately, that time could be inside
      queue lock and __get_request() performs unlock - try alloc - retry
      dancing.
      
      Just allocate it up-front on entry to block layer.  We're not saving
      the rain forest by deferring it to the last possible moment and
      complicating things unnecessarily.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7f4b35d1
    • Tejun Heo's avatar
      block: refactor get_request[_wait]() · a06e05e6
      Tejun Heo authored
      Currently, there are two request allocation functions - get_request()
      and get_request_wait().  The former tries to allocate a request once
      and the latter keeps retrying until it succeeds.  The latter wraps the
      former and keeps retrying until allocation succeeds.
      
      The combination of two functions deliver fallible non-wait allocation,
      fallible wait allocation and unfailing wait allocation.  However,
      given that forward progress is guaranteed, fallible wait allocation
      isn't all that useful and in fact nobody uses it.
      
      This patch simplifies the interface as follows.
      
      * get_request() is renamed to __get_request() and is only used by the
        wrapper function.
      
      * get_request_wait() is renamed to get_request().  It now takes
        @gfp_mask and retries iff it contains %__GFP_WAIT.
      
      This patch doesn't introduce any functional change and is to prepare
      for further updates to request allocation path.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a06e05e6
    • Tejun Heo's avatar
      block: drop custom queue draining used by scsi_transport_{iscsi|fc} · 86072d81
      Tejun Heo authored
      iscsi_remove_host() uses bsg_remove_queue() which implements custom
      queue draining.  fc_bsg_remove() open-codes mostly identical logic.
      
      The draining logic isn't correct in that blk_stop_queue() doesn't
      prevent new requests from being queued - it just stops processing, so
      nothing prevents new requests to be queued after the logic determines
      that the queue is drained.
      
      blk_cleanup_queue() now implements proper queue draining and these
      custom draining logics aren't necessary.  Drop them and use
      bsg_unregister_queue() + blk_cleanup_queue() instead.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarMike Christie <michaelc@cs.wisc.edu>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: James Smart <james.smart@emulex.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      86072d81
    • Tejun Heo's avatar
      mempool: add @gfp_mask to mempool_create_node() · a91a5ac6
      Tejun Heo authored
      mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
      blk_init_free_list(), is about to be updated to use other allocation
      flags - add @gfp_mask argument to the function.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a91a5ac6
    • Tejun Heo's avatar
      blkcg: make root blkcg allocation use %GFP_KERNEL · 15974993
      Tejun Heo authored
      Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
      from __blkg_lookup_create() for root blkcg creation.  This could make
      policy fail unnecessarily.
      
      Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
      optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
      preload radix tree and preallocate blkg with %GFP_KERNEL before trying
      to create the root blkg.
      
      v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
         instead of ERR_PTR() value.  Fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      15974993
    • Tejun Heo's avatar
      blkcg: __blkg_lookup_create() doesn't need radix preload · 13589864
      Tejun Heo authored
      There's no point in calling radix_tree_preload() if preloading doesn't
      use more permissible GFP mask.  Drop preloading from
      __blkg_lookup_create().
      
      While at it, drop sparse locking annotation which no longer applies.
      
      v2: Vivek pointed out the odd preload usage.  Instead of updating,
          just drop it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      13589864
  4. 15 Jun, 2012 4 commits
    • Jan Kara's avatar
      scsi: Silence unnecessary warnings about ioctl to partition · 6d935928
      Jan Kara authored
      Sometimes, warnings about ioctls to partition happen often enough that they
      form majority of the warnings in the kernel log and users complain. In some
      cases warnings are about ioctls such as SG_IO so it's not good to get rid of
      the warnings completely as they can ease debugging of userspace problems
      when ioctl is refused.
      
      Since I have seen warnings from lots of commands, including some proprietary
      userspace applications, I don't think disallowing the ioctls for processes
      with CAP_SYS_RAWIO will happen in the near future if ever. So lets just
      stop warning for processes with CAP_SYS_RAWIO for which ioctl is allowed.
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: James Bottomley <JBottomley@parallels.com>
      CC: linux-scsi@vger.kernel.org
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6d935928
    • Asias He's avatar
      block: Drop dead function blk_abort_queue() · 76aaa510
      Asias He authored
      This function was only used by btrfs code in btrfs_abort_devices()
      (seems in a wrong way).
      
      It was removed in commit d07eb911,
      So, Let's remove the dead code to avoid any confusion.
      
      Changes in v2: update commit log, btrfs_abort_devices() was removed
      already.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-kernel@vger.kernel.org
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: default avatarAsias He <asias@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      76aaa510
    • Asias He's avatar
      block: Mitigate lock unbalance caused by lock switching · 5e5cfac0
      Asias He authored
      Commit 777eb1bf disconnects externally
      supplied queue_lock before blk_drain_queue(). Switching the lock would
      introduce lock unbalance because theads which have taken the external
      lock might unlock the internal lock in the during the queue drain. This
      patch mitigate this by disconnecting the lock after the queue draining
      since queue draining makes a lot of request_queue users go away.
      
      However, please note, this patch only makes the problem less likely to
      happen. Anyone who still holds a ref might try to issue a new request on
      a dead queue after the blk_cleanup_queue() finishes draining, the lock
      unbalance might still happen in this case.
      
       =====================================
       [ BUG: bad unlock balance detected! ]
       3.4.0+ #288 Not tainted
       -------------------------------------
       fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
       [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
       but there are no more locks to release!
      
       other info that might help us debug this:
       1 lock held by fio/17706:
        #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
       get_request_wait+0x19a/0x250
      
       stack backtrace:
       Pid: 17706, comm: fio Not tainted 3.4.0+ #288
       Call Trace:
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
        [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
        [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
        [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
        [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810e0079>] lock_release+0xd9/0x250
        [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
        [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
        [<ffffffff81328faa>] generic_make_request+0xca/0x100
        [<ffffffff81329056>] submit_bio+0x76/0xf0
        [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
        [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
        [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
        [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
        [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
        [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
        [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
        [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
        [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
        [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
        [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
        [<ffffffff811eae50>] sys_io_submit+0x10/0x20
        [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b
      
      Changes since v2: Update commit log to explain how the code is still
                        broken even if we delay the lock switching after the drain.
      Changes since v1: Update commit log as Tejun suggested.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAsias He <asias@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5e5cfac0
    • Asias He's avatar
      block: Avoid missed wakeup in request waitqueue · 458f27a9
      Asias He authored
      After hot-unplug a stressed disk, I found that rl->wait[] is not empty
      while rl->count[] is empty and there are theads still sleeping on
      get_request after the queue cleanup. With simple debug code, I found
      there are exactly nr_sleep - nr_wakeup of theads in D state. So there
      are missed wakeup.
      
        $ dmesg | grep nr_sleep
        [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
        $ vmstat 1
        1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0
      
      To quote Tejun:
      
        Ah, okay, freed_request() wakes up single waiter with the assumption
        that after the wakeup there will at least be one successful allocation
        which in turn will continue the wakeup chain until the wait list is
        empty - ie. waiter wakeup is dependent on successful request
        allocation happening after each wakeup.  With queue marked dead, any
        woken up waiter fails the allocation path, so the wakeup chaining is
        lost and we're left with hung waiters. What we need is wake_up_all()
        after drain completion.
      
      This patch fixes the missed wakeup by waking up all the theads which
      are sleeping on wait queue after queue drain.
      
      Changes in v2: Drop waitqueue_active() optimization
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAsias He <asias@redhat.com>
      
      Fixed a bug by me, where stacked devices would oops on calling
      blk_drain_queue() since ->rq.wait[] do not get initialized unless
      it's a full queue setup.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      458f27a9
  5. 13 Jun, 2012 4 commits
  6. 12 Jun, 2012 4 commits
    • Lars Ellenberg's avatar
      drbd: fix null pointer dereference with on-congestion policy when diskless · 0d5934e3
      Lars Ellenberg authored
      We must not look at mdev->actlog, unless we have a get_ldev() reference.
      It also does not make much sense to try to disconnect or pull-ahead of
      the peer, if we don't have good local data.
      
      Only even consider congestion policies, if our local disk is D_UP_TO_DATE.
      Signed-off-by: default avatarPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: default avatarLars Ellenberg <lars.ellenberg@linbit.com>
      0d5934e3
    • Lars Ellenberg's avatar
      drbd: fix list corruption by failing but already aborted reads · 1ed25b26
      Lars Ellenberg authored
      If a read is aborted due to force-detach of a supposedly unresponsive
      local backing device, and retried on the peer, it can happen that the
      local request later still completes (hopefully with an error).
      As it may already have been completed to upper layers meanwhile,
      it must not be retried again now.
      Signed-off-by: default avatarPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: default avatarLars Ellenberg <lars.ellenberg@linbit.com>
      1ed25b26
    • Lars Ellenberg's avatar
      drbd: fix access of unallocated pages and kernel panic · 4eccc579
      Lars Ellenberg authored
      BUG: unable to handle kernel NULL pointer dereference at (null)
      ...
       [<d1e17561>] ? _drbd_bm_set_bits+0x151/0x240 [drbd]
       [<d1e236f8>] ? receive_bitmap+0x4f8/0xbc0 [drbd]
      
      This fixes an off-by-one error in the receive_bitmap() path,
      if run-length encoded bitmap transfer is enabled.
      
      If the bitmap is an exact multiple of PAGE_SIZE, which means the visible
      capacity of the drbd device is an exact multiple of 128 MiB (for 4k page
      size), and bitmap compression (use-rle) is enabled (which became default
      with 8.4), and the very last bit is dirty and reported in an rle
      comressed bitmap packet, we ended up trying to kmap_atomic a page pointer
      that does not exist (bitmap->bm_pages[last index + 1]).
      
      bug introduced by:
          Date:   Fri Jul 24 15:33:24 2009 +0200
          set bits: optimize for complete last word, fix off-by-one-word corner case
      
      made effective by:
          Date:   Thu Dec 16 00:32:38 2010 +0100
          drbd: get rid of unused debug code
      
          Long time ago, we had paranoia code in the bitmap that allocated one
          extra word, assigned a magic value, and checked on every occasion that
          the magic value was still unchanged.
      
          That debug code is unused, the extra long word complicates code a bit.
          Get rid of it.
      
      No-one triggered this bug in the last few years, because a large subset
      of our userbase is unaffected:
       * typically the last few blocks of a device are not modified
         frequently, and remain unset
       * use-rle was disabled by default in drbd < 8.4
       * those with slightly "odd" device sizes, or
       * drbd internal meta data (which will skew the device size slightly,
         thus makes it harder to have a bug relevant device size)
      Signed-off-by: default avatarPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: default avatarLars Ellenberg <lars.ellenberg@linbit.com>
      4eccc579
    • Konrad Rzeszutek Wilk's avatar
      xen/blkfront: Add WARN to deal with misbehaving backends. · 6878c32e
      Konrad Rzeszutek Wilk authored
      Part of the ring structure is the 'id' field which is under
      control of the frontend. The frontend stamps it with "some"
      value (this some in this implementation being a value less
      than BLK_RING_SIZE), and when it gets a response expects
      said value to be in the response structure. We have a check
      for the id field when spolling new requests but not when
      de-spolling responses.
      
      We also add an extra check in add_id_to_freelist to make
      sure that the 'struct request' was not NULL - as we cannot
      pass a NULL to __blk_end_request_all, otherwise that crashes
      (and all the operations that the response is dealing with
      end up with __blk_end_request_all).
      
      Lastly we also print the name of the operation that failed.
      
      [v1: s/BUG/WARN/ suggested by Stefano]
      [v2: Add extra check in add_id_to_freelist]
      [v3: Redid op_name per Jan's suggestion]
      [v4: add const * and add WARN on failure returns]
      Acked-by: default avatarJan Beulich <jbeulich@suse.com>
      Acked-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      6878c32e
  7. 06 Jun, 2012 1 commit
  8. 05 Jun, 2012 2 commits
  9. 04 Jun, 2012 8 commits
    • Tejun Heo's avatar
      blkcg: fix blkg_alloc() failure path · 9b2ea86b
      Tejun Heo authored
      When policy data allocation fails in the middle, blkg_alloc() invokes
      blkg_free() to destroy the half constructed blkg.  This ends up
      calling pd_exit_fn() on policy datas which didn't go through
      pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
      immediately after each policy data allocation.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9b2ea86b
    • Tejun Heo's avatar
      block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED · ffea73fc
      Tejun Heo authored
      cfq may be built w/ or w/o blkcg support depending on
      CONFIG_CFQ_CGROUP_IOSCHED.  If blkcg support is disabled, most of
      related code is ifdef'd out but some part is left dangling -
      blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
      calls are made on it.
      
      Feeding zero filled policy to blkcg_policy_register() is incorrect and
      triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
      !CONFIG_CFQ_GROUP_IOSCHED.
      
       ------------[ cut here ]------------
       WARNING: at block/blk-cgroup.c:867
       Modules linked in:
       Modules linked in:
       CPU: 3 Not tainted 3.4.0-09547-gfb21affa #1
       Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
       Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
      	    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
       Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
      	    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
      	    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
      	    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
       Krnl Code: 00000000003d76be: a7280001           lhi     %r2,1
      	    00000000003d76c2: a7f4ffdf           brc     15,3d7680
      	   #00000000003d76c6: a7f40001           brc     15,3d76c8
      	   >00000000003d76ca: a7c8ffea           lhi     %r12,-22
      	    00000000003d76ce: a7f4ffce           brc     15,3d766a
      	    00000000003d76d2: a7f40001           brc     15,3d76d4
      	    00000000003d76d6: a7c80000           lhi     %r12,0
      	    00000000003d76da: a7f4ffc2           brc     15,3d765e
       Call Trace:
       ([<0000000000b6f000>] initcall_debug+0x0/0x4)
        [<0000000000989e8a>] cfq_init+0x62/0xd4
        [<00000000001000ba>] do_one_initcall+0x3a/0x170
        [<000000000096fb60>] kernel_init+0x214/0x2bc
        [<0000000000623202>] kernel_thread_starter+0x6/0xc
        [<00000000006231fc>] kernel_thread_starter+0x0/0xc
       no locks held by swapper/0/1.
       Last Breaking-Event-Address:
        [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
       ---[ end trace b8ef4903fcbf9dd3 ]---
      
      This patch fixes the problem by ensuring all blkcg support code is
      inside CONFIG_CFQ_GROUP_IOSCHED.
      
      * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
        inside the first CONFIG_CFQ_GROUP_IOSCHED block.  __maybe_unused is
        dropped from blkcg_policy_cfq decl.
      
      * blkcg_deactivate_poilcy() invocation is moved inside ifdef.  This
        also makes the activation logic match cfq_init_queue().
      
      * All blkcg_policy_[un]register() invocations are moved inside ifdef.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ffea73fc
    • Tejun Heo's avatar
      block: fix return value on cfq_init() failure · fd794956
      Tejun Heo authored
      cfq_init() would return zero after kmem cache creation failure.  Fix
      so that it returns -ENOMEM.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fd794956
    • Sachin Kamat's avatar
      mtip32xx: Remove version.h header file inclusion · 87c9ea76
      Sachin Kamat authored
      version.h header file inclusion is no longer required.
      Signed-off-by: default avatarSachin Kamat <sachin.kamat@linaro.org>
      87c9ea76
    • Kukjin Kim's avatar
      gpio/samsung: fix the typo 'exynos5_xxx' instead of 'exonys5_xxx' · 5041caa4
      Kukjin Kim authored
      Should be 'exynos5_xxx' instead of 'exonys5_xxx'.
      
      It happened at the commit 30b84288 ("Merge tag 'soc2' of
      git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc")
      during v3.5 merge window.
      Signed-off-by: default avatarKukjin Kim <kgene.kim@samsung.com>
      [ My bad  - Linus ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5041caa4
    • Linus Torvalds's avatar
      Merge branch 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 4d578573
      Linus Torvalds authored
      Pull some left-over PM patches from Rafael J. Wysocki.
      
      * 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification
        ACPI / PM: Make __acpi_bus_get_power() cover D3cold correctly
        ACPI / PM: Fix error messages in drivers/acpi/bus.c
        rtc-cmos / PM: report wakeup event on ACPI RTC alarm
        ACPI / PM: Generate wakeup events on fixed power button
      4d578573
    • Linus Torvalds's avatar
      Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks" · 68e3e926
      Linus Torvalds authored
      This reverts commit 5ceb9ce6.
      
      That commit seems to be the cause of the mm compation list corruption
      issues that Dave Jones reported.  The locking (or rather, absense
      there-of) is dubious, as is the use of the 'page' variable once it has
      been found to be outside the pageblock range.
      
      So revert it for now, we can re-visit this for 3.6.  If we even need to:
      as Minchan Kim says, "The patch wasn't a bug fix and even test workload
      was very theoretical".
      Reported-and-tested-by: default avatarDave Jones <davej@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68e3e926
    • Hugh Dickins's avatar
      mm: fix warning in __set_page_dirty_nobuffers · 752dc185
      Hugh Dickins authored
      New tmpfs use of !PageUptodate pages for fallocate() is triggering the
      WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
      is called from migrate_page_copy() for compaction.
      
      It is anomalous that migration should use __set_page_dirty_nobuffers()
      on an address_space that does not participate in dirty and writeback
      accounting; and this has also been observed to insert surprising dirty
      tags into a tmpfs radix_tree, despite tmpfs not using tags at all.
      
      We should probably give migrate_page_copy() a better way to preserve the
      tag and migrate accounting info, when mapping_cap_account_dirty().  But
      that needs some more work: so in the interim, avoid the warning by using
      a simple SetPageDirty on PageSwapBacked pages.
      Reported-and-tested-by: default avatarDave Jones <davej@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      752dc185
  10. 03 Jun, 2012 3 commits
    • Linus Torvalds's avatar
      vfs: move inode stat information closer together · 2f9d3df8
      Linus Torvalds authored
      The comment above it says "Stat data, not accessed from path walking",
      but in fact some of inode fields we use for the common stat data was way
      down at the end of the inode, causing unnecessary cache misses for the
      common stat operations.
      
      The inode structure is pretty big, and this can change padding depending
      on field width, but at least on the common 64-bit configurations this
      doesn't change the size.  Some of our inode layout has historically been
      to tro to avoid unnecessary padding fields, but cache locality is at
      least as important for layout, if not more.
      
      Noticed by looking at kernel profiles, and noticing that the "i_blkbits"
      access stood out like a sore thumb.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2f9d3df8
    • Linus Torvalds's avatar
      Linux 3.5-rc1 · f8f5701b
      Linus Torvalds authored
      f8f5701b
    • Linus Torvalds's avatar
      Merge tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm · 912afc36
      Linus Torvalds authored
      Pull device-mapper updates from Alasdair G Kergon:
       "Improve multipath's retrying mechanism in some defined circumstances
        and provide a simple reserve/release mechanism for userspace tools to
        access thin provisioning metadata while the pool is in use."
      
      * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
        dm thin: provide userspace access to pool metadata
        dm thin: use slab mempools
        dm mpath: allow ioctls to trigger pg init
        dm mpath: delay retry of bypassed pg
        dm mpath: reduce size of struct multipath
      912afc36
  11. 02 Jun, 2012 1 commit
    • Joe Thornber's avatar
      dm thin: provide userspace access to pool metadata · cc8394d8
      Joe Thornber authored
      This patch implements two new messages that can be sent to the thin
      pool target allowing it to take a snapshot of the _metadata_.  This,
      read-only snapshot can be accessed by userland, concurrently with the
      live target.
      
      Only one metadata snapshot can be held at a time.  The pool's status
      line will give the block location for the current msnap.
      
      Since version 0.1.5 of the userland thin provisioning tools, the
      thin_dump program displays the msnap as follows:
      
          thin_dump -m <msnap root> <metadata dev>
      
      Available here: https://github.com/jthornber/thin-provisioning-tools
      
      Now that userland can access the metadata we can do various things
      that have traditionally been kernel side tasks:
      
           i) Incremental backups.
      
           By using metadata snapshots we can work out what blocks have
           changed over time.  Combined with data snapshots we can ensure
           the data doesn't change while we back it up.
      
           A short proof of concept script can be found here:
      
           https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
      
           ii) Migration of thin devices from one pool to another.
      
           iii) Merging snapshots back into an external origin.
      
           iv) Asyncronous replication.
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      cc8394d8