1. 22 Jun, 2009 12 commits
    • dm: always hold bdev reference · 32a926da
      Mikulas Patocka authored
      Fix a potential deadlock when creating multiple snapshots by holding a
      reference to struct block_device for the whole lifecycle of every dm
      device instead of obtaining it independently at each point it is needed.
      
      bdget_disk() was called while the device was being suspended, in
      dm_suspend().  However there could be other devices already suspended,
      for example when creating additional snapshots of a device. bdget_disk()
      can wait for IO and allocate memory resulting in waiting for the
      already-suspended device - deadlock.
      
      This patch changes the code so that it gets the reference to struct
      block_device when struct mapped_device is allocated and initialized in
      alloc_dev() where it is always OK to allocate memory or wait for I/O.
      It drops the reference when it is destroyed in free_dev().  Thus there
      is no call to bdget_disk() while any device is suspended.
      
      Previously unlock_fs() was called only if bdev was held.  Now it is
      called unconditionally, but the superfluous calls are harmless because
      it returns immediately if the filesystem was not previously frozen.
      
      This patch also now allows the device size to be changed in a
      noflush suspend because the bdev is held.  This has no adverse effect.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      32a926da
    • dm: rename suspended_bdev to bdev · db8fef4f
      Mikulas Patocka authored
      Rename suspended_bdev to bdev.
      
      This patch doesn't change any functionality, just renames the variable.
      In the next patch, the variable will be used even for non-suspended devices.
      
      (Pre-requisite for the per-target barrier support patches.)
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      db8fef4f
    • dm exception store: fix exstore lookup to be case insensitive · f6bd4eb7
      Jonathan Brassow authored
      When snapshots are created using 'p' instead of 'P' as the
      exception store type, the device-mapper table loading fails.
      
      This patch makes the code case insensitive as intended and fixes some
      regressions reported with device-mapper snapshots.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      f6bd4eb7
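A case-insensitive type lookup of the kind described above could be sketched as follows. This is a minimal userspace illustration, not the patch itself: the `store_type` struct, the table contents, and `lookup_store_type()` are hypothetical names, and `strcasecmp()` is the POSIX userspace call.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* Hypothetical registry entry; not the kernel's data structure. */
struct store_type {
    const char *name;
};

/* Illustrative table: the type names here are assumptions for the sketch. */
static const struct store_type store_types[] = {
    { "P" },
    { "N" },
};

/* Return the entry whose name matches, ignoring case, or NULL if none does. */
static const struct store_type *lookup_store_type(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(store_types) / sizeof(store_types[0]); i++)
        if (strcasecmp(store_types[i].name, name) == 0)
            return &store_types[i];
    return NULL;
}
```

With a lookup like this, `lookup_store_type("p")` and `lookup_store_type("P")` resolve to the same entry, while an unknown name yields NULL.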
    • dm: use i_size_read · 5657e8fa
      Mikulas Patocka authored
      Use i_size_read() instead of reading i_size.
      
      If someone changes the size of the device simultaneously, i_size_read
      is guaranteed to return a valid value (either the old one or the new one).
      
      i_size can return some intermediate invalid value (on 32-bit computers
      with 64-bit i_size, the reads to both halves of i_size can be interleaved
      with updates to i_size, resulting in garbage being returned).
      
      Cc: stable@kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      5657e8fa
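The torn read described above can be modelled in a few lines. The sketch below is a single-threaded userspace model under stated assumptions: the `size64` struct and all function names are hypothetical, the "concurrent" writer is simulated by a call in the middle of the read, and the memory barriers a real sequence-counter implementation needs are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit size stored as two 32-bit halves, guarded by a sequence counter. */
struct size64 {
    uint32_t seq;   /* odd while an update is in progress */
    uint32_t lo;
    uint32_t hi;
};

static void size_write(struct size64 *s, uint64_t v)
{
    s->seq++;                      /* begin update: counter becomes odd */
    s->lo = (uint32_t)v;
    s->hi = (uint32_t)(v >> 32);
    s->seq++;                      /* end update: counter becomes even */
}

/* Unprotected read: low half taken before an update, high half after it. */
static uint64_t torn_read(struct size64 *s, uint64_t new_size)
{
    uint32_t lo = s->lo;

    size_write(s, new_size);       /* simulated concurrent writer */
    return ((uint64_t)s->hi << 32) | lo;
}

/* Protected read: retry while an update is in progress or has intervened. */
static uint64_t size_read(const struct size64 *s)
{
    uint32_t seq;
    uint64_t v;

    do {
        seq = s->seq;
        v = ((uint64_t)s->hi << 32) | s->lo;
    } while ((seq & 1) || seq != s->seq);
    return v;
}
```

Growing the size from 0x00000000FFFFFFFF to 0x0000000100000000 makes `torn_read()` return 0x00000001FFFFFFFF, which is neither the old nor the new value; `size_read()` returns only values that were actually stored.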
    • dm: avoid unsupported spanning of md stripe boundaries · 8cbeb67a
      Mikulas Patocka authored
      A bio that has two or more vector entries, size less than or equal to
      page size, that crosses a stripe boundary of an underlying md device is
      accepted by device mapper (it conforms to all its limits) but not by the
      underlying device.
      
      The fix is: If device mapper selects the one-page maximum request size,
      it also needs to set its own q->merge_bvec_fn to reject any bios with
      multiple vector entries that span more pages.
      
      The problem was discovered in the following scenario:
        * MD - RAID-0
        * LV on the top of it (raid1, snapshot or striped with chunk
          size/stripe larger than RAID-0 stripe)
        * one of the logical volumes is exported to xen domU
        * inside xen domU it is partitioned, the key point is that the partition
          must be unaligned on page boundary (fdisk normally aligns the partition to
          63 sectors which will trigger it)
        * install the system on the partitioned disk in domU
      This causes I/O failures in dom0.
      Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      8cbeb67a
    • dm mpath: flush keventd queue in destructor · 53b351f9
      Mikulas Patocka authored
      The commit fe9cf30e moves dm table event
      submission from kmultipath queue to kernel kevent queue to avoid a
      deadlock.
      
      There is a possibility of race condition because kevent queue is not flushed
      in the multipath destructor. The scenario is:
      - some event happens and is queued to keventd
      - keventd thread is delayed due to scheduling latency or some other work
      - multipath device is destroyed
      - keventd now attempts to process work_struct that is residing in already
        released memory.
      
      The patch flushes the keventd queue in the multipath destructor.
      I've already fixed a similar bug in dm-raid1.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      53b351f9
    • dm raid1: keep retrying alloc if mempool_alloc failed · a72986c5
      Mikulas Patocka authored
      If the code can't handle allocation failures, use __GFP_NOFAIL so that
      in case of memory pressure the allocator will retry indefinitely and
      won't return NULL which would cause a crash in the function.
      
      This is still not a correct fix, it may cause a classic deadlock when
      memory manager waits for I/O being done and I/O waits for some free memory.
      I/O code shouldn't allocate any memory. But in this case it probably
      doesn't matter much in practice, people usually do not swap on RAID.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a72986c5
    • dm mpath: call activate fn for each path in pg_init · e54f77dd
      Chandra Seetharaman authored
      Fixed a problem affecting reinstatement of passive paths.
      
      Before we moved the hardware handler from dm to SCSI, it performed a pg_init
      for a path group and didn't maintain any state about each path in hardware
      handler code.
      
      But in SCSI dh, such state is now maintained, as we want to fail I/O early on a
      path if it is not the active path.
      
      All the hardware handlers have a state now and set to active or some form of
      inactive.  They have prep_fn() which uses this state to fail the I/O without
      it ever being sent to the device.
      
      So in effect when dm-multipath calls scsi_dh_activate(), activate is
      sent to only one path and the "state" of that path is changed appropriately
      to "active" while other paths in the same path group are never changed
      as they never got an "activate".
      
      In order to make sure all the paths in a path group get their state set
      properly when a pg_init happens, we need to call scsi_dh_activate() on
      all paths in a path group.
      
      Doing this at the hardware handler layer is not a good option as we
      want the multipath layer to define the relationship between path and path
      groups and not the hardware handler.
      
      Attached patch sends an "activate" on each path in a path group when a
      path group is switched. It also sends an activate when a path is reinstated.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      e54f77dd
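The difference between activating one path and activating every path in a group can be illustrated with a small model. Everything below is hypothetical (the `path` and `path_group` structs, the state enum, and the function names); it only mirrors the behaviour described in the message and does not call the real scsi_dh interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-path state, as a hardware handler might track it. */
enum path_state { PATH_PASSIVE, PATH_ACTIVE };

struct path {
    enum path_state state;
};

struct path_group {
    struct path *paths;
    size_t nr_paths;
};

/* Stand-in for sending an "activate" to one path. */
static void activate_path(struct path *p)
{
    p->state = PATH_ACTIVE;
}

/* Old behaviour in this model: only one path learns that it is active. */
static void pg_init_first_only(struct path_group *pg)
{
    if (pg->nr_paths > 0)
        activate_path(&pg->paths[0]);
}

/* Behaviour described by the patch: every path in the group is activated. */
static void pg_init_all(struct path_group *pg)
{
    size_t i;

    for (i = 0; i < pg->nr_paths; i++)
        activate_path(&pg->paths[i]);
}
```

In this model, a state-based early-failure check would still reject I/O on the second path after `pg_init_first_only()`, but not after `pg_init_all()`.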
    • dm mpath: change attached scsi_dh · a0cf7ea9
      Hannes Reinecke authored
      When specifying a different hardware handler via multipath
      features we should be able to override the built-in defaults.
      
      The problem here is the hardware table from scsi_dh is compiled
      in and cannot be changed from userland. The multipath.conf OTOH
      is purely user-defined and, what's more, the user might have a valid
      reason for modifying it.
      (E.g. EMC Clariion can well be run in PNR mode even though ALUA is
      active, or the user might want to try ALUA on any as-of-yet unknown
      devices)
      
      So _not_ allowing multipath to override the device handler setting
      will just add to the confusion and make error tracking even more
      difficult.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a0cf7ea9
    • dm: sysfs skip output when device is being destroyed · 4d89b7b4
      Milan Broz authored
      Do not process sysfs attributes when device is being destroyed.
      
      Otherwise code can cause
        BUG_ON(test_bit(DMF_FREEING, &md->flags));
      in dm_put() call.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4d89b7b4
    • dm mpath: validate hw_handler argument count · e094f4f1
      Mikulas Patocka authored
      Fix arg count parsing error in hw handlers.
      
      Cc: stable@kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      e094f4f1
    • dm mpath: validate table argument count · 0e0497c0
      Mikulas Patocka authored
      The parser reads the argument count as a number but doesn't check that
      sufficient arguments are supplied. This command triggers the bug:
      
      dmsetup create mpath --table "0 `blockdev --getsize /dev/mapper/cr0`
          multipath 0 0 2 1 round-robin 1000 0 1 1 /dev/mapper/cr0
          round-robin 0 1 1 /dev/mapper/cr1 1000"
      kernel BUG at drivers/md/dm-mpath.c:530!
      
      Cc: stable@kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      0e0497c0
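The missing check described above amounts to comparing a parsed count against the number of words still available. The following is a userspace sketch under stated assumptions: the `arg_cursor` struct, `next_word()` and `read_count()` are illustrative names and are not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cursor over a table line already split into words. */
struct arg_cursor {
    unsigned argc;   /* words remaining */
    char **argv;     /* next word */
};

/* Consume and return the next word, or NULL when none remain. */
static const char *next_word(struct arg_cursor *c)
{
    if (c->argc == 0)
        return NULL;
    c->argc--;
    return *c->argv++;
}

/*
 * Read a count and confirm that at least that many words follow it.
 * Returns 0 on success, -1 if the count is malformed or too large.
 */
static int read_count(struct arg_cursor *c, unsigned *count)
{
    const char *word = next_word(c);
    char *end;
    unsigned long n;

    if (word == NULL)
        return -1;
    n = strtoul(word, &end, 10);
    if (end == word || *end != '\0')
        return -1;                 /* not a plain decimal number */
    if (n > c->argc)
        return -1;                 /* fewer words supplied than promised */
    *count = (unsigned)n;
    return 0;
}
```

A line that announces two arguments and supplies two is accepted; one that announces three and supplies two is rejected before any of the promised words are consumed.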
  2. 21 Jun, 2009 22 commits
  3. 20 Jun, 2009 6 commits
    • mm: page_alloc: clear PG_locked before checking flags on free · c277331d
      Johannes Weiner authored
      da456f14 "page allocator: do not disable interrupts in free_page_mlock()" moved
      the PG_mlocked clearing after the flag sanity checking which makes mlocked
      pages always trigger 'bad page'.  Fix this by clearing the bit up front.
      Reported-and-debugged-by: Peter Chubb <peter.chubb@nicta.com.au>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Tested-by: Maxim Levitsky <maximlevitsky@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c277331d
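The ordering problem described above can be shown with a toy flag word. This is a sketch with hypothetical names and bit positions, not the page allocator's code: it assumes only what the message states, namely that the sanity check treats a set mlocked bit as a bad page.

```c
#include <assert.h>

/* Hypothetical flag bits for the sketch. */
#define PG_MLOCKED          (1u << 0)
#define PG_OTHER_BAD        (1u << 1)
#define FLAGS_CHECK_AT_FREE (PG_MLOCKED | PG_OTHER_BAD)

/* Regressed ordering: check first, clear afterwards. Returns 1 for "bad page". */
static int free_page_check_then_clear(unsigned *flags)
{
    int bad = (*flags & FLAGS_CHECK_AT_FREE) != 0;

    *flags &= ~PG_MLOCKED;
    return bad;
}

/* Fixed ordering: clear the mlocked bit up front, then run the check. */
static int free_page_clear_then_check(unsigned *flags)
{
    *flags &= ~PG_MLOCKED;
    return (*flags & FLAGS_CHECK_AT_FREE) != 0;
}
```

With only the mlocked bit set, the first ordering reports a bad page and the second does not; a page carrying another checked flag is still reported by both.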
    • x86, 64-bit: Clean up user address masking · 9063c61f
      Linus Torvalds authored
      The discussion about using "access_ok()" in get_user_pages_fast() (see
      commit 7f818906: "x86: don't use
      'access_ok()' as a range check in get_user_pages_fast()" for details and
      end result), made us notice that x86-64 was really being very sloppy
      about virtual address checking.
      
      So be way more careful and straightforward about masking x86-64 virtual
      addresses:
      
       - All the VIRTUAL_MASK* variants now cover half of the address
         space, it's not like we can use the full mask on a signed
         integer, and the larger mask just invites mistakes when
         applying it to either half of the 48-bit address space.
      
       - /proc/kcore's kc_offset_to_vaddr() becomes a lot more
         obvious when it transforms a file offset into a
         (kernel-half) virtual address.
      
       - Unify/simplify the 32-bit and 64-bit USER_DS definition to
         be based on TASK_SIZE_MAX.
      
      This cleanup and more careful/obvious user virtual address checking also
      uncovered a buglet in the x86-64 implementation of strnlen_user(): it
      would do an "access_ok()" check on the whole potential area, even if the
      string itself was much shorter, and thus return an error even for valid
      strings. Our sloppy checking had hidden this.
      
      So this fixes 'strnlen_user()' to do this properly, the same way we
      already handled user strings in 'strncpy_from_user()'.  Namely by just
      checking the first byte, and then relying on fault handling for the
      rest.  That always works, since we impose a guard page that cannot be
      mapped at the end of the user space address space (and even if we
      didn't, we'd have the address space hole).
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9063c61f
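The half-address-space mask and the file-offset conversion mentioned above can be worked through numerically. The sketch below assumes a 48-bit virtual address space split into two 47-bit halves; the macro and function names are illustrative and may not match the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for the sketch: 48-bit addresses, so each half spans 47 bits. */
#define VIRTUAL_MASK_SHIFT 47
#define VIRTUAL_MASK ((UINT64_C(1) << VIRTUAL_MASK_SHIFT) - 1)

/* Map a file offset into the kernel (upper) half by setting the high bits. */
static uint64_t offset_to_kernel_vaddr(uint64_t offset)
{
    return offset | ~VIRTUAL_MASK;
}

/* Inverse mapping: keep only the bits covered by the half-space mask. */
static uint64_t kernel_vaddr_to_offset(uint64_t vaddr)
{
    return vaddr & VIRTUAL_MASK;
}
```

Under these assumptions the mask is 0x00007FFFFFFFFFFF, offset 0 maps to 0xFFFF800000000000, and converting an offset to an address and back returns the original offset for any offset that fits in 47 bits.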
    • Merge branch 'irq-fixes-for-linus' of... · 2453d6ff
      Linus Torvalds authored
      Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        genirq, irq.h: Fix kernel-doc warnings
        genirq: fix comment to say IRQ_WAKE_THREAD
      2453d6ff
    • Merge branch 'perfcounters-fixes-for-linus' of... · 12e24f34
      Linus Torvalds authored
      Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
        perfcounter: Handle some IO return values
        perf_counter: Push perf_sample_data through the swcounter code
        perf_counter tools: Define and use our own u64, s64 etc. definitions
        perf_counter: Close race in perf_lock_task_context()
        perf_counter, x86: Improve interactions with fast-gup
        perf_counter: Simplify and fix task migration counting
        perf_counter tools: Add a data file header
        perf_counter: Update userspace callchain sampling uses
        perf_counter: Make callchain samples extensible
        perf report: Filter to parent set by default
        perf_counter tools: Handle lost events
        perf_counter: Add event overlow handling
        fs: Provide empty .set_page_dirty() aop for anon inodes
        perf_counter: tools: Makefile tweaks for 64-bit powerpc
        perf_counter: powerpc: Add processor back-end for MPC7450 family
        perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
        perf_counter: powerpc: Change how processor-specific back-ends get selected
        perf_counter: powerpc: Use unsigned long for register and constraint values
        perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
        perf_counter tools: Add and use isprint()
        ...
      12e24f34
    • Merge branch 'sched-fixes-for-linus' of... · 1eb51c33
      Linus Torvalds authored
      Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        sched: Fix out of scope variable access in sched_slice()
        sched: Hide runqueues from direct refer at source code level
        sched: Remove unneeded __ref tag
        sched, x86: Fix cpufreq + sched_clock() TSC scaling
      1eb51c33
    • Merge branch 'tracing-fixes-for-linus' of... · b0b7065b
      Linus Torvalds authored
      Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
        tracing/urgent: warn in case of ftrace_start_up inbalance
        tracing/urgent: fix unbalanced ftrace_start_up
        function-graph: add stack frame test
        function-graph: disable when both x86_32 and optimize for size are configured
        ring-buffer: have benchmark test print to trace buffer
        ring-buffer: do not grab locks in nmi
        ring-buffer: add locks around rb_per_cpu_empty
        ring-buffer: check for less than two in size allocation
        ring-buffer: remove useless compile check for buffer_page size
        ring-buffer: remove useless warn on check
        ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
        tracing: update sample event documentation
        tracing/filters: fix race between filter setting and module unload
        tracing/filters: free filter_string in destroy_preds()
        ring-buffer: use commit counters for commit pointer accounting
        ring-buffer: remove unused variable
        ring-buffer: have benchmark test handle discarded events
        ring-buffer: prevent adding write in discarded area
        tracing/filters: strloc should be unsigned short
        tracing/filters: operand can be negative
        ...
      
      Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually
      b0b7065b