1. 26 Jul, 2012 3 commits
  2. 25 Jul, 2012 2 commits
  3. 22 Jul, 2012 12 commits
    • Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e2b34e31
      Linus Torvalds authored
      Pull an x86/build change from Ingo Molnar.
      
      This makes the default stack alignment on x86-64 be just 8, allowing for
      improved code generation (it can avoid some unnecessary extra alignment
      logic and use just pure push/pop sequences) and smaller stack frames.
      
      We can't generally do SSE with 16-byte alignment issues in the kernel anyway.
      
      * 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported
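The `=3` in the flag is an exponent: GCC documents `-mpreferred-stack-boundary=num` as keeping the stack aligned to 2^num bytes, so 3 means 8 bytes and the default of 4 means 16. The following is a minimal sketch of that arithmetic only; the helper names are hypothetical, and real frame layout involves more than rounding a size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of alignment implied by
 * -mpreferred-stack-boundary=<exp>, which GCC defines as 2^exp bytes. */
static size_t stack_boundary_bytes(unsigned int exp)
{
    assert(exp < sizeof(size_t) * 8);
    return (size_t)1 << exp;
}

/* Hypothetical helper: round a frame size up to that boundary. */
static size_t round_up_frame(size_t frame, unsigned int exp)
{
    size_t align = stack_boundary_bytes(exp);

    return (frame + align - 1) & ~(align - 1);
}
```

Under this simplified model, a 24-byte frame stays at 24 bytes with a boundary of 3 but is padded to 32 bytes with a boundary of 4, which is the kind of saving the merge message refers to.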
    • Merge branch 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2bd3488f
      Linus Torvalds authored
      Pull x86/uv changes from Ingo Molnar:
       "UV2 BAU productization fixes.
      
        The BAU (Broadcast Assist Unit) is SGI's fancy out of line way on UV
        hardware to do TLB flushes, instead of the normal APIC IPI methods.
        The commits here fix / work around hangs in their latest hardware
        iteration (UV2).
      
        My understanding is that the main purpose of the out of line
        signalling channel is to improve scalability: the UV APIC hardware
        glue does not handle broadcasting to many CPUs very well, and this
        matters most for TLB shootdowns.
      
        [ I don't agree with all aspects of the current approach: in hindsight
          it would have been better to link the BAU at the IPI/APIC driver
          level instead of the TLB shootdown level, where TLB flushes are
          really just one of the uses of broadcast SMP messages.  Doing that
          would improve scalability in some other ways and it would also
          remove a few uglies from the TLB path.  It would also be nice to
          push more is_uv_system() tests into proper x86_init or x86_platform
          callbacks.  Cliff? ]"
      
      * 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/uv: Work around UV2 BAU hangs
        x86/uv: Implement UV BAU runtime enable and disable control via /proc/sgi_uv/
        x86/uv: Fix the UV BAU destination timeout period
    • Merge branch 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d5d96ed2
      Linus Torvalds authored
      Pull x86/reboot changes from Ingo Molnar:
       "Now that the revamped x86 real-mode trampoline code is upstream and
        seems to be working well, we can extend the 64-bit reboot code to be
        as capable as the 32-bit one."
      
      * 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86-64, reboot: Be more paranoid in 64-bit reboot=bios
        x86, reboot: Drop redundant write of reboot_mode
        x86-64, reboot: Allow reboot=bios and reboot-cpu override on x86-64
    • Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · bd3e57f9
      Linus Torvalds authored
      Pull x86 platform changes from Ingo Molnar:
       "This tree mostly involves various APIC driver cleanups/robustization,
        and vSMP motivated platform callback improvements/cleanups"
      
      Fix up trivial conflict due to printk cleanup right next to return value
      change.
      
      * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
        Revert "x86/early_printk: Replace obsolete simple_strtoul() usage with kstrtoint()"
        x86/apic/x2apic: Use multiple cluster members for the irq destination only with the explicit affinity
        x86/apic/x2apic: Limit the vector reservation to the user specified mask
        x86/apic: Optimize cpu traversal in __assign_irq_vector() using domain membership
        x86/vsmp: Fix vector_allocation_domain's return value
        irq/apic: Use config_enabled(CONFIG_SMP) checks to clean up irq_set_affinity() for UP
        x86/vsmp: Fix linker error when CONFIG_PROC_FS is not set
        x86/apic/es7000: Make apicid of a cluster (not CPU) from a cpumask
        x86/apic/es7000+summit: Always make valid apicid from a cpumask
        x86/apic/es7000+summit: Fix compile warning in cpu_mask_to_apicid()
        x86/apic: Fix ugly casting and branching in cpu_mask_to_apicid_and()
        x86/apic: Eliminate cpu_mask_to_apicid() operation
        x86/x2apic/cluster: Vector_allocation_domain() should return a value
        x86/apic/irq_remap: Silence a bogus pr_err()
        x86/vsmp: Ignore IOAPIC IRQ affinity if possible
        x86/apic: Make cpu_mask_to_apicid() operations check cpu_online_mask
        x86/apic: Make cpu_mask_to_apicid() operations return error code
        x86/apic: Avoid useless scanning thru a cpumask in assign_irq_vector()
        x86/apic: Try to spread IRQ vectors to different priority levels
        x86/apic: Factor out default vector_allocation_domain() operation
        ...
    • Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3fad0953
      Linus Torvalds authored
      Pull debug-for-linus git tree from Ingo Molnar.
      
      Fix up trivial conflict in arch/x86/kernel/cpu/perf_event_intel.c due to
      a printk() having changed to a pr_info() differently in the two branches.
      
      * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86: Move call to print_modules() out of show_regs()
        x86/mm: Mark free_initrd_mem() as __init
        x86/microcode: Mark microcode_id[] as __initconst
        x86/nmi: Clean up register_nmi_handler() usage
        x86: Save cr2 in NMI in case NMIs take a page fault (for i386)
        x86: Remove cmpxchg from i386 NMI nesting code
        x86: Save cr2 in NMI in case NMIs take a page fault
        x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>
    • Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a065de0d
      Linus Torvalds authored
      Pull x86/asm changes from Ingo Molnar:
       "Assorted single-commit improvements, as usual"
      
      * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mm/mtrr: Slightly simplify print_mtrr_state()
        x86/mm/mtrr: Fix alignment determination in range_to_mtrr()
        x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
        x86/alternatives: Use atomic_xchg() instead atomic_dec_and_test() for stop_machine_text_poke()
    • Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3992c032
      Linus Torvalds authored
      Pull timer core changes from Ingo Molnar:
       "Continued cleanups of the core time and NTP code, plus more nohz work
        preparing for tick-less userspace execution."
      
      * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        time: Rework timekeeping functions to take timekeeper ptr as argument
        time: Move xtime_nsec adjustment underflow handling timekeeping_adjust
        time: Move arch_gettimeoffset() usage into timekeeping_get_ns()
        time: Refactor accumulation of nsecs to secs
        time: Condense timekeeper.xtime into xtime_sec
        time: Explicitly use u32 instead of int for shift values
        time: Whitespace cleanups per Ingo's requests
        nohz: Move next idle expiry time record into idle logic area
        nohz: Move ts->idle_calls incrementation into strict idle logic
        nohz: Rename ts->idle_tick to ts->last_tick
        nohz: Make nohz API agnostic against idle ticks cputime accounting
        nohz: Separate idle sleeping time accounting from nohz logic
        timers: Improve get_next_timer_interrupt()
        timers: Add accounting of non deferrable timers
        timers: Consolidate base->next_timer update
        timers: Create detach_if_pending() and use it
    • Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 55acdddb
      Linus Torvalds authored
      Pull smp/hotplug changes from Ingo Molnar:
       "Various cleanups to the SMP hotplug code - a continuing effort of
        Thomas et al"
      
      * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        smpboot: Remove leftover declaration
        smp: Remove num_booting_cpus()
        smp: Remove ipi_call_lock[_irq]()/ipi_call_unlock[_irq]()
        POWERPC: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
        SPARC: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq()
        ia64: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq()
        x86-smp-remove-call-to-ipi_call_lock-ipi_call_unlock
        tile: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
        S390: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
        parisc: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
        mn10300: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
        hexagon: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
    • Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2eafeb6a
      Linus Torvalds authored
      Pull perf events changes from Ingo Molnar:
      
       "- kernel side:
      
         - Intel uncore PMU support for Nehalem and Sandy Bridge CPUs, we
           support both the events available via the MSR and via the PCI
           access space.
      
         - various uprobes cleanups and restructurings
      
         - PMU driver quirks by microcode version and required x86 microcode
           loader cleanups/robustization
      
         - various tracing robustness updates
      
         - static keys: remove obsolete static_branch()
      
        - tooling side:
      
         - GTK browser improvements
      
         - perf report browser: support screenshots to file
      
         - more automated tests
      
         - perf kvm improvements
      
         - perf bench refinements
      
         - build environment improvements
      
         - pipe mode improvements
      
         - libtraceevent updates, we have now hopefully merged most bits with
           the out of tree forked code base
      
        ... and many other goodies."
      
      * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (138 commits)
        tracing: Check for allocation failure in __tracing_open()
        perf/x86: Fix intel_perfmon_event_mapformatting
        jump label: Remove static_branch()
        tracepoint: Use static_key_false(), since static_branch() is deprecated
        perf/x86: Uncore filter support for SandyBridge-EP
        perf/x86: Detect number of instances of uncore CBox
        perf/x86: Fix event constraint for SandyBridge-EP C-Box
        perf/x86: Use 0xff as pseudo code for fixed uncore event
        perf/x86: Save a few bytes in 'struct x86_pmu'
        perf/x86: Add a microcode revision check for SNB-PEBS
        perf/x86: Improve debug output in check_hw_exists()
        perf/x86/amd: Unify AMD's generic and family 15h pmus
        perf/x86: Move Intel specific code to intel_pmu_init()
        perf/x86: Rename Intel specific macros
        perf/x86: Fix USER/KERNEL tagging of samples
        perf tools: Split event symbols arrays to hw and sw parts
        perf tools: Split out PE_VALUE_SYM parsing token to SW and HW tokens
        perf tools: Add empty rule for new line in event syntax parsing
        perf test: Use ARRAY_SIZE in parse events tests
        tools lib traceevent: Cleanup realloc use
        ...
    • Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 16d286e6
      Linus Torvalds authored
      Pull RCU changes from Ingo Molnar:
       "Quoting from Paul, the major features of this series are:
      
        1. Preventing latency spikes of more than 200 microseconds for
           kernels built with NR_CPUS=4096, which is reportedly becoming the
           default for some distros.  This is a first step, as it does not
           help with systems that actually -have- 4096 CPUs (work on this case
           is in progress, but is not yet ready for mainline).
      
           This category also includes improving concurrency of rcu_barrier(),
           placed here due to conflicts.  Posted to LKML at:
      
            https://lkml.org/lkml/2012/6/22/381
      
           Note that patches 18-22 of that series have been deferred to 3.7, as
           they have not yet proven themselves to be mainline-ready (and yes,
           these are the ones intended to get rid of RCU's latency spikes for
           systems that actually have 4096 CPUs).
      
        2. Updates to documentation and rcutorture fixes, the latter category
           including improvements to rcu_barrier() testing.  Posted to LKML at
      
            http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/04094.html.
      
        3. Miscellaneous fixes posted to LKML at:
      
            https://lkml.org/lkml/2012/6/22/500
      
           with the exception of the last commit, which was posted here:
      
            http://www.gossamer-threads.com/lists/linux/kernel/1561830
      
        4. RCU_FAST_NO_HZ fixes and improvements.  Posted to LKML at:
      
            http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/00006.html
            http://www.gossamer-threads.com/lists/linux/kernel/1561833
      
           The first four patches of the first series went into 3.5 to fix a
           regression.
      
        5. Code-style fixes.  These were posted to LKML at
      
            http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01180.html
            http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01181.html"
      
      * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
        rcu: Fix broken strings in RCU's source code.
        rcu: Fix code-style issues involving "else"
        rcu: Introduce check for callback list/count mismatch
        rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter
        rcu: Fix qlen_lazy breakage
        rcu: Round FAST_NO_HZ lazy timeout to nearest second
        rcu: The rcu_needs_cpu() function is not a quiescent state
        rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
        rcu: Add check for CPUs going offline with callbacks queued
        rcu: Disable preemption in rcu_blocking_is_gp()
        rcu: Prevent uninitialized string in RCU CPU stall info
        rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU
        rcu: Split RCU core processing out of __call_rcu()
        rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
        rcu: Make __call_rcu() handle invocation from idle
        rcu: Remove function versions of __kfree_rcu and __is_kfree_rcu_offset
        rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations
        rcu: Remove return value from rcu_assign_pointer()
        key: Remove extraneous parentheses from rcu_assign_keypointer()
        rcu: Remove return value from RCU_INIT_POINTER()
        ...
    • Merge branch 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ceee0e95
      Linus Torvalds authored
      Pull core/iommu changes from Ingo Molnar.
      
      * 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        iommu/dmar: Use pr_format() instead of PREFIX to tidy up pr_*() calls
        iommu/dmar: Reserve mmio space used by the IOMMU, if the BIOS forgets to
        iommu/dmar: Replace printks with appropriate pr_*()
    • Revert "x86/early_printk: Replace obsolete simple_strtoul() usage with kstrtoint()" · 36d93d88
      Ingo Molnar authored
      This reverts commit fbd24153.
      
      This commit is subtly buggy: kstrto*int() can return an error but
      it's not checked in every path. simple_strtoul() on the other hand
      could not fail, so this patch subtly introduces new failure modes.
      Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
      Link: http://lkml.kernel.org/r/1338424803.3569.5.camel@lorien2
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
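The failure mode described in this revert can be sketched in userspace C. Here `checked_strtoint()` is a hypothetical, simplified stand-in for a kstrtoint()-style parser (0 on success, negative errno on failure), and `parse_old()` uses the standard `strtoul()` to mimic a parser that cannot report errors. This is an illustration of the pattern, not the early_printk code.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a kstrtoint()-style parser:
 * returns 0 on success or a negative errno, and leaves *res
 * untouched on failure. */
static int checked_strtoint(const char *s, unsigned int base, int *res)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(s, &end, (int)base);
    if (end == s || *end != '\0')
        return -EINVAL;         /* no digits, or trailing characters */
    if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
        return -ERANGE;
    *res = (int)val;
    return 0;
}

/* Old style: cannot fail. Bad input yields 0, trailing text is ignored. */
static unsigned long parse_old(const char *s)
{
    return strtoul(s, NULL, 0);
}

/* New style done carefully: the error return is checked on every path,
 * and the caller supplies the value to use when parsing fails. */
static int parse_new(const char *s, int fallback)
{
    int val;

    if (checked_strtoint(s, 0, &val) != 0)
        return fallback;
    return val;
}
```

The point of the sketch is that a strict parser rejects inputs such as `"115200n8"` that the tolerant one accepted, so every call site needs an explicit decision about what happens on error.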
  4. 21 Jul, 2012 7 commits
  5. 20 Jul, 2012 13 commits
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · d75e2c9a
      Linus Torvalds authored
      Pull late MIPS fixes from Ralf Baechle:
       "This fixes a number of loose ends in the MIPS code and various bug
        fixes.
      
        Aside of dropping some patch that should not be in this pull request
        everything has sat in -next for quite a while and there are no known
        issues.
      
        The biggest patch in this patch set moves the allocation of an array
        that is aliased to a function (for runtime generated code) to
        assembler code.  This avoids an issue with certain toolchains when
        building for microMIPS."
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (35 commits)
        MIPS: PCI: Move fixups from __init to __devinit.
        MIPS: Fix bug.h MIPS build regression
        MIPS: sync-r4k: remove redundant irq operation
        MIPS: smp: Warn on too early irq enable
        MIPS: call set_cpu_online() on cpu being brought up with irq disabled
        MIPS: call ->smp_finish() a little late
        MIPS: Yosemite: delay irq enable to ->smp_finish()
        MIPS: SMTC: delay irq enable to ->smp_finish()
        MIPS: BMIPS: delay irq enable to ->smp_finish()
        MIPS: Octeon: delay enable irq to ->smp_finish()
        MIPS: Oprofile: Fix build as a module.
        MIPS: BCM63XX: Fix BCM6368 IPSec clock bit
        MIPS: perf: Fix build error caused by unused counters_per_cpu_to_total()
        MIPS: Fix Magic SysRq L kernel crash.
        MIPS: BMIPS: Fix duplicate header inclusion.
        mips: mark const init data with __initconst instead of __initdata
        MIPS: cmpxchg.h: Add missing include
        MIPS: Malta may also be equipped with MIPS64 R2 processors.
        MIPS: Fix typo multipy -> multiply
        MIPS: Cavium: Fix duplicate ARCH_SPARSEMEM_ENABLE in kconfig.
        ...
    • Merge tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm · 93517374
      Linus Torvalds authored
      Pull device-mapper discard fixes from Alasdair G Kergon:
        - avoid a crash in dm-raid1 when discards coincide with mirror
          recovery;
        - avoid discarding shared data that's still needed in dm-thin;
        - don't guarantee that discarded blocks will be wiped in dm-raid1.
      
      * tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
        dm raid1: set discard_zeroes_data_unsupported
        dm thin: do not send discards to shared blocks
        dm raid1: fix crash with mirror recovery and discard
    • Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd · ce9f8d6b
      Linus Torvalds authored
      Pull pnfs/ore fixes from Boaz Harrosh:
       "These are catastrophic fixes to the pnfs objects-layout that were just
        discovered.  They are also destined for @stable.
      
        I have found these and worked on them at around RC1 time but
        unfortunately went to the hospital for kidney stones and had a very
        slow recovery.  I refrained from sending them as is, before proper
        testing, and surely I have found a bug just yesterday.
      
        So now they are all well tested, and have my sign-off.  Other than
        fixing the problem at hand, and assuming there are no bugs in the new
        code, there is low risk to any surrounding code.  And in any case they
        affect only these paths that are now broken.  That is RAID5 in pnfs
        objects-layout code.  It does also affect exofs (which was not broken)
        but I have tested exofs and it is lower priority than objects-layout
        because no one is using exofs, but objects-layout has lots of users."
      
      * 'for-linus' of git://git.open-osd.org/linux-open-osd:
        pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
        pnfs-obj: don't leak objio_state if ore_write/read fails
        ore: Unlock r4w pages in exact reverse order of locking
        ore: Remove support of partial IO request (NFS crash)
        ore: Fix NFS crash by supporting any unaligned RAID IO
    • Merge tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs · 17934162
      Linus Torvalds authored
      Pull UBIFS free space fix-up bugfix from Artem Bityutskiy:
       "It's been reported already twice recently:
      
          http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html
          http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html
      
        and we finally have the fix.  I am quite confident the fix is correct
        because I could reproduce the problem with nandsim and verify the fix.
        It was also verified by Iwo (the reporter).
      
        I am also confident that this is OK to merge the fix so late because
        this patch affects only the fixup functionality, which is not used by
        most users."
      
      * tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs:
        UBIFS: fix a bug in empty space fix-up
    • dm raid1: set discard_zeroes_data_unsupported · 7c8d3a42
      Mikulas Patocka authored
      We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
      the underlying disks support zero on discard.  So this patch sets
      ti->discard_zeroes_data_unsupported.
      
      For example, if the mirror is in the process of resynchronizing, it may
      happen that kcopyd reads a piece of data, then discard is sent on the
      same area and then kcopyd writes the piece of data to another leg.
      Consequently, the data is not zeroed.
      
      The flag was made available by commit 983c7db3
      (dm crypt: always disable discard_zeroes_data).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
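The interleaving described above (kcopyd reads, a discard arrives, kcopyd writes) can be replayed with a toy two-leg mirror. All names below are illustrative and do not correspond to dm-raid1 structures; the discard is modelled as zeroing both legs, which is the best case for the underlying disks.

```c
#include <assert.h>
#include <string.h>

#define REGION_BYTES 8

/* Toy two-leg mirror; not a dm-raid1 data structure. */
struct toy_mirror {
    unsigned char leg[2][REGION_BYTES];
};

/* Model a discard on devices that zero discarded data. */
static void toy_discard(struct toy_mirror *m)
{
    memset(m->leg[0], 0, REGION_BYTES);
    memset(m->leg[1], 0, REGION_BYTES);
}

/* The two halves of a resync copy job. */
static void toy_resync_read(const struct toy_mirror *m, unsigned char *buf)
{
    memcpy(buf, m->leg[0], REGION_BYTES);
}

static void toy_resync_write(struct toy_mirror *m, const unsigned char *buf)
{
    memcpy(m->leg[1], buf, REGION_BYTES);
}

/* Returns 1 if every byte of the given leg reads back as zero. */
static int toy_leg_is_zero(const struct toy_mirror *m, int leg)
{
    int i;

    for (i = 0; i < REGION_BYTES; i++)
        if (m->leg[leg][i] != 0)
            return 0;
    return 1;
}

/* Replay read -> discard -> write; returns 1 if leg 1 ends up zeroed. */
static int toy_race_leaves_zeroes(void)
{
    struct toy_mirror m;
    unsigned char buf[REGION_BYTES];

    memset(m.leg[0], 0xAB, REGION_BYTES);   /* in-sync source leg */
    memset(m.leg[1], 0x00, REGION_BYTES);   /* leg being resynced */

    toy_resync_read(&m, buf);    /* copy job reads the old data */
    toy_discard(&m);             /* discard lands in between */
    toy_resync_write(&m, buf);   /* copy job writes stale data back */

    return toy_leg_is_zero(&m, 1);
}
```

In this model the second leg ends up holding the pre-discard bytes, which is why the target cannot promise zeroed data after a discard.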
    • dm thin: do not send discards to shared blocks · 650d2a06
      Mikulas Patocka authored
      When process_discard receives a partial discard that doesn't cover a
      full block, it sends this discard down to that block. Unfortunately, the
      block can be shared and the discard would corrupt the other snapshots
      sharing this block.
      
      This patch detects block sharing and ends the discard with success when
      sending it to the shared block.
      
      The above change means that if the device supports discard it can't be
      guaranteed that a discard request zeroes data. Therefore, we set
      ti->discard_zeroes_data_unsupported.
      
      Thin target discard support with this bug arrived in commit
      104655fd (dm thin: support discards).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
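The decision this patch describes for a partial discard can be sketched as follows. The types, the reference count field, and the enum values are assumptions made for illustration; they are not dm-thin's real data structures or return codes.

```c
#include <assert.h>

/* Illustrative outcomes; not dm-thin's real return codes. */
enum toy_discard_action {
    TOY_PASS_DOWN,      /* send the discard to the data device */
    TOY_END_SUCCESS     /* complete the request successfully, touch nothing */
};

/* Toy block: a count above 1 means snapshots share this block. */
struct toy_block {
    unsigned int ref_count;
};

/* Decide what to do with a discard that covers only part of a block. */
static enum toy_discard_action
toy_partial_discard(const struct toy_block *b)
{
    if (b->ref_count > 1)
        return TOY_END_SUCCESS;
    return TOY_PASS_DOWN;
}
```

Because the shared case reports success without changing any data, a later read can still return the old contents, which is the reason the commit also sets `ti->discard_zeroes_data_unsupported`.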
    • dm raid1: fix crash with mirror recovery and discard · 751f188d
      Mikulas Patocka authored
      This patch fixes a crash when a discard request is sent during mirror
      recovery.
      
      Firstly, some background.  Generally, the following sequence happens during
      mirror synchronization:
      - function do_recovery is called
      - do_recovery calls dm_rh_recovery_prepare
      - dm_rh_recovery_prepare uses a semaphore to limit the number of
        simultaneously recovered regions (by default the semaphore value is 1,
        so only one region at a time is recovered)
      - dm_rh_recovery_prepare calls __rh_recovery_prepare,
        __rh_recovery_prepare asks the log driver for the next region to
        recover. Then, it sets the region state to DM_RH_RECOVERING. If there
        are no pending I/Os on this region, the region is added to
        quiesced_regions list. If there are pending I/Os, the region is not
        added to any list. It is added to the quiesced_regions list later (by
        dm_rh_dec function) when all I/Os finish.
      - when the region is on quiesced_regions list, there are no I/Os in
        flight on this region. The region is popped from the list in
        dm_rh_recovery_start function. Then, a kcopyd job is started in the
        recover function.
      - when the kcopyd job finishes, recovery_complete is called. It calls
        dm_rh_recovery_end. dm_rh_recovery_end adds the region to
        recovered_regions or failed_recovered_regions list (depending on
        whether the copy operation was successful or not).
      
      The above mechanism assumes that if the region is in DM_RH_RECOVERING
      state, no new I/Os are started on this region. When I/O is started,
      dm_rh_inc_pending is called, which increases reg->pending count. When
      I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
      If the count is zero and the region was in DM_RH_RECOVERING state,
      dm_rh_dec adds it to the quiesced_regions list.
      
      Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
      in DM_RH_RECOVERING state, it could be added to quiesced_regions list
      multiple times or it could be added to this list when kcopyd is copying
      data (it is assumed that the region is not on any list while kcopyd does
      its jobs). This results in memory corruption and crash.
      
      There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
      do not belong to any region, so they are always added to the sync list
      in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
      requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
      requests. These bypasses avoid the crash possibility described above.
      
      These bypasses were improperly implemented for REQ_DISCARD when
      the mirror target gained discard support in commit
      5fc2ffea (dm raid1: support discard).
      
      In do_writes, REQ_DISCARD requests are always added to the sync queue and
      immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
      dm_rh_inc and dm_rh_dec are called for REQ_DISCARD requests.  So it violates the
      rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
      corruption described above.
      
      This patch changes it so that REQ_DISCARD requests follow the same path
      as REQ_FLUSH. This avoids the crash.
      
      Reference: https://bugzilla.redhat.com/837607
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
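The accounting problem described in this message can be modelled with a single counter. The sketch below is a toy: the struct, the `quiesced_adds` counter (standing in for insertions into the quiesced_regions list), and the `bypass` flag (standing in for the REQ_FLUSH path) are all assumptions for illustration, not the dm-region-hash implementation.

```c
#include <assert.h>

enum toy_state { TOY_CLEAN, TOY_RECOVERING };

/* Toy region; 'quiesced_adds' counts insertions into the quiesced list. */
struct toy_region {
    enum toy_state state;
    int pending;
    int quiesced_adds;
};

static void toy_recovery_prepare(struct toy_region *r)
{
    r->state = TOY_RECOVERING;
    if (r->pending == 0)
        r->quiesced_adds++;     /* no I/O in flight: queue it now */
}

static void toy_inc_pending(struct toy_region *r)
{
    r->pending++;
}

static void toy_dec(struct toy_region *r)
{
    if (--r->pending == 0 && r->state == TOY_RECOVERING)
        r->quiesced_adds++;     /* last I/O done: queue for recovery */
}

/* Issue one request; 'bypass' models a path that skips region
 * accounting entirely, as described for REQ_FLUSH. */
static void toy_submit(struct toy_region *r, int bypass)
{
    if (bypass)
        return;
    toy_inc_pending(r);
    toy_dec(r);
}

/* Returns how many times the region was queued when a request arrives
 * while recovery is already in progress. */
static int toy_adds_after_io(int bypass)
{
    struct toy_region r = { TOY_CLEAN, 0, 0 };

    toy_recovery_prepare(&r);   /* queued once; the copy may be running */
    toy_submit(&r, bypass);
    return r.quiesced_adds;
}
```

With accounting enabled the region is queued twice, which in a real linked list means a double insertion; with the bypass it is queued exactly once.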
    • pnfs-obj: Fix __r4w_get_page when offset is beyond i_size · c999ff68
      Boaz Harrosh authored
      It is very common for the end of the file to be unaligned on
      stripe size. But since we know it's beyond file's end then
      the XOR should be performed with all zeros.
      
      Old code used to just read zeros out of the OSD devices, which is a great
      waste. But what scares me more about this situation is that, we now have
      pages attached to the file's mapping that are beyond i_size. I don't
      like the kind of bugs this calls for.
      
      Fix both birds, by returning a global zero_page, if offset is beyond
      i_size.
      
      TODO:
      	Change the API to ->__r4w_get_page() so a NULL can be
      	returned without being considered as error, since XOR API
      	treats NULL entries as zero_pages.
      
      [Bug since 3.2. Should apply the same way to all Kernels since]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • pnfs-obj: don't leak objio_state if ore_write/read fails · 9909d45a
      Boaz Harrosh authored
      [Bug since 3.2 Kernel]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Unlock r4w pages in exact reverse order of locking · 537632e0
      Boaz Harrosh authored
      The read-4-write pages are locked in address ascending order.
      But were unlocked in a way easiest for coding. Fix that,
      locks should be released in opposite order of locking, i.e.
      descending address order.
      
      I have not hit this dead-lock. It was found by inspecting the
      debug print-outs. I suspect there is a higher lock at caller that
      protects us, but fix it regardless.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
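The ordering rule stated above (lock in ascending address order, unlock in descending order) can be illustrated with a small trace recorder. Page indices stand in for page addresses, and every name here is hypothetical rather than taken from the ORE code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PAGES 16

/* Records the order in which page indices were locked and unlocked. */
struct toy_trace {
    size_t locked[TOY_MAX_PAGES];
    size_t unlocked[TOY_MAX_PAGES];
    size_t n_locked;
    size_t n_unlocked;
};

/* Lock pages 0..n-1 in ascending order. */
static void toy_lock_ascending(struct toy_trace *t, size_t n)
{
    size_t i;

    assert(t->n_locked + n <= TOY_MAX_PAGES);
    for (i = 0; i < n; i++)
        t->locked[t->n_locked++] = i;
}

/* Release every held page, last locked first. */
static void toy_unlock_descending(struct toy_trace *t)
{
    size_t i = t->n_locked;

    while (i-- > 0)
        t->unlocked[t->n_unlocked++] = t->locked[i];
}

/* Returns 1 if the unlock order is exactly the reverse of the lock order. */
static int toy_is_reverse(const struct toy_trace *t)
{
    size_t i;

    if (t->n_locked != t->n_unlocked)
        return 0;
    for (i = 0; i < t->n_locked; i++)
        if (t->unlocked[i] != t->locked[t->n_locked - 1 - i])
            return 0;
    return 1;
}
```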
    • ore: Remove support of partial IO request (NFS crash) · 62b62ad8
      Boaz Harrosh authored
      Due to OOM situations the ore might fail to allocate all resources
      needed for IO of the full request. If some progress was possible
      it would proceed with a partial/short request, for the sake of
      forward progress.
      
      Since this crashes NFS-core and exofs is just fine without it just
      remove this contraption, and fail.
      
      TODO:
      	Support real forward progress with some reserved allocations
      	of resources, such as mem pools and/or bio_sets
      
      [Bug since 3.2 Kernel]
      CC: Stable Tree <stable@kernel.org>
      CC: Benny Halevy <bhalevy@tonian.com>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Fix NFS crash by supporting any unaligned RAID IO · 9ff19309
      Boaz Harrosh authored
      In RAID_5/6 we used to not permit an IO whose end
      byte is not stripe_size aligned and spans more than one stripe,
      i.e. the caller must check if after submission the actual
      transferred bytes is shorter, and would need to resubmit
      a new IO with the remainder.
      
      Exofs supports this, and NFS was supposed to support this
      as well with its short write mechanism. But late testing has
      exposed a CRASH when this is used with non-RPC layout-drivers.
      
      The change at NFS is deep and risky, in its place the fix
      at ORE to lift the limitation is actually clean and simple.
      So here it is below.
      
      The principle here is that in the case of unaligned IO on
      both ends, beginning and end, we will send two read requests
      one like old code, before the calculation of the first stripe,
      and also a new site, before the calculation of the last stripe.
      If any "boundary" is aligned or the complete IO is within a single
      stripe, we do a single read like before.
      
      The code is clean and simple by splitting the old _read_4_write
      into 3 even parts:
      1. _read_4_write_first_stripe
      2. _read_4_write_last_stripe
      3. _read_4_write_execute
      
      And calling 1+3 at the same place as before. 2+3 before last
      stripe, and in the case of all in a single stripe then 1+2+3
      is performed additively.
      
      Why did I not think of it before. Well I had a strike of
      genius because I have stared at this code for 2 years, and did
      not find this simple solution, until today. Not that I did not try.
      
      This solution is much better for NFS than the previous supposed
      solution because the short write was dealt with out-of-band after
      IO_done, which would cause for a seeky IO pattern whereas in here
      we execute in order. At both solutions we do 2 separate reads, only
      here we do it within a single IO request. (And actually combine two
      writes into a single submission)
      
      NFS/exofs code need not change since the ORE API communicates the new
      shorter length on return, what will happen is that this case would not
      occur anymore.
      
      hurray!!
      
      [Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
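One way to read the rule described in this message is as a small planning function: a read is needed at each unaligned boundary, and when the whole IO sits inside one stripe the two preparations feed a single read. The sketch below is an interpretation under that assumption. The struct, the function name, and the treatment of a stripe as a flat byte range are hypothetical and are not the ORE code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative plan for the read-4-write step; not an ORE data type. */
struct toy_r4w_plan {
    int read_first;     /* partial first stripe must be read */
    int read_last;      /* partial last stripe must be read */
    int executes;       /* number of read requests submitted */
};

static struct toy_r4w_plan
toy_plan_r4w(uint64_t offset, uint64_t length, uint64_t stripe_size)
{
    struct toy_r4w_plan p = { 0, 0, 0 };
    uint64_t end = offset + length;
    uint64_t first_stripe, last_stripe;

    assert(stripe_size > 0 && length > 0);

    first_stripe = offset / stripe_size;
    last_stripe = (end - 1) / stripe_size;

    p.read_first = (offset % stripe_size) != 0;
    p.read_last = (end % stripe_size) != 0;

    if (first_stripe == last_stripe)
        /* all within one stripe: steps 1+2 feed a single execute */
        p.executes = (p.read_first || p.read_last) ? 1 : 0;
    else
        /* otherwise one execute per unaligned boundary */
        p.executes = p.read_first + p.read_last;

    return p;
}
```

For a 4096-byte stripe, an IO at offset 100 of length 8192 is unaligned at both ends and spans several stripes, so the model plans two reads; the same offset with length 200 stays inside one stripe and plans one.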
    • UBIFS: fix a bug in empty space fix-up · c6727932
      Artem Bityutskiy authored
      UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
      limitations of dumb flasher programs. Namely, of those flashers that are unable
      to skip NAND pages full of 0xFFs while flashing, resulting in empty space at
      the end of half-filled eraseblocks to be unusable for UBIFS. This feature is
      relatively new (introduced in v3.0).
      
      The fix-up routine (fixup_free_space()) is executed only once at the very first
      mount if the superblock has the 'space_fixup' flag set (can be done with -F
      option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
      writes it back to the same LEB. The routine assumes the image is pristine and
      does not have anything in the journal.
      
      There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
      All but one LEB of the log of a pristine file-system are empty. And one
      contains just a commit start node. And 'fixup_free_space()' just unmapped this
      LEB, which resulted in wiping the commit start node. As a result, some users
      were unable to mount the file-system next time with the following symptom:
      
      UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
      UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
      
      The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
      that the beginning of empty space in the log head (c->lhead_offs) was known
      on mount. However, it is not the case - it was always 0. UBIFS does not store
      it in the master node and finds it out by scanning the log on every mount.
      
      The fix is simple - just pass commit start node size instead of 0 to
      'fixup_leb()'.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
      Cc: stable@vger.kernel.org [v3.0+]
      Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Reported-by: James Nute <newten82@gmail.com>
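The effect of passing 0 versus the commit start node size can be shown with a toy LEB. Everything below is an assumption made for illustration: the sizes, the marker byte, and the `toy_` helpers are hypothetical, and the model only mirrors the behaviour the message describes (a length of 0 unmaps the LEB, a non-zero length rewrites that many bytes).

```c
#include <assert.h>
#include <string.h>

#define TOY_LEB_SIZE   64
#define TOY_CS_NODE_SZ 32   /* illustrative size of a commit start node */

struct toy_leb {
    int mapped;
    unsigned char data[TOY_LEB_SIZE];
};

/* Toy fixup: with no used bytes the LEB is unmapped; otherwise the
 * first 'len' bytes are read and written back, the rest becomes 0xFF. */
static void toy_fixup_leb(struct toy_leb *leb, int len)
{
    unsigned char buf[TOY_LEB_SIZE];

    assert(len >= 0 && len <= TOY_LEB_SIZE);
    if (len == 0) {
        leb->mapped = 0;
        memset(leb->data, 0xFF, TOY_LEB_SIZE);
        return;
    }
    memcpy(buf, leb->data, (size_t)len);
    memset(leb->data, 0xFF, TOY_LEB_SIZE);
    memcpy(leb->data, buf, (size_t)len);
}

/* Build a pristine log LEB holding only a commit start node. */
static struct toy_leb toy_pristine_log_leb(void)
{
    struct toy_leb leb;

    leb.mapped = 1;
    memset(leb.data, 0xFF, TOY_LEB_SIZE);
    memset(leb.data, 0xC5, TOY_CS_NODE_SZ);  /* stand-in CS node bytes */
    return leb;
}

/* Returns 1 if the LEB is mapped and still starts with the CS marker. */
static int toy_has_cs_node(const struct toy_leb *leb)
{
    return leb->mapped && leb->data[0] == 0xC5;
}
```

In this model, fixing up the log LEB with a length of 0 loses the commit start node, while passing `TOY_CS_NODE_SZ` preserves it, matching the one-line fix described above.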
  6. 19 Jul, 2012 3 commits