1. 26 Sep, 2014 40 commits
    • tpm: Provide a generic means to override the chip returned timeouts · 6b2a55ea
      Jason Gunthorpe authored
      Some Atmel TPMs provide completely wrong timeouts from their
      TPM_CAP_PROP_TIS_TIMEOUT query. This patch detects that and returns
      new correct values via a DID/VID table in the TIS driver.
      
      Tested on ARM using an AT97SC3204T FW version 37.16
      
      Cc: <stable@vger.kernel.org>
      [PHuewe: without this fix these 'broken' Atmel TPMs won't function on
      older kernels]
      Signed-off-by: "Berg, Christopher" <Christopher.Berg@atmel.com>
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
      
      (cherry picked from commit 8e54caf4)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      6b2a55ea
    • vfs: avoid non-forwarding large load after small store in path lookup · 8aaa881c
      Linus Torvalds authored
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't; Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor store to load forwarding, causing
      an unnecessary hiccup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
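
      The store-forwarding fix described above can be illustrated with a
      minimal userspace sketch. It assumes a simplified stand-in for
      'struct qstr' in which the hash and length live in one 64-bit
      hash_len field; the names qstr_sketch, make_hash_len and
      hash_name_sketch, and the toy hash itself, are hypothetical and are
      not the kernel's code. The point is only that the function fills the
      combined field with a single 64-bit store, so a later 64-bit load of
      hash_len matches the size of the store that produced it.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct qstr (hypothetical). */
struct qstr_sketch {
	uint64_t hash_len;           /* length in the high 32 bits, hash in the low 32 */
	const unsigned char *name;
};

/* Hypothetical helper: pack hash and length into one 64-bit value. */
static uint64_t make_hash_len(uint32_t hash, uint32_t len)
{
	return ((uint64_t)len << 32) | hash;
}

static uint32_t qstr_hash(const struct qstr_sketch *q)
{
	return (uint32_t)q->hash_len;
}

static uint32_t qstr_len(const struct qstr_sketch *q)
{
	return (uint32_t)(q->hash_len >> 32);
}

/*
 * Toy component hash (not the kernel's algorithm): hashes up to the next
 * '/' or the end of the string and fills in the whole structure.
 */
static void hash_name_sketch(const char *name, struct qstr_sketch *q)
{
	uint32_t hash = 0;
	uint32_t len = 0;

	while (name[len] != '\0' && name[len] != '/') {
		hash = hash * 31 + (unsigned char)name[len];
		len++;
	}

	q->name = (const unsigned char *)name;
	/* One 64-bit store instead of two separate 32-bit stores. */
	q->hash_len = make_hash_len(hash, len);
}
```

      Calling hash_name_sketch("usr/bin", &q) leaves qstr_len(&q) at 3, the
      length of the first path component.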
      
      Merge branch 'parisc-3.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
      
      Pull parisc updates from Helge Deller:
       "The most important patch is a new Light Weight Syscall (LWS) for 8,
        16, 32 and 64 bit atomic CAS operations which is required in order to
        be able to implement the atomic gcc builtins on our platform.
      
        Other than that, we wire up the seccomp, getrandom and memfd_create
        syscalls, and fix a minor off-by-one bug and a wrong printk string"
      
      * 'parisc-3.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Implement new LWS CAS supporting 64 bit operations.
        parisc: Wire up seccomp, getrandom and memfd_create syscalls
        parisc: dino: fix %d confusingly prefixed with 0x in format string
        parisc: sys_hpux: NUL terminator is one past the end
      
      Merge tag 'ntb-3.17' of git://github.com/jonmason/ntb
      
      Pull ntb driver bugfixes from Jon Mason:
       "NTB driver fixes for queue spread and buffer alignment.  Also, update
        to MAINTAINERS to reflect new e-mail address"
      
      * tag 'ntb-3.17' of git://github.com/jonmason/ntb:
        ntb: Add alignment check to meet hardware requirement
        MAINTAINERS: update NTB info
        NTB: correct the spread of queues over mw's
      
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
      
      Pull ARM irq chip fixes from Thomas Gleixner:
       "Another pile of ARM specific irq chip fixlets:
      
         - off by one bugs in the crossbar driver
         - missing annotations
         - a bunch of "make it compile" updates
      
        I pulled the lot today from Jason, but it has been in -next for at
        least a week"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip: gic-v3: Declare rdist as __percpu pointer to __iomem pointer
        irqchip: gic: Make gic_default_routable_irq_domain_ops static
        irqchip: exynos-combiner: Fix compilation error on ARM64
        irqchip: crossbar: Off by one bugs in init
        irqchip: gic-v3: Tag all low level accessors __maybe_unused
        irqchip: gic-v3: Only define gic_peek_irq() when building SMP
      
      Merge tag 'irqchip-urgent-3.17' of git://git.infradead.org/users/jcooper/linux into irq/urgent
      
      irqchip fixes for v3.17 from Jason Cooper
      
       - GIC/GICV3: Various fixlets
       - crossbar: Fix off-by-one bug
       - exynos-combiner: Fix arm64 build error
      
      ntb: Add alignment check to meet hardware requirement
      
      The value of the NTB translate register must be BAR-size aligned.
      This alignment check makes sure that the DMA memory allocated has the
      proper alignment. Another requirement for NTB to function properly with
      a memory window BAR size greater than or equal to 4M is to use the CMA
      feature in the 3.16 kernel with the appropriate CONFIG_CMA_ALIGNMENT and
      CONFIG_CMA_SIZE_MBYTES set.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
      
      MAINTAINERS: update NTB info
      
      Update my contact info to my personal email address and add Dave Jiang.
      Signed-off-by: Jon Mason <jon.mason@intel.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      
      NTB: correct the spread of queues over mw's
      
      The detection of an uneven number of queues on the given memory windows
      was not correct.  The mw_num is zero-based, and the modulo should be a
      division to spread the queues evenly over the mw's.
      Signed-off-by: Jon Mason <jon.mason@intel.com>
      
      Merge branches 'locking-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
      
      Pull futex and timer fixes from Thomas Gleixner:
       "A oneliner bugfix for the jinxed futex code:
      
         - Drop hash bucket lock in the error exit path.  I really could slap
           myself for introducing that bug while fixing all the other horror
           in that code three months ago ...
      
        and the timer department is not too proud about the following fixes:
      
         - Deal with a long standing rounding bug in the timeval to jiffies
           conversion.  It's a real issue and this fix fell through the cracks
           for quite some time.
      
         - Another round of alarmtimer fixes.  Finally this code gets used
           more widely and the subtle issues hidden for quite some time are
           noticed and fixed.  Nothing really exciting, just the itty bitty
           details which bite the serious users here and there"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        futex: Unlock hb->lock in futex_wait_requeue_pi() error path
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        alarmtimer: Lock k_itimer during timer callback
        alarmtimer: Do not signal SIGEV_NONE timers
        alarmtimer: Return relative times in timer_gettime
        jiffies: Fix timeval conversion to jiffies
      
      parisc: Implement new LWS CAS supporting 64 bit operations.
      
      The current LWS cas only works correctly for 32bit. The new LWS allows
      for CAS operations of variable size.
      Signed-off-by: Guy Martin <gmsoft@tuxicoman.be>
      Cc: <stable@vger.kernel.org> # 3.13+
      Signed-off-by: Helge Deller <deller@gmx.de>
      
      vfs: fix bad hashing of dentries
      
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into an integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: Josef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
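
      The down-mixing problem in the dentry hashing commit above can be
      sketched in userspace. The example below contrasts folding a 64-bit
      value by adding its two 32-bit halves with a multiply-and-shift mix.
      The function names and the constant are assumptions chosen for
      illustration; this is a stand-in for the idea behind the
      "hash_32|64()" helpers, not a copy of the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Simplistic down-mix: add the low and high 32-bit halves together. */
static uint32_t fold_add(uint64_t h)
{
	return (uint32_t)h + (uint32_t)(h >> 32);
}

/* Assumed odd 64-bit multiplier (golden-ratio style), for illustration only. */
#define MIX_CONST_64 0x9e3779b97f4a7c15ULL

/* Multiply-and-shift mix: every input bit can influence the kept high bits. */
static uint32_t fold_mul(uint64_t h)
{
	return (uint32_t)((h * MIX_CONST_64) >> 32);
}
```

      With the additive fold, two values whose halves are merely swapped,
      such as 0x0000000100000002 and 0x0000000200000001, collapse to the
      same 32-bit result, while the multiplicative mix keeps them apart.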
      
      alarmtimer: Lock k_itimer during timer callback
      
      Locks the k_itimer's it_lock member when handling the alarm timer's
      expiry callback.
      
      The regular posix timers defined in posix-timers.c have this lock held
      during timeout processing because their callbacks are routed through
      posix_timer_fn().  The alarm timers follow a different path, so they
      ought to grab the lock somewhere else.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      
      alarmtimer: Do not signal SIGEV_NONE timers
      
      Avoids sending a signal to alarm timers created with sigev_notify set to
      SIGEV_NONE by checking for that special case in the timeout callback.
      
      The regular posix timers avoid sending signals to SIGEV_NONE timers by
      not scheduling any callbacks for them in the first place.  Although it
      would be possible to do something similar for alarm timers, it's simpler
      to handle this as a special case in the timeout.
      
      Prior to this patch, the alarm timer would ignore the sigev_notify value
      and try to deliver signals to the process anyway.  Even worse, the
      sanity check for the value of sigev_signo is skipped when SIGEV_NONE is
      specified, so the signal number could be bogus.  If sigev_signo was an
      uninitialized value (as it often would be if SIGEV_NONE is used), then
      it's hard to predict which signal will be sent.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      
      alarmtimer: Return relative times in timer_gettime
      
      Returns the time remaining for an alarm timer, rather than the time at
      which it is scheduled to expire.  If the timer has already expired or it
      is not currently scheduled, the it_value's members are set to zero.
      
      This new behavior matches that of the other posix-timers and the POSIX
      specifications.
      
      This is a change in user-visible behavior, and may break existing
      applications.  Hopefully, few users rely on the old incorrect behavior.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      [jstultz: minor style tweak]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
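
      The relative-time behavior described in the timer_gettime commit
      above can be sketched as a small pure function. The helper name
      remaining_time is hypothetical, and the code only illustrates the
      arithmetic (expiry minus now, clamped to zero); it is not the
      alarmtimer implementation.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: time left until `expiry` as seen at `now`.
 * Returns {0, 0} if the timer has already expired.
 */
static struct timespec remaining_time(struct timespec now, struct timespec expiry)
{
	struct timespec rem = {0, 0};

	if (expiry.tv_sec < now.tv_sec ||
	    (expiry.tv_sec == now.tv_sec && expiry.tv_nsec <= now.tv_nsec))
		return rem;                 /* expired: report zero, not a past time */

	rem.tv_sec = expiry.tv_sec - now.tv_sec;
	rem.tv_nsec = expiry.tv_nsec - now.tv_nsec;
	if (rem.tv_nsec < 0) {          /* borrow one second */
		rem.tv_sec -= 1;
		rem.tv_nsec += 1000000000L;
	}
	return rem;
}
```

      For now = {10 s, 900000000 ns} and expiry = {12 s, 100000000 ns} the
      helper returns {1 s, 200000000 ns}, which is the kind of value a
      caller of timer_gettime expects in it_value.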
      
      jiffies: Fix timeval conversion to jiffies
      
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly when repeatedly stopping and starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
      x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
      
      Tested: the following program:
      
      #include <stddef.h>
      #include <stdio.h>
      #include <sys/time.h>

      int main(void) {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
      }
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
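
      The rounding error described in the jiffies commit above can be
      reproduced with plain integer arithmetic. The sketch below assumes
      HZ=1000, so TICK_NSEC is taken to be 1000000, and assumes a scaling
      shift of 40 for USEC_JIFFIE_SC; both values are assumptions made for
      this example rather than values quoted in the commit. The "new"
      function is a simplification that rounds up with an exact division
      in nanoseconds instead of the kernel's scaled multiplication.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NSEC      1000000ULL   /* assumed: HZ=1000, one tick = 1 ms */
#define NSEC_PER_USEC  1000ULL
#define USEC_JIFFIE_SC 40           /* assumed scaling shift for HZ=1000 */

/* Best approximation of NSEC_PER_USEC/TICK_NSEC, rounded up, scaled by 2^SC. */
#define USEC_CONVERSION \
	(((NSEC_PER_USEC << USEC_JIFFIE_SC) + TICK_NSEC - 1) / TICK_NSEC)
/* Rounding term added before the shift in the old code path. */
#define USEC_ROUND ((1ULL << USEC_JIFFIE_SC) - 1)

/* Old behavior: round up in the scaled intermediate form. */
static uint64_t usec_to_jiffies_old(uint64_t usec)
{
	return (usec * USEC_CONVERSION + USEC_ROUND) >> USEC_JIFFIE_SC;
}

/* Simplified fix: convert to nanoseconds first, then round up exactly. */
static uint64_t usec_to_jiffies_new(uint64_t usec)
{
	uint64_t nsec = usec * NSEC_PER_USEC;

	return (nsec + TICK_NSEC - 1) / TICK_NSEC;
}
```

      Under these assumptions usec_to_jiffies_old(10000) evaluates to 11,
      matching the symptom reported in the commit, while
      usec_to_jiffies_new(10000) evaluates to 10. An inexact input such as
      10001 usec is rounded up to 11 by both versions.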
      
      futex: Unlock hb->lock in futex_wait_requeue_pi() error path
      
      futex_wait_requeue_pi() calls futex_wait_setup(). If
      futex_wait_setup() succeeds it returns with hb->lock held and
      preemption disabled. Now the sanity check after this does:
      
              if (match_futex(&q.key, &key2)) {
                      ret = -EINVAL;
                      goto out_put_keys;
              }
      
      which releases the keys but does not release hb->lock.
      
      So we happily return to user space with hb->lock held and therefore
      preemption disabled.
      
      Unlock hb->lock before taking the exit route.
      Reported-by: Dave "Trinity" Jones <davej@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
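
      The shape of the futex error-path fix above can be shown with a toy
      userspace model. Everything in this sketch is hypothetical: toy_lock
      stands in for hb->lock, integer keys stand in for futex keys, and
      wait_requeue_sketch only mimics the control flow. It illustrates the
      rule that a lock taken by the setup step must be released before
      taking an early exit.

```c
#include <assert.h>
#include <errno.h>

/* Toy lock that records whether it is held (hypothetical, single-threaded). */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Mimics the control flow only: setup takes the lock, then keys are checked. */
static int wait_requeue_sketch(struct toy_lock *hb_lock, int key1, int key2)
{
	int ret = 0;

	toy_lock_acquire(hb_lock);      /* setup succeeded: lock is now held */

	if (key1 == key2) {             /* sanity check fails */
		toy_lock_release(hb_lock);  /* the fix: unlock before the error exit */
		ret = -EINVAL;
		goto out;
	}

	/* ... the normal path would queue and sleep here ... */
	toy_lock_release(hb_lock);
out:
	return ret;
}
```

      After a call with matching keys the function returns -EINVAL and the
      lock is no longer held, which is the property the fix restores.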
      
      irqchip: gic-v3: Declare rdist as __percpu pointer to __iomem pointer
      
      The __percpu __iomem annotations on the rdist base are contradictory
      and confuse static checkers such as sparse.
      
      This patch fixes the annotations so that rdist is described as a __percpu
      pointer to an __iomem pointer.
      
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/1409062410-25891-9-git-send-email-will.deacon@arm.com
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      
      irqchip: gic: Make gic_default_routable_irq_domain_ops static
      
      The internal irq domain ops for the GIC are not used directly anywhere
      else, so make them static. This gets rid of a sparse warning on the
      file.
      
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/1409062410-25891-8-git-send-email-will.deacon@arm.com
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      
      irqchip: exynos-combiner: Fix compilation error on ARM64
      
      The following compilation error occurs on 64-bit Exynos7 SoC:
      
      drivers/irqchip/exynos-combiner.c: In function ‘combiner_irq_domain_map’:
      drivers/irqchip/exynos-combiner.c:162:2: error: implicit declaration of function ‘set_irq_flags’ [-Werror=implicit-function-declaration]
        set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
        ^
      drivers/irqchip/exynos-combiner.c:162:21: error: ‘IRQF_VALID’ undeclared (first use in this function)
        set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
                           ^
      drivers/irqchip/exynos-combiner.c:162:21: note: each undeclared identifier is reported only once for each function it appears in
      drivers/irqchip/exynos-combiner.c:162:34: error: ‘IRQF_PROBE’ undeclared (first use in this function)
        set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
      
      Fix the build error by including linux/interrupt.h.
      Signed-off-by: Naveen Krishna Chatradhi <ch.naveen@samsung.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Link: https://lkml.kernel.org/r/1409722329-18309-1-git-send-email-ch.naveen@samsung.com
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      
      parisc: Wire up seccomp, getrandom and memfd_create syscalls
      
      With secure computing we only support the SECCOMP_MODE_STRICT mode for
      now.
      Signed-off-by: Helge Deller <deller@gmx.de>
      
      parisc: dino: fix %d confusingly prefixed with 0x in format string
      Signed-off-by: Hans Wennborg <hans@hanshq.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      
      parisc: sys_hpux: NUL terminator is one past the end
      
      We allocate "len" number of chars so we should put the NUL at "len - 1"
      to avoid corrupting memory.  Btw, strlen_user() is different from the
      normal strlen() function because it includes the NUL terminator in the
      count.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Helge Deller <deller@gmx.de>
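
      The indexing rule from the sys_hpux commit above can be illustrated
      in userspace. In this sketch len_with_nul is a hypothetical analogue
      of a length that already counts the terminator, and
      copy_string_sketch is an invented helper; neither is kernel code.
      With a buffer of "len" chars, the last valid index is "len - 1".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of a length that includes the NUL terminator. */
static size_t len_with_nul(const char *s)
{
	return strlen(s) + 1;
}

/* Copies src into a freshly allocated buffer of exactly `len` chars. */
static char *copy_string_sketch(const char *src)
{
	size_t len = len_with_nul(src);
	char *buf = malloc(len);

	if (buf == NULL)
		return NULL;

	memcpy(buf, src, len - 1);
	buf[len - 1] = '\0';    /* last valid index; buf[len] would be out of bounds */
	return buf;
}
```

      The caller owns the returned buffer and is expected to free() it.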
      
      irqchip: crossbar: Off by one bugs in init
      
      My static checker complains that the ">" should be ">=" or else we go
      beyond the end of the cb->irq_map[] array on the next line.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
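
      The off-by-one in the crossbar commit above comes down to which
      comparison guards an array index. The sketch below uses invented
      names (irq_map, IRQ_MAP_SIZE, IRQ_RESERVED, reserve_entry) and is
      not the driver's code; it only shows why the check has to be ">=".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IRQ_MAP_SIZE 8          /* hypothetical table size */
#define IRQ_RESERVED (-1)       /* hypothetical marker value */

static int irq_map[IRQ_MAP_SIZE];

/* Marks one entry as reserved; rejects indexes outside [0, max). */
static bool reserve_entry(size_t entry, size_t max)
{
	if (entry >= max)           /* ">" would let entry == max through */
		return false;

	irq_map[entry] = IRQ_RESERVED;
	return true;
}
```

      With max equal to IRQ_MAP_SIZE, index 7 is accepted and index 8 is
      rejected; a ">" comparison would have written one element past the
      end of the array for index 8.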
      
      irqchip: gic-v3: Tag all low level accessors __maybe_unused
      
      This is only really needed for gic_write_sgi1r in the !SMP case since it
      is only referenced in the SMP initialisation code but it seems better to
      have these functions all next to each other and declared consistently.
      Signed-off-by: Mark Brown <broonie@linaro.org>
      Link: https://lkml.kernel.org/r/1406748194-21094-1-git-send-email-broonie@kernel.org
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      
      irqchip: gic-v3: Only define gic_peek_irq() when building SMP
      
      If building with CONFIG_SMP disabled (for example, with allnoconfig) then
      GCC complains that the static function gic_peek_irq() is defined but not
      used since the only reference is in the SMP initialisation code. Fix this
      by moving the function definition inside the ifdef.
      Signed-off-by: Mark Brown <broonie@linaro.org>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/1406480224-24628-1-git-send-email-broonie@kernel.org
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      
      (cherry picked from commit 9226b5b4
      99d263d4)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      8aaa881c
    • dcache.c: get rid of pointless macros · 145fce8d
      Al Viro authored
      D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()).
      At this point they are actively misleading - they imply that values are
      compiler constants, which is no longer true.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      
      (cherry picked from commit 482db906)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      145fce8d
    • blkcg: don't call into policy draining if root_blkg is already gone · 3f2c76f9
      Tejun Heo authored
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skipping the call into policy draining if all the blkgs are
      already gone.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Jet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
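
      The guard described in the blkcg commit above can be sketched with
      toy types. The structures and function names below (toy_blkg,
      toy_queue, policy_drain, drain_queue_sketch) are hypothetical
      stand-ins, not the block layer's definitions; the sketch only shows
      the early return when the root group pointer has already been set
      to NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for a block group and a request queue (hypothetical). */
struct toy_blkg {
	int drained;
};

struct toy_queue {
	struct toy_blkg *root_blkg;   /* NULL once all groups are destroyed */
};

/* Policy-level drain: dereferences root_blkg, so it must not be NULL. */
static void policy_drain(struct toy_queue *q)
{
	q->root_blkg->drained++;
}

/* Returns 1 if policy draining ran, 0 if it was skipped. */
static int drain_queue_sketch(struct toy_queue *q)
{
	/* The queue may be exiting with all groups gone: skip policy draining. */
	if (q->root_blkg == NULL)
		return 0;

	policy_drain(q);
	return 1;
}
```

      Without the NULL check, the second case would dereference a NULL
      pointer inside policy_drain, which mirrors the oops in the trace.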
      
      Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
      
      This reverts commit 254c4407.
      
      It causes crashes with cryptsetup, even after a few iterations and
      updates. Drop it for now.
      
      blkcg: don't call into policy draining if root_blkg is already gone
      
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skipping policy draining if all the blkgs are already
      gone.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Jet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
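
The guard described in this fix can be sketched in userspace C. This is a simplified model, not the kernel patch: the two structs are stripped-down stand-ins, and `policy_drain()` and `drain_queue()` are hypothetical names for the roles played by blk_throtl_drain() and blkcg_drain_queue().

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct blkcg_gq {
    int nr_queued;
};

struct request_queue {
    struct blkcg_gq *root_blkg;  /* NULL once all blkgs are destroyed */
    int policy_drain_calls;      /* counts calls into policy draining */
};

/* Stand-in for policy draining: dereferences the root blkg. */
static void policy_drain(struct request_queue *q)
{
    q->root_blkg->nr_queued = 0;
    q->policy_drain_calls++;
}

/* Skip policy draining when the blkgs are already gone. */
static void drain_queue(struct request_queue *q)
{
    if (!q->root_blkg)
        return;
    policy_drain(q);
}
```

Without the NULL check, a drain that starts after the blkgs were destroyed would dereference a NULL root_blkg, which is the oops shown above.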
      
      bio: modify __bio_add_page() to accept pages that don't start a new segment
      
      The original behaviour is to refuse to add a new page if the maximum
      number of segments has been reached, regardless of whether the page
      being added can be merged into the last segment.
      
      Unfortunately, when the system runs under heavy memory fragmentation
      conditions, a driver may try to add multiple pages to the last segment.
      The original code won't accept them and EBUSY will be reported to
      userspace.
      
      This patch modifies the function so it refuses to add a page only in case
      the latter starts a new segment and the maximum number of segments has
      already been reached.
      
      The bug can be easily reproduced with the st driver:
      
      1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE  to 16
      2) modprobe st buffer_kbs=1024
      3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
         dd: error writing `/dev/st0': Device or resource busy
      
      [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
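
The acceptance rule this patch describes can be modelled as below. It is a sketch under assumptions: `struct bio_model` and `bio_model_add_page()` are hypothetical, the bio is reduced to a segment counter, and the caller says whether the page merges into the last segment (the kernel derives that by recounting segments).

```c
#include <stdbool.h>

/* Hypothetical model: a bio tracked only by its segment count. */
struct bio_model {
    unsigned int nr_segments;
    unsigned int max_segments;
};

/*
 * Returns true if the page is accepted.  'mergeable' says whether the
 * page can be merged into the last existing segment.
 */
static bool bio_model_add_page(struct bio_model *bio, bool mergeable)
{
    if (mergeable && bio->nr_segments > 0)
        return true;              /* extends the last segment */
    if (bio->nr_segments >= bio->max_segments)
        return false;             /* would start one segment too many */
    bio->nr_segments++;           /* starts a new segment */
    return true;
}
```

The old behaviour corresponds to checking the segment limit first, which rejects mergeable pages once the limit is reached.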
      
      block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
      
      The SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access the
      reserved buffer size in bytes as an int.  The value needs to be capped
      at the request queue's max_sectors, but integer overflow is not handled
      correctly when converting max_sectors from sectors to bytes.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Douglas Gilbert <dgilbert@interlog.com>
      Cc: linux-scsi@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
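
One way to do the sectors-to-bytes conversion without overflowing an int is to clamp the sector count before shifting. The helper below is a userspace sketch; its name and its plain unsigned int parameter are assumptions, and it takes sectors to be 512 bytes.

```c
#include <limits.h>

/*
 * Convert a sector count (512-byte units) to bytes, capped so that the
 * result always fits in an int.
 */
static int max_sectors_bytes(unsigned int max_sectors)
{
    if (max_sectors > (unsigned int)(INT_MAX >> 9))
        max_sectors = INT_MAX >> 9;
    return (int)(max_sectors << 9);
}
```

Clamping before the shift matters: shifting a huge max_sectors first and capping afterwards would already have lost the high bits.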
      
      block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
      
      The BLKSECTGET ioctl stores the request queue's max_sectors through the
      argument pointer as an unsigned short.  So if max_sectors is greater
      than USHRT_MAX, its upper 16 bits are silently discarded.

      In that case, returning USHRT_MAX is preferable to returning the lower
      16 bits of max_sectors.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Douglas Gilbert <dgilbert@interlog.com>
      Cc: linux-scsi@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
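
The difference between truncating and saturating can be shown with a small helper. `blksectget_value()` is a hypothetical userspace name; it only models the value the ioctl would report.

```c
#include <limits.h>

/* Saturate max_sectors to what an unsigned short can represent. */
static unsigned short blksectget_value(unsigned int max_sectors)
{
    if (max_sectors > USHRT_MAX)
        return USHRT_MAX;
    return (unsigned short)max_sectors;
}
```

With plain truncation, a max_sectors of USHRT_MAX + 1 would be reported as 0; with saturation it is reported as USHRT_MAX.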
      
      block/partitions/efi.c: kerneldoc fixing
      
      Adding function documentation and fixing kerneldoc warnings
      ('field: description' uniformization).
      
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/msdos.c: code clean-up
      
      checkpatch fixing:
      WARNING: Missing a blank line after declarations
      WARNING: space prohibited between function name and open parenthesis '('
      ERROR: spaces required around that '<' (ctx:VxV)
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/amiga.c: replace nolevel printk by pr_err
      
      Also add a prefix-less pr_fmt so that a future change to the default
      format does not alter the output.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/aix.c: replace count*size kzalloc by kcalloc
      
      kcalloc manages count*sizeof overflow.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
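
The overflow check that an array allocator performs can be illustrated in userspace. `checked_calloc()` is a hypothetical analog built on malloc(), not the kernel implementation of kcalloc().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analog of a zeroing array allocator: fail instead of
 * allocating a short buffer when n * size would wrap around.
 */
static void *checked_calloc(size_t n, size_t size)
{
    void *p;

    if (size != 0 && n > SIZE_MAX / size)
        return NULL;              /* n * size would overflow */
    p = malloc(n * size);
    if (p)
        memset(p, 0, n * size);
    return p;
}
```

An open-coded multiplication in a zeroing allocation has no such check, so an oversized count can silently produce an undersized buffer.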
      
      bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
      
      Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
      Martin introduced the function bip_integrity_vecs() (which returns the
      number of usable vectors) to fix the nr_vecs issue for inline integrity
      vectors reported by David Milburn.

      But it seems that bip_integrity_vecs() returns the wrong number if the
      bio is not based on any bio_set for some reason (bio->bi_pool == NULL),
      because in that case bip_inline_vecs[0] is malloced directly.  So
      here we add bip_max_vcnt to record the count of vector slots, and
      clean up the function bip_integrity_vecs().
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: use percpu_ref for mq usage count
      
      Currently, blk-mq uses a percpu_counter to keep track of how many
      usages are in flight.  The percpu_counter is drained while freezing to
      ensure that no usage is left in-flight after freezing is complete.
      blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
      per-cpu gating mechanism.
      
      This type of code has a relatively high chance of subtle bugs which are
      extremely difficult to trigger and it's way too hairy to be open coded
      in blk-mq.  percpu_ref can serve the same purpose after the recent
      changes.  This patch replaces the open-coded per-cpu usage counting
      and draining mechanism with percpu_ref.
      
      blk_mq_queue_enter() performs tryget_live on the ref and exit()
      performs put.  blk_mq_freeze_queue() kills the ref and waits until the
      reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
      and wakes up the waiters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
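
The enter/exit/freeze/unfreeze protocol described above can be modelled with a single counter. This is a single-threaded sketch only: `struct gate` and the `gate_*()` helpers are hypothetical, and the real percpu_ref uses per-cpu counters and RCU rather than one plain variable.

```c
#include <stdbool.h>

/* Hypothetical single-counter model of the gate. */
struct gate {
    long count;  /* in-flight users plus one base reference while alive */
    bool dead;   /* set by freeze, cleared by unfreeze */
};

static void gate_init(struct gate *g)
{
    g->count = 1;
    g->dead = false;
}

/* Models enter: a "tryget live" that fails once the gate is killed. */
static bool gate_enter(struct gate *g)
{
    if (g->dead)
        return false;
    g->count++;
    return true;
}

/* Models exit: put the reference taken by enter. */
static void gate_exit(struct gate *g)
{
    g->count--;
}

/* Models the kill step of freezing: drop the base reference. */
static void gate_freeze(struct gate *g)
{
    g->dead = true;
    g->count--;
}

/* The freezer waits until this becomes true. */
static bool gate_drained(const struct gate *g)
{
    return g->count == 0;
}

/* Models unfreeze: revive the reference so enter succeeds again. */
static void gate_unfreeze(struct gate *g)
{
    g->count = 1;
    g->dead = false;
}
```

The base reference is what lets the count reach zero only after a freeze: while the gate is alive the count never drops below one.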
      
      blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
      
      Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
      anything and it's gonna be further simplified.  Let's flatten it into
      its caller.
      
      This patch doesn't make any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: decouple blk-mq freezing from generic bypassing
      
      blk_mq freezing is entangled with generic bypassing which bypasses
      blkcg and io scheduler and lets IO requests fall through the block
      layer to the drivers in FIFO order.  This allows forward progress on
      IOs with the advanced features disabled so that those features can be
      configured or altered without worrying about stalling IO which may
      lead to deadlock through memory allocation.
      
      However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
      doesn't make use of blkcg or ioscheds and it maps bypassing to
      freezing, which blocks request processing and drains all the in-flight
      ones.  This causes problems as bypassing assumes that request
      processing is online.  blk-mq works around this by conditionally
      allowing request processing for the problem case - during queue
      initialization.
      
      Another weirdity is that except for during queue cleanup, bypassing
      started on the generic side prevents blk-mq from processing new
      requests but doesn't drain the in-flight ones.  This shouldn't break
      anything but again highlights that something isn't quite right here.
      
      The root cause is conflating blk-mq freezing and generic bypassing
      which are two different mechanisms.  The only intersecting purpose
      that they serve is during queue cleanup.  Let's properly separate
      blk-mq freezing from generic bypassing and simply use it where
      necessary.
      
      * request_queue->mq_freeze_depth is added and
        blk_mq_[un]freeze_queue() now operate on this counter instead of
        ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
        but the counter is tested directly.  This will be further updated by
        later changes.
      
      * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
        blk_mq_freeze_queue().  Queue cleanup path now calls
        blk_mq_freeze_queue() directly.
      
      * blk_queue_enter()'s fast path condition is simplified to simply
        check @q->mq_freeze_depth.  Previously, the condition was
      
      	!blk_queue_dying(q) &&
      	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))
      
        mq_freeze_depth is incremented right after dying is set and
        blk_queue_init_done() exception isn't necessary as blk-mq doesn't
        start frozen, which only leaves the blk_queue_bypass() test which
        can be replaced by @q->mq_freeze_depth test.
      
      This change simplifies the code and reduces confusion in the area.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
      
      Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
      skip queue draining if bypass_depth was already above zero.  The
      assumption is that the one which bumped the bypass_depth should have
      performed draining already; however, there's nothing which prevents a
      new instance of bypassing/freezing from starting before the previous
      one finishes draining.  The current code may allow the later
      bypassing/freezing instances to complete while there still are
      in-flight requests which haven't finished draining.
      
      Fix it by draining regardless of bypass_depth.  We still skip draining
      from blk_queue_bypass_start() while the queue is initializing to avoid
      introducing excessive delays during boot.  INIT_DONE setting is moved
      above the initial blk_queue_bypass_end() so that bypassing attempts
      can't slip in between.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
      
      blk-mq uses a percpu_counter to keep track of how many usages are in
      flight.  The percpu_counter is drained while freezing to ensure that
      no usage is left in-flight after freezing is complete.
      
      blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
      per-cpu gating mechanism; unfortunately, it contains a subtle bug -
      smp_wmb() in blk_mq_queue_enter() doesn't prevent the cpu from
      fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
      if freezing happens in between the caller can slip through and freezing
      can be complete while there are active users.

      Use smp_mb() instead so that bypass_depth and mq_usage_counter
      modifications and tests are properly interlocked.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
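
The store-then-load pattern at issue can be written with C11 atomics as a rough userspace analogy. `struct mq_gate` and `mq_enter()` are hypothetical, and the seq_cst fence stands in for the full barrier; a barrier that only orders stores against stores would not order the increment before the later load.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: one usage counter and one freeze depth. */
struct mq_gate {
    atomic_long usage;
    atomic_int freeze_depth;
};

/* Returns true if the caller may proceed, false if the queue is frozen. */
static bool mq_enter(struct mq_gate *g)
{
    /* Announce ourselves first ... */
    atomic_fetch_add_explicit(&g->usage, 1, memory_order_relaxed);

    /* ... then a full fence so the load below cannot be hoisted above it. */
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load_explicit(&g->freeze_depth, memory_order_relaxed) == 0)
        return true;

    /* Frozen: back out the usage count. */
    atomic_fetch_sub_explicit(&g->usage, 1, memory_order_relaxed);
    return false;
}
```

The freezing side would need the mirror-image ordering (raise the depth, full fence, then read the usage count) for the two sides to interlock.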
      
      Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.17/core
      
      Merge the percpu_ref changes from Tejun; he says they are stable now.
      
      percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
      
      Now that explicit invocation of percpu_ref_exit() is necessary to free
      the percpu counter, we can implement percpu_ref_reinit() which
      reinitializes a released percpu_ref.  This can be used to implement a
      scalable gating switch which can be drained and then re-opened without
      worrying about memory allocation failures.
      
      percpu_ref_is_zero() is added to be used in a sanity check in
      percpu_ref_exit().  As this function will be useful for other purposes
      too, make it a public interface.
      
      v2: Use smp_read_barrier_depends() instead of smp_load_acquire().  We
          only need data dep barrier and smp_load_acquire() is stronger and
          heavier on some archs.  Spotted by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      
      percpu-refcount: require percpu_ref to be exited explicitly
      
      Currently, a percpu_ref undoes percpu_ref_init() automatically by
      freeing the allocated percpu area when the percpu_ref is killed.
      While seemingly convenient, this has the following niggles.
      
      * It's impossible to re-init a released reference counter without
        going through re-allocation.
      
      * In a similar vein, it's impossible to initialize a percpu_ref
        count with static percpu variables.
      
      * We need and have an explicit destructor anyway for failure paths -
        percpu_ref_cancel_init().
      
      This patch removes the automatic percpu counter freeing in
      percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
      generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
      was considered but it gets confusing with percpu_ref_kill() while
      "exit" clearly indicates that it's the counterpart of
      percpu_ref_init().
      
      All percpu_ref_cancel_init() users are updated to invoke
      percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
      added to the destruction path of all percpu_ref users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Li Zefan <lizefan@huawei.com>
      
      percpu-refcount: use unsigned long for pcpu_count pointer
      
      percpu_ref->pcpu_count is a percpu pointer with a status flag in its
      lowest bit.  As such, it always goes through arithmetic operations
      which are very cumbersome to do on a pointer.  It has to be first
      cast to unsigned long and then back.

      Let's just make the field unsigned long so that we can skip the first
      casts.  While at it, rename it to pcpu_counter_ptr to clarify that
      it's a pointer value.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
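
Keeping a status flag in the lowest bit of a pointer-sized integer can be sketched as follows. The names `REF_DEAD`, `ref_pack()`, `ref_is_dead()` and `ref_counter()` are hypothetical; the sketch uses uintptr_t where the kernel field is unsigned long, and it assumes the counter is at least 2-byte aligned so bit 0 is free.

```c
#include <stdbool.h>
#include <stdint.h>

#define REF_DEAD ((uintptr_t)1)  /* hypothetical flag stored in bit 0 */

/* Combine a counter address and the dead flag into one integer. */
static uintptr_t ref_pack(unsigned int *counter, bool dead)
{
    return (uintptr_t)counter | (dead ? REF_DEAD : 0);
}

/* Test the flag without any pointer casts. */
static bool ref_is_dead(uintptr_t v)
{
    return (v & REF_DEAD) != 0;
}

/* Mask the flag out to recover the counter address. */
static unsigned int *ref_counter(uintptr_t v)
{
    return (unsigned int *)(v & ~REF_DEAD);
}
```

Storing the value as an integer means flag tests and masking need no casts; only the final conversion back to a pointer does.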
      
      percpu-refcount: add helpers for ->percpu_count accesses
      
      * All four percpu_ref_*() operations implemented in the header file
        perform the same operation to determine whether the percpu_ref is
        alive and extract the percpu pointer.  Factor out the common logic
        into __pcpu_ref_alive().  This doesn't change the generated code.
      
      * There are a couple of places in percpu-refcount.c which mask out
        PCPU_REF_DEAD to obtain the percpu pointer.  Factor it out into
        pcpu_count_ptr().
      
      * The above changes make the WARN_ON_ONCE() conditional at the top of
        percpu_ref_kill_and_confirm() the only user of REF_STATUS().  Test
        PCPU_REF_DEAD directly and remove REF_STATUS().
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu-refcount: one bit is enough for REF_STATUS
      
      percpu-refcount currently reserves two lowest bits of its percpu
      pointer to indicate its state; however, only one bit is used for
      PCPU_REF_DEAD.
      
      Simplify it by removing PCPU_STATUS_BITS/MASK and testing
      PCPU_REF_DEAD directly.  This also allows the compiler to choose a
      more efficient instruction depending on the architecture.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
      
      ioctx_alloc() reaches inside percpu_ref and directly frees
      ->pcpu_count in its failure path, which is quite gross.  percpu_ref
      has been providing a proper interface to do this,
      percpu_ref_cancel_init(), for quite some time now.  Let's use that
      instead.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      
      workqueue: stronger test in process_one_work()
      
      After the recent changes, when POOL_DISASSOCIATED is cleared, the
      running worker's local CPU should be the same as pool->cpu without any
      exception even during cpu-hotplug.  Update the sanity check in
      process_one_work() accordingly.
      
      This patch changes "(proposition_A && proposition_B && proposition_C)"
      to "(proposition_B && proposition_C)", so if the old compound
      proposition is true, the new one must be true too, so this will not
      hide any possible bug which can be caught by the old test.
      
      tj: Minor updates to the description.
      
      CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      
      workqueue: clear POOL_DISASSOCIATED in rebind_workers()
      
      The commit a9ab775b ("workqueue: directly restore CPU affinity of
      workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
      without also moving "pool->flags &= ~POOL_DISASSOCIATED".
      
      There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
      being moved together, but there isn't any benefit either. We move it
      into rebind_workers() and achieve these benefits:
      
      1) Better readability.  POOL_DISASSOCIATED is cleared in
         rebind_workers() as expected.
      
      2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
         running workers of the pool are on the local CPU (pool->cpu).
      
      tj: Cosmetic updates to the code and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      
      percpu: Use ALIGN macro instead of hand coding alignment calculation
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
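
For readers unfamiliar with alignment macros, here is a userspace illustration of the general idea. `ALIGN_UP` is a hypothetical name modelled on the kernel's ALIGN(), and `align_by_hand()` is only an example of an open-coded calculation, not the code this commit touched.

```c
#include <stddef.h>

/*
 * Round x up to the next multiple of a; a must be a power of two.
 * ALIGN_UP is a hypothetical userspace name, not the kernel macro itself.
 */
#define ALIGN_UP(x, a) \
    (((size_t)(x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

/* The kind of hand-coded calculation such a macro replaces. */
static size_t align_by_hand(size_t x, size_t a)
{
    return ((x + a - 1) / a) * a;
}
```

Both forms give the same result for power-of-two alignments; the macro states the intent directly and avoids the division.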
      
      percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
      
      __verify_pcpu_ptr() is used to verify that a specified parameter is
      actually a percpu pointer by percpu accessor and operation
      implementations.  Currently, where it's called isn't clearly defined
      and we just ensure that it's invoked at least once for all accessors
      and operations.
      
      The lack of clarity on when it should be called isn't nice and given
      that this is a completely generic issue, there's no reason to make
      archs worry about it.
      
      This patch updates __verify_pcpu_ptr() invocations such that it's
      always invoked from the final generic wrapper once per access or
      operation.  As this is already the case for {raw|this}_cpu_*()
      definitions through __pcpu_size_*(), only the {raw|per|this}_cpu_ptr()
      accessors need to be updated.
      
      This change makes it unnecessary for archs to worry about
      __verify_pcpu_ptr().  x86's arch_raw_cpu_ptr() is updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      
      percpu: prettify percpu header files
      
      percpu macros are difficult to read.  It's partly because they're
      fairly complex but also because they simply lack visual and
      conventional consistency to an unusual degree.  The preceding patches
      tried to organize macro definitions consistently by their roles.  This
      patch makes the following cosmetic changes to improve overall
      readability.
      
      * Use consistent convention for multi-line macro definitions - "do {"
        or "({" are now put on their own lines and the line continuing '\'
        are all put on the same column.
      
      * Temp variables used inside macro are consistently given "__" prefix.
      
      * When a macro argument is passed to another macro or a function,
        putting extra parentheses around it doesn't help anything.  Don't put
        them.
      
      * _this_cpu_generic_*() are renamed to this_cpu_generic_*() so that
        they're consistent with raw_cpu_generic_*().
      
      * Reorganize raw_cpu_*() and this_cpu_*() definitions so that trivial
        wrappers are collected in one place after actual operation
        definitions.
      
      * Other misc cleanups including reorganizing comments.
      
      All changes in this patch are cosmetic and cause no functional
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: use raw_cpu_*() to define __this_cpu_*()
      
      __this_cpu_*() operations are the same as raw_cpu_*() operations
      except for the added __this_cpu_preempt_check().  Curiously, these
      were defined using __pcpu_size_call_*() instead of being layered on top
      of raw_cpu_*().
      
      Let's layer them so that __this_cpu_*() are defined in terms of
      raw_cpu_*().  It's simpler and less error-prone this way.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: reorder macros in percpu header files
      
      * In include/asm-generic/percpu.h, collect {raw|_this}_cpu_generic*()
        macros into one place.  They were dispersed through
        {raw|this}_cpu_*_N() definitions and the visual inconsistency was
        making following the code unnecessarily difficult.
      
      * In include/linux/percpu-defs.h, move __verify_pcpu_ptr() later in
        the file so that it's right above accessor definitions where it's
        actually used.
      
      This is pure reorganization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
      
      We're in the process of moving all percpu accessors and operations to
      include/linux/percpu-defs.h so that they're available to arch headers
      without having to include full include/linux/percpu.h which may cause
      cyclic inclusion dependency.
      
      This patch moves {raw|this}_cpu_*() definitions from
      include/linux/percpu.h to include/linux/percpu-defs.h.  The code is
      moved mostly verbatim; however, raw_cpu_*() are placed above
      this_cpu_*() which is more conventional as the raw operations may be
      used to define other variants.

      This is pure reorganization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
      
      {raw|this}_cpu_*_N() operations are expected to be provided by archs
      and the generic definitions are provided as fallbacks.  As such, these
      firmly belong to include/asm-generic/percpu.h.
      
      Move the generic definitions to include/asm-generic/percpu.h.  The
      code is moved mostly verbatim; however, raw_cpu_*_N() are placed above
      this_cpu_*_N() which is more conventional as the raw operations may be
      used to define other variants.

      This is pure reorganization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
      
      Currently, percpu allows two separate methods for overriding
      {raw|this}_cpu_*() ops - for a given operation, an arch can provide a
      whole replacement or sized sub operations to override specific parts
      of it, e.g. an arch can either provide this_cpu_add() or provide
      this_cpu_add_4() to override only the 4 byte operation.

      While quite flexible at a glance, the dual-overriding scheme
      complicates the code path for no actual gain.  It complicates the
      already complex operation definitions and if an arch wants to override
      all sizes, it can easily provide all variants anyway.  In fact, no
      arch is actually making use of whole operation override.
      
      Another oddity is that __this_cpu_*() operations are defined in the
      same way as raw_cpu_*() but ignore full overrides of raw_cpu_*()
      and don't allow full operation override, so if an arch provides
      whole overrides for raw_cpu_*() operations, __this_cpu_*() end up
      using the generic implementations.
      
      More importantly, it takes away the layering between arch-specific and
      generic parts making it impossible for the generic part to implement
      arch-independent features on top of arch-specific overrides.
      
      This patch removes the support for whole operation overrides.  As no
      arch is using it, this doesn't cause any actual difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
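
A sized-override scheme of the kind kept by this patch can be sketched with the preprocessor. All `cpu_add*` names below are hypothetical and operate on ordinary variables rather than per-cpu data: the "arch" overrides only the 4-byte case, the remaining sizes fall back to generic definitions, and the whole-operation wrapper is always generic.

```c
#include <stdint.h>

/* "Arch" override for the 4-byte case only; it counts its own uses. */
static int arch_add_4_calls;
#define cpu_add_4(var, val) (arch_add_4_calls++, (var) += (val))

/* Generic fallbacks for every size the arch did not override. */
#ifndef cpu_add_1
#define cpu_add_1(var, val) ((var) += (val))
#endif
#ifndef cpu_add_2
#define cpu_add_2(var, val) ((var) += (val))
#endif
#ifndef cpu_add_4
#define cpu_add_4(var, val) ((var) += (val))
#endif
#ifndef cpu_add_8
#define cpu_add_8(var, val) ((var) += (val))
#endif

/* Whole-operation wrapper: always generic, dispatches on operand size. */
#define cpu_add(var, val)                   \
do {                                        \
    switch (sizeof(var)) {                  \
    case 1: cpu_add_1(var, val); break;     \
    case 2: cpu_add_2(var, val); break;     \
    case 4: cpu_add_4(var, val); break;     \
    case 8: cpu_add_8(var, val); break;     \
    }                                       \
} while (0)
```

Because the wrapper is never overridden, generic code could add checks inside `cpu_add()` and they would apply to every arch.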
      
      percpu: reorganize include/linux/percpu-defs.h
      
      Reorganize for better readability.
      
      * Accessor definitions are collected into one place and SMP and UP now
        define them in the same order.
      
      * Definitions are layered when possible - e.g. per_cpu() is now
        defined in terms of this_cpu_ptr().
      
      * Rather pointless comment dropped.
      
      * per_cpu(), __raw_get_cpu_var() and __get_cpu_var() are defined in a
        way which can be shared between SMP and UP and moved out of
        CONFIG_SMP blocks.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu: move accessors from include/linux/percpu.h to percpu-defs.h
      
      include/linux/percpu-defs.h is gonna host all accessors and operations
      so that arch headers can make use of them too without worrying about
      circular dependency through include/linux/percpu.h.
      
      This patch moves the following accessors from include/linux/percpu.h
      to include/linux/percpu-defs.h.
      
      * get/put_cpu_var()
      * get/put_cpu_ptr()
      * per_cpu_ptr()
      
      This is pure reorganization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
      
      The roles of the various percpu header files have become unclear.
      There are four header files involved.
      
       include/linux/percpu-defs.h
       include/linux/percpu.h
       include/asm-generic/percpu.h
       arch/*/include/asm/percpu.h
      
      The original intention for include/asm-generic/percpu.h is providing
      generic definitions for arch-overridable parts; however, it now hosts
      various stuff which can't be overridden by archs.
      
      Also, include/linux/percpu-defs.h was initially added to contain
      section and percpu variable definition macros so that arch header
      files can make use of them without worrying about introducing cyclic
      inclusion dependency by including include/linux/percpu.h; however,
      arch headers sometimes need to access percpu variables too and this is
      one of the reasons why some accessors were implemented in
      include/asm-generic/percpu.h.
      
      Let's clear up the situation by making include/asm-generic/percpu.h
      contain only arch-overridable parts and moving accessors and
      operations into include/linux/percpu-defs.h.  Note that this patch only
      moves things from include/asm-generic/percpu.h.
      include/linux/percpu.h will be taken care of by later patches.
      
      This patch moves the following.
      
      * SHIFT_PERCPU_PTR() / VERIFY_PERCPU_PTR()
      * per_cpu()
      * raw_cpu_ptr()
      * this_cpu_ptr()
      * __get_cpu_var()
      * __raw_get_cpu_var()
      * __this_cpu_ptr()
      * PER_CPU_[SHARED_]ALIGNED_SECTION
      * PER_CPU_FIRST_SECTION
      
      This patch is pure reorganization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: introduce arch_raw_cpu_ptr()
      
      Currently, archs can override raw_cpu_ptr() directly; however, we
      wanna build a layer of indirection in the generic part of percpu so
      that we can implement generic features there without affecting archs.
      
      Introduce arch_raw_cpu_ptr() which is used to define raw_cpu_ptr() by
      generic percpu code.  The two are identical for now.  x86 is currently
      the only arch which overrides raw_cpu_ptr() and is converted to
      define arch_raw_cpu_ptr() instead.
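
      The layering described above can be sketched in user space as follows.
      This is only an illustration, not the kernel's definitions: the slot
      array, the current_cpu_sketch variable and bump_local_counter() are
      hypothetical stand-ins for real per-CPU storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for per-CPU storage: one slot per CPU. */
#define NR_CPUS_SKETCH 4
static int counter_slots[NR_CPUS_SKETCH];
static int current_cpu_sketch;	/* pretend index of the running CPU */

/* Generic fallback, used only when an arch header did not define its own. */
#ifndef arch_raw_cpu_ptr
#define arch_raw_cpu_ptr(base) (&(base)[current_cpu_sketch])
#endif

/* Generic layer: archs override arch_raw_cpu_ptr(), never this name. */
#define raw_cpu_ptr(base) arch_raw_cpu_ptr(base)

/* Increment the counter slot of the "current CPU" and return its address. */
static int *bump_local_counter(void)
{
	int *slot = raw_cpu_ptr(counter_slots);

	assert(slot != NULL);
	(*slot)++;
	return slot;
}
```

      Because callers only ever see raw_cpu_ptr(), generic code could later
      wrap the arch hook (for example with extra checks) without touching
      any arch header.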
      
      This doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      
      percpu: disallow archs from overriding SHIFT_PERCPU_PTR()
      
      It has been about half a decade since all archs started using the
      dynamic percpu allocator and thus the same SHIFT_PERCPU_PTR()
      implementation.  There's no benefit in overriding SHIFT_PERCPU_PTR()
      anymore.
      
      Remove #ifndef around it to clarify that this is identical regardless
      of the arch.
      
      This patch doesn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      (cherry picked from commit 2a1b4cf2
      0b462c89)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      3f2c76f9
    • NeilBrown's avatar
      md/raid1,raid10: always abort recover on write error. · 49213bcb
      NeilBrown authored
      Currently we don't abort recovery on a write error if the write error
      to the recovering device was triggered by normal IO (as opposed to
      recovery IO).
      
      This means that for one bitmap region, the recovery might write to the
      recovering device for a few sectors, then not bother for subsequent
      sectors (as it never writes to failed devices).  In this case
      the bitmap bit will be cleared, but it really shouldn't.
      
      The result is that if the recovering device fails and is then re-added
      (after fixing whatever hardware problem triggered the failure),
      the second recovery won't redo the region it was in the middle of,
      so some of the device will not be recovered properly.
      
      If we abort the recovery, the region being processed will be cancelled
      (bit not cleared) and the whole region will be retried.
      
      As the bug can result in data corruption the patch is suitable for
      -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
      which will require care.
      
      Original-from: jiao hui <jiaohui@bwstor.com.cn>
      Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org
      
      (cherry picked from commit 2446dba0)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      49213bcb
    • Eric W. Biederman's avatar
      mnt: Correct permission checks in do_remount · 5061831c
      Eric W. Biederman authored
      While investigating the issue wherein "mount --bind -oremount,ro ..."
      would result in later "mount --bind -oremount,rw" succeeding even if
      the mount started off locked I realized that there are several
      additional mount flags that should be locked and are not.
      
      In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
      flags in addition to MNT_READONLY should all be locked.  These
      flags are all per superblock, can all be changed with MS_BIND,
      and should not be changeable if set by a more privileged user.
      
      The following additions to the current logic are added in this patch.
      - nosuid may not be clearable by a less privileged user.
      - nodev  may not be clearable by a less privileged user.
      - noexec may not be clearable by a less privileged user.
      - atime flags may not be changeable by a less privileged user.
      
      The logic with atime is that always setting atime on access is a
      global policy and backup software and auditing software could break if
      atime bits are not updated (when they are configured to be updated),
      and serious performance degradation could result (DOS attack) if atime
      updates happen when they have been explicitly disabled.  Therefore an
      unprivileged user should not be able to mess with the atime bits set
      by a more privileged user.
      
      The additional restrictions are implemented with the addition of
      MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
      mnt flags.
      
      Taken together these changes and the fixes for MNT_LOCK_READONLY
      should make it safe for an unprivileged user to create a user
      namespace and to call "mount --bind -o remount,... ..." without
      the danger of mount flags being changed maliciously.
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      
      (cherry picked from commit 9566d674)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      5061831c
    • Eric W. Biederman's avatar
      mnt: Only change user settable mount flags in remount · 6fca9e95
      Eric W. Biederman authored
      Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
      read-only bind mount read-only in a user namespace the
      MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
      to remount a read-only mount read-write.
      
      Correct this by replacing the mask of mount flags to preserve
      with a mask of mount flags that may be changed, and preserve
      all others.   This ensures that any future bugs with this mask and
      remount will fail in an easy to detect way where new mount flags
      simply won't change.
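
      A minimal sketch of the "mask of flags that may be changed" approach,
      assuming made-up flag values: the SK_* constants and remount_flags()
      are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only (not the kernel's). */
#define SK_MNT_NOSUID        0x001u
#define SK_MNT_NODEV         0x002u
#define SK_MNT_READONLY      0x040u
#define SK_MNT_LOCK_READONLY 0x100u

/* The only flags a remount request is allowed to change. */
#define SK_USER_SETTABLE_MASK \
	(SK_MNT_NOSUID | SK_MNT_NODEV | SK_MNT_READONLY)

/* Combine the current mount flags with the flags requested by a remount. */
static unsigned int remount_flags(unsigned int cur_flags,
				  unsigned int requested)
{
	/* Everything outside the settable mask is carried over untouched. */
	unsigned int kept = cur_flags & ~SK_USER_SETTABLE_MASK;
	unsigned int changed = requested & SK_USER_SETTABLE_MASK;
	unsigned int result = kept | changed;

	/* Invariant: a remount can never set or clear the lock bit. */
	assert((result & SK_MNT_LOCK_READONLY) ==
	       (cur_flags & SK_MNT_LOCK_READONLY));
	return result;
}
```

      With this shape, a flag that someone forgets to add to the settable
      mask simply never changes on remount, which is the easy-to-detect
      failure mode the message describes.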
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      
      (cherry picked from commit a6138db8)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      6fca9e95
    • Ralf Baechle's avatar
      MIPS: Fix accessing to per-cpu data when flushing the cache · b8a9e7ba
      Ralf Baechle authored
      This fixes the following issue
      
      BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/1761
      caller is blast_dcache32+0x30/0x254
      Call Trace:
      [<8047f02c>] dump_stack+0x8/0x34
      [<802e7e40>] debug_smp_processor_id+0xe0/0xf0
      [<80114d94>] blast_dcache32+0x30/0x254
      [<80118484>] r4k_dma_cache_wback_inv+0x200/0x288
      [<80110ff0>] mips_dma_map_sg+0x108/0x180
      [<80355098>] ide_dma_prepare+0xf0/0x1b8
      [<8034eaa4>] do_rw_taskfile+0x1e8/0x33c
      [<8035951c>] ide_do_rw_disk+0x298/0x3e4
      [<8034a3c4>] do_ide_request+0x2e0/0x704
      [<802bb0dc>] __blk_run_queue+0x44/0x64
      [<802be000>] queue_unplugged.isra.36+0x1c/0x54
      [<802beb94>] blk_flush_plug_list+0x18c/0x24c
      [<802bec6c>] blk_finish_plug+0x18/0x48
      [<8026554c>] journal_commit_transaction+0x3b8/0x151c
      [<80269648>] kjournald+0xec/0x238
      [<8014ac00>] kthread+0xb8/0xc0
      [<8010268c>] ret_from_kernel_thread+0x14/0x1c
      
      Caches in most systems are identical - but not always, so we can't avoid
      the use of smp_call_function() by just looking at the boot CPU's data;
      we have to fiddle with preemption instead.
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/5835
      
      (cherry picked from commit ff522058)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      b8a9e7ba
    • Aaro Koskinen's avatar
      MIPS: OCTEON: make get_system_type() thread-safe · 43ed8029
      Aaro Koskinen authored
      get_system_type() is not thread-safe on OCTEON. It uses static data;
      a more dangerous issue is that it calls cvmx_fuse_read_byte()
      every time without any synchronization. Currently it's possible to get
      processes stuck looping forever in kernel simply by launching multiple
      readers of /proc/cpuinfo:
      
      	(while true; do cat /proc/cpuinfo > /dev/null; done) &
      	(while true; do cat /proc/cpuinfo > /dev/null; done) &
      	...
      
      Fix by initializing the system type string only once during the early
      boot.
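
      The "initialize once, then only read" pattern can be sketched like
      this. It is a user-space illustration under assumptions: the buffer
      size, init_system_type() and get_system_type_sketch() are hypothetical
      names, not the OCTEON code.

```c
#include <assert.h>
#include <stdio.h>

/* Filled in once during (simulated) early boot, read-only afterwards. */
static char system_type_buf[80];
static int init_calls;	/* counts how often the string was built */

/* Build the string a single time, before any concurrent reader exists. */
static void init_system_type(const char *board, const char *model)
{
	assert(board != NULL && model != NULL);
	init_calls++;
	snprintf(system_type_buf, sizeof(system_type_buf),
		 "%s (%s)", board, model);
}

/* Readers only return the cached pointer; no shared state is modified. */
static const char *get_system_type_sketch(void)
{
	return system_type_buf;
}
```

      Since the getter no longer writes to the buffer or touches hardware,
      any number of concurrent readers can call it without synchronization.
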
      Signed-off-by: Aaro Koskinen <aaro.koskinen@nsn.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Markos Chandras <markos.chandras@imgtec.com>
      Patchwork: http://patchwork.linux-mips.org/patch/7437/
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      
      MIPS: CPS: Initialize EVA before bringing up VPEs from secondary cores
      
      The CPS code is doing several memory loads when configuring the VPEs
      from secondary cores, so the segmentation control registers must be
      initialized in time otherwise the kernel will crash with strange
      TLB exceptions.
      Reviewed-by: Paul Burton <paul.burton@imgtec.com>
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Patchwork: http://patchwork.linux-mips.org/patch/7424/
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      
      MIPS: Malta: EVA: Rename 'eva_entry' to 'platform_eva_init'
      
      Rename 'eva_entry' to 'platform_eva_init' as required by the new
      'eva_init' macro in the eva.h header. Since this macro is now used
      in a platform dependent way, it must not depend on its caller so move
      the t1 register initialization inside this macro. Also set the .reorder
      assembler option in case the caller may have previously set .noreorder.
      This may allow a few assembler optimizations. Finally include missing
      headers and document the register usage for this macro.
      Reviewed-by: Paul Burton <paul.burton@imgtec.com>
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Patchwork: http://patchwork.linux-mips.org/patch/7423/
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      
      MIPS: EVA: Add new EVA header
      
      Generic code may need to perform certain operations when EVA is
      enabled, for example, configure the segmentation registers during
      boot. In order to avoid using more CONFIG_EVA ifdefs in the arch code,
      such functions will be added in this header instead.
      Initially this header contains a macro which will be used by generic
      code later on during VPEs configuration on secondary cores.
      All it does is to call the platform specific EVA init code in case
      EVA is enabled.
      Reviewed-by: Paul Burton <paul.burton@imgtec.com>
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Patchwork: http://patchwork.linux-mips.org/patch/7422/
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      
      MIPS: scall64-o32: Fix indirect syscall detection
      
      Commit 4c21b8fd (MIPS: seccomp: Handle indirect system calls (o32))
      added indirect syscall detection for O32 processes running on MIPS64
      but it did not work as expected. The reason is that the scall64-o32
      implementation differs compared to scall32-o32. In the former, the v0
      (syscall number) register contains the absolute syscall number
      (4000 + X) whereas in the latter it contains the relative syscall
      number (X). Fix the code to avoid doing an extra addition, and load
      the v0 register directly to the first argument for syscall_trace_enter.
      Moreover, set the .reorder assembler option in order to have better
      control on this part of the assembly code.
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Patchwork: http://patchwork.linux-mips.org/patch/7481/
      Cc: <stable@vger.kernel.org> # v3.15+
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      
      MIPS: syscall: Fix AUDIT value for O32 processes on MIPS64
      
      On MIPS64, O32 processes set both TIF_32BIT_ADDR and
      TIF_32BIT_REGS so the previous condition treated O32 applications
      as N32 when evaluating seccomp filters. Fix the condition to check
      both TIF_32BIT_{REGS, ADDR} for the N32 AUDIT flag.
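
      The corrected condition can be sketched as a small classifier. This is
      an illustration, not the kernel code: the enum and classify_abi() are
      hypothetical, and it assumes that N32 tasks set only TIF_32BIT_ADDR
      while O32 tasks set both flags, as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ABI labels; the kernel encodes this in AUDIT_ARCH_* bits. */
enum sketch_abi { SKETCH_ABI_N64, SKETCH_ABI_N32, SKETCH_ABI_O32 };

/* Classify a task on a 64-bit kernel from its two thread flags. */
static enum sketch_abi classify_abi(bool tif_32bit_regs, bool tif_32bit_addr)
{
	if (tif_32bit_regs) {
		/* O32 tasks are expected to set both flags. */
		assert(tif_32bit_addr);
		return SKETCH_ABI_O32;
	}
	/* 64-bit registers: N32 only when addresses are 32-bit. */
	return tif_32bit_addr ? SKETCH_ABI_N32 : SKETCH_ABI_N64;
}
```

      Checking TIF_32BIT_ADDR alone would put the O32 case into the N32
      branch, which is the misclassification the patch fixes.
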
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Patchwork: http://patchwork.linux-mips.org/patch/7480/
      Cc: <stable@vger.kernel.org> # v3.15+
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      
      MIPS: Loongson: Fix COP2 usage for preemptible kernel
      
      In a preemptible kernel, only the TIF_USEDFPU flag reliably indicates
      whether _init_fpu()/_restore_fp() is needed, because the value of
      CP0_Status.CU1 isn't changed during preemption.
      
      V2: Fix coding style.
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Cc: John Crispin <john@phrozen.org>
      Cc: Steven J. Hill <Steven.Hill@imgtec.com>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: linux-mips@linux-mips.org
      Cc: Fuxin Zhang <zhangfx@lemote.com>
      Cc: Zhangjin Wu <wuzhangjin@gmail.com>
      Patchwork: https://patchwork.linux-mips.org/patch/7515/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: NL: Fix nlm_xlp_defconfig build error
      
      The nlm_xlp_defconfig build fails with
      
      ./arch/mips/include/asm/mach-netlogic/topology.h:15:0:
      			error: "topology_core_id" redefined [-Werror]
      In file included from include/linux/smp.h:59:0,
      	[ ...]
                       from arch/mips/mm/dma-default.c:12:
      ./arch/mips/include/asm/smp.h:41:0:
      			note: this is the location of the previous definition
      
      and similar errors.
      
      This is caused by commit bda4584c ("MIPS: Support CPU topology files
      in sysfs") which adds the defines to arch/mips/include/asm/smp.h.
      
      Remove the defines from arch/mips/include/asm/mach-netlogic/topology.h
      as no longer necessary.
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: Andreas Herrmann <andreas.herrmann@caviumnetworks.com>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/7513/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: Remove race window in page fault handling
      
      Multicore MIPSes without I/D hardware coherency suffered from a race
      condition in the page fault handler. The page table entry was
      published before any pending lazy D-cache flush was committed, hence
      it allowed execution of stale page cache data by other VPEs in the
      system.
      
      To make the cache handling safe we need to perform flushing already in
      the set_pte_at function. MIPSes without coherent I-caches can get a
      small increase in flushes due to the unavailability of the execute
      flag in set_pte_at.
      
      [ralf@linux-mips.org: outlining set_pte_at() saves a good k in a test
      build, so I moved its definition from pgtable.h to cache.c.]
      Signed-off-by: Lars Persson <larper@axis.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/7511/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: Malta: Improve system memory detection for '{e, }memsize' >= 2G
      
      Using kstrtol to parse the "{e,}memsize" variables was wrong because this
      parses signed long numbers. In case of '{e,}memsize' >= 2G, the top bit
      is set, resulting in -ERANGE errors and possibly random system memory
      boundaries. We fix this by replacing "kstrtol" with "kstrtoul".
      We also improve the code to check the kstrtoul return value and
      print a warning if an error was returned.
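
      The signed versus unsigned difference can be reproduced in user space.
      The sketch below assumes a 32-bit long and emulates it with fixed-width
      types; parse_s32() and parse_u32() are hypothetical helpers built on
      the standard strtoll()/strtoull(), not the kernel's kstrtol()/kstrtoul().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse as a signed 32-bit value. Returns 0, -EINVAL or -ERANGE. */
static int parse_s32(const char *s, int32_t *out)
{
	char *end;
	long long v;

	assert(s != NULL && out != NULL);
	errno = 0;
	v = strtoll(s, &end, 0);
	if (end == s || *end != '\0')
		return -EINVAL;
	if (errno == ERANGE || v < INT32_MIN || v > INT32_MAX)
		return -ERANGE;	/* 2G and above does not fit */
	*out = (int32_t)v;
	return 0;
}

/* Parse as an unsigned 32-bit value. Returns 0, -EINVAL or -ERANGE. */
static int parse_u32(const char *s, uint32_t *out)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);
	errno = 0;
	v = strtoull(s, &end, 0);
	if (end == s || *end != '\0')
		return -EINVAL;
	if (errno == ERANGE || v > UINT32_MAX)
		return -ERANGE;
	*out = (uint32_t)v;	/* values with the top bit set are fine */
	return 0;
}
```

      For the input "0x80000000" (2G), the signed helper reports -ERANGE
      while the unsigned one succeeds, and the caller can warn on any
      nonzero return value.
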
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Cc: <stable@vger.kernel.org> # v3.15+
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/7543/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: Alchemy: Fix db1200 PSC clock enablement
      
      Enable PSC0 (I2C/SPI) clock and leave PSC1 (Audio) alone.  This patch
      restores functionality to both Audio and I2C/SPI.
      Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
      Cc: Linux-MIPS <linux-mips@linux-mips.org>
      Patchwork: https://patchwork.linux-mips.org/patch/7544/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: BCM47XX: Fix reboot problem on BCM4705/BCM4785
      
      This adds some code based on code from the Broadcom GPL tar to fix the
      reboot problems on BCM4705/BCM4785. I tried rebooting my device for ~10
      times and have never seen a problem. This reverts the changes in the
      previous commit and adds the real fix as suggested by Rafał.
      
      Setting bit 22 in Reg 22, sel 4 puts the BIU (Bus Interface Unit) into
      async mode.
      
      The previous commit was 316cad5c [MIPS:
      BCM47XX: make reboot more relaiable]
      Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
      Cc: jogo@openwrt.org
      Cc: zajec5@gmail.com
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/7545/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: Remove duplicated include from numa.c
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/7537/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: Add common plat_irq_dispatch declaration
      
      Add common declaration to get rid of following sparse warning: "symbol
      'plat_irq_dispatch' was not declared. Should it be static?"
      Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
      Cc: Linux MIPS <linux-mips@linux-mips.org>
      Patchwork: https://patchwork.linux-mips.org/patch/7539/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: MSP71xx: remove unused plat_irq_dispatch() argument
      
      Remove unused argument to make the plat_irq_dispatch() function
      declaration match the implementations on other platforms.
      Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
      Cc: Linux MIPS <linux-mips@linux-mips.org>
      Patchwork: https://patchwork.linux-mips.org/patch/7538/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: GIC: Remove useless parens from GICBIS().
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: perf: Mark pmu interupt IRQF_NO_THREAD
      
      In an RT kernel, I ran into the following call trace, so PMU interrupts
      cannot be threaded:
      
      in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/0
      INFO: lockdep is turned off.
      Call Trace:
      [<ffffffff8088595c>] dump_stack+0x1c/0x50
      [<ffffffff801a958c>] __might_sleep+0x13c/0x148
      [<ffffffff80891c54>] rt_spin_lock+0x3c/0xb0
      [<ffffffff801ad29c>] __wake_up+0x3c/0x80
      [<ffffffff80243ba4>] perf_event_wakeup+0x8c/0xf8
      [<ffffffff80243c50>] perf_pending_event+0x40/0x78
      [<ffffffff8023d88c>] irq_work_run+0x74/0xc0
      [<ffffffff80152640>] mipsxx_pmu_handle_shared_irq+0x110/0x228
      [<ffffffff8015276c>] mipsxx_pmu_handle_irq+0x14/0x30
      [<ffffffff801ffda4>] handle_irq_event_percpu+0xbc/0x470
      [<ffffffff80204478>] handle_percpu_irq+0x98/0xc8
      [<ffffffff801ff284>] generic_handle_irq+0x4c/0x68
      [<ffffffff8089748c>] do_IRQ+0x2c/0x48
      [<ffffffff80105864>] plat_irq_dispatch+0x64/0xd0
      
      [ralf@linux-mips.org: I don't see why based on this register dump the
      handler should be marked IRQF_NO_THREAD - but the handler is manipulating
      per-CPU resources so we don't want it to be rescheduled to another CPU.]
      Signed-off-by: Yang Wei <Wei.Yang@windriver.com>
      Cc: a.p.zijlstra@chello.nl
      Cc: paulus@samba.org
      Cc: mingo@redhat.com
      Cc: acme@kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/7506/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      MIPS: kdump: Set correct value to kexec_indirection_page variable
      
      Since there is no indirection page for the crash type, the value of the head
      field of the kimage structure is not equal to the address of an indirection
      page but to IND_DONE, so we have to set the kexec_indirection_page variable to
      the address of the head field of the image structure.
      
      [ralf@linux-mips.org: Don't add pointless empty line, fix trailing
      whitespace damage.]
      Signed-off-by: Yang Wei <Wei.Yang@windriver.com>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/7499/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      
      (cherry picked from commit 33d9a530
      60830868)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      43ed8029
    • Jarkko Sakkinen's avatar
      tpm: missing tpm_chip_put in tpm_get_random() · 2273ecb5
      Jarkko Sakkinen authored
      Regression in 41ab999c. Call to tpm_chip_put is missing. This
      will cause TPM device driver not to unload if tpm_get_random()
      is called.
      
      Cc: <stable@vger.kernel.org> # 3.7+
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
      
      (cherry picked from commit 3e14d83e)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      2273ecb5
    • Paolo Bonzini's avatar
      Revert "KVM: x86: Increase the number of fixed MTRR regs to 10" · 19249df2
      Paolo Bonzini authored
      This reverts commit 682367c4,
      which causes 32-bit SMP Windows 7 guests to panic.
      
      SeaBIOS has a limit on the number of MTRRs that it can handle,
      and this patch exceeded the limit.  Better revert it.
      Thanks to Nadav Amit for debugging the cause.
      
      Cc: stable@nongnu.org
      Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      
      (cherry picked from commit 0d234daf)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      19249df2
    • H. Peter Anvin's avatar
      x86, espfix: Make espfix64 a Kconfig option, fix UML · 60b42911
      H. Peter Anvin authored
      Make espfix64 a hidden Kconfig option.  This fixes the x86-64 UML
      build which had broken due to the non-existence of init_espfix_bsp()
      in UML: since UML uses its own Kconfig, this option does not appear in
      the UML build.
      
      This also makes it possible to make support for 16-bit segments a
      configuration option, for the people who want to minimize the size of
      the kernel.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Richard Weinberger <richard@nod.at>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      
      (cherry picked from commit 197725de)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      60b42911
    • Bart Van Assche's avatar
      IB/srp: Fix deadlock between host removal and multipathd · bcb9a1f7
      Bart Van Assche authored
      If scsi_remove_host() is invoked after a SCSI device has been blocked,
      if the fast_io_fail_tmo or dev_loss_tmo work gets scheduled on the
      workqueue executing srp_remove_work() and if an I/O request is
      scheduled after the SCSI device had been blocked by e.g. multipathd
      then the following deadlock can occur:
      
          kworker/6:1     D ffff880831f3c460     0   195      2 0x00000000
          Call Trace:
           [<ffffffff814aafd9>] schedule+0x29/0x70
           [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0
           [<ffffffff8105af6f>] msleep+0x2f/0x40
           [<ffffffff8123b0ae>] __blk_drain_queue+0x4e/0x180
           [<ffffffff8123d2d5>] blk_cleanup_queue+0x225/0x230
           [<ffffffffa0010732>] __scsi_remove_device+0x62/0xe0 [scsi_mod]
           [<ffffffffa000ed2f>] scsi_forget_host+0x6f/0x80 [scsi_mod]
           [<ffffffffa0002eba>] scsi_remove_host+0x7a/0x130 [scsi_mod]
           [<ffffffffa07cf5c5>] srp_remove_work+0x95/0x180 [ib_srp]
           [<ffffffff8106d7aa>] process_one_work+0x1ea/0x6c0
           [<ffffffff8106dd9b>] worker_thread+0x11b/0x3a0
           [<ffffffff810758bd>] kthread+0xed/0x110
           [<ffffffff814b972c>] ret_from_fork+0x7c/0xb0
          multipathd      D ffff880096acc460     0  5340      1 0x00000000
          Call Trace:
           [<ffffffff814aafd9>] schedule+0x29/0x70
           [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0
           [<ffffffff814ab79b>] io_schedule_timeout+0x9b/0xf0
           [<ffffffff814abe1c>] wait_for_completion_io_timeout+0xdc/0x110
           [<ffffffff81244b9b>] blk_execute_rq+0x9b/0x100
           [<ffffffff8124f665>] sg_io+0x1a5/0x450
           [<ffffffff8124fd21>] scsi_cmd_ioctl+0x2a1/0x430
           [<ffffffff8124fef2>] scsi_cmd_blk_ioctl+0x42/0x50
           [<ffffffffa00ec97e>] sd_ioctl+0xbe/0x140 [sd_mod]
           [<ffffffff8124bd04>] blkdev_ioctl+0x234/0x840
           [<ffffffff811cb491>] block_ioctl+0x41/0x50
           [<ffffffff811a0df0>] do_vfs_ioctl+0x300/0x520
           [<ffffffff811a1051>] SyS_ioctl+0x41/0x80
           [<ffffffff814b9962>] tracesys+0xd0/0xd5
      
      Fix this by scheduling removal work on another workqueue than the
      transport layer timers.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Reviewed-by: David Dillow <dave@thedillows.org>
      Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      
      (cherry picked from commit bcc05910)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      bcb9a1f7
    • Jeff Moyer's avatar
      dm table: propagate QUEUE_FLAG_NO_SG_MERGE · d4ac51ae
      Jeff Moyer authored
      Commit 05f1dd53 ("block: add queue flag for disabling SG merging")
      introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE.  This gets set by
      default in blk_mq_init_queue for mq-enabled devices.  The effect of
      the flag is to bypass the SG segment merging.  Instead, the
      bio->bi_vcnt is used as the number of hardware segments.
      
      With a device mapper target on top of a device with
      QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
      than a driver is prepared to handle.  I ran into this when backporting
      the virtio_blk mq support.  It triggered this BUG_ON, in
      virtio_queue_rq:
      
              BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
      
      The queue's max is set here:
              blk_queue_max_segments(q, vblk->sg_elems-2);
      
      Basically, what happens is that a bio is built up for the dm device
      (which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
      bio_add_page.  That path will call into __blk_recalc_rq_segments, so
      what you end up with is bi_phys_segments being much smaller than bi_vcnt
      (and bi_vcnt grows beyond the maximum sg elements).  Then, when the bio
      is submitted, it gets cloned.  When the cloned bio is submitted, it will
      end up in blk_recount_segments, here:
      
              if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
                      bio->bi_phys_segments = bio->bi_vcnt;
      
      and now we've set bio->bi_phys_segments to a number that is beyond what
      was registered as queue_max_segments by the driver.
      
      The right way to fix this is to propagate the queue flag up the stack.
      
      The rules for propagating the flag are simple:
      - if the flag is set for any underlying device, it must be set for the
        upper device
      - consequently, if the flag is not set for any underlying device, it
        should not be set for the upper device.
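
      The two propagation rules amount to a logical OR over the underlying
      devices. A minimal sketch, assuming a simplified device model: struct
      sketch_dev and upper_no_sg_merge() are hypothetical and stand in for
      the real request queues and their flag bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an underlying device's request queue. */
struct sketch_dev {
	bool no_sg_merge;	/* models QUEUE_FLAG_NO_SG_MERGE being set */
};

/*
 * Decide the flag for the upper (stacked) device: set if any underlying
 * device has it set, clear if none does.
 */
static bool upper_no_sg_merge(const struct sketch_dev *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (devs[i].no_sg_merge)
			return true;
	}
	return false;
}
```

      Recomputing the flag from scratch whenever the table changes also
      covers the second rule: once no underlying device has the flag, the
      upper device ends up without it.
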
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.16+
      
      (cherry picked from commit 200612ec)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d4ac51ae
    • Roger Quadros's avatar
      mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc() · ebdeb9a7
      Roger Quadros authored
      Commit 65b97cf6, introduced in v3.7, caused a regression by using a
      reversed CS_MASK, which makes omap_calculate_ecc always fail. As the
      NAND base driver never checks .calculate()'s return value, the zeroed
      ECC values are used as is, without showing any error to the user.
      However, this won't work, and the NAND device won't be protected by
      any error-correcting code.
      
      Fix the issue by using the correct mask.
      
      Code was tested on omap3beagle using the following procedure:
      - flash the primary bootloader (MLO) from the kernel to the first
      NAND partition using nandwrite.
      - boot the board from NAND. This utilizes OMAP ROM loader that
      relies on 1-bit Hamming code ECC.
      
      Fixes: 65b97cf6 (mtd: nand: omap2: handle nand on gpmc)
      
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: Roger Quadros <rogerq@ti.com>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      
      (cherry picked from commit 40ddbf50)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      ebdeb9a7
    • Kevin Hao's avatar
      mtd/ftl: fix the double free of the buffers allocated in build_maps() · cad68b9b
      Kevin Hao authored
      I got the following panic on my fsl p5020ds board.
      
        Unable to handle kernel paging request for data at address 0x7375627379737465
        Faulting instruction address: 0xc000000000100778
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=24 CoreNet Generic
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-next-20140613 #145
        task: c0000000fe080000 ti: c0000000fe088000 task.ti: c0000000fe088000
        NIP: c000000000100778 LR: c00000000010073c CTR: 0000000000000000
        REGS: c0000000fe08aa00 TRAP: 0300   Not tainted  (3.15.0-next-20140613)
        MSR: 0000000080029000 <CE,EE,ME>  CR: 24ad2e24  XER: 00000000
        DEAR: 7375627379737465 ESR: 0000000000000000 SOFTE: 1
        GPR00: c0000000000c99b0 c0000000fe08ac80 c0000000009598e0 c0000000fe001d80
        GPR04: 00000000000000d0 0000000000000913 c000000007902b20 0000000000000000
        GPR08: c0000000feaae888 0000000000000000 0000000007091000 0000000000200200
        GPR12: 0000000028ad2e28 c00000000fff4000 c0000000007abe08 0000000000000000
        GPR16: c0000000007ab160 c0000000007aaf98 c00000000060ba68 c0000000007abda8
        GPR20: c0000000007abde8 c0000000feaea6f8 c0000000feaea708 c0000000007abd10
        GPR24: c000000000989370 c0000000008c6228 00000000000041ed c0000000fe00a400
        GPR28: c00000000017c1cc 00000000000000d0 7375627379737465 c0000000fe001d80
        NIP [c000000000100778] .__kmalloc_track_caller+0x70/0x168
        LR [c00000000010073c] .__kmalloc_track_caller+0x34/0x168
        Call Trace:
        [c0000000fe08ac80] [c00000000087e6b8] uevent_sock_list+0x0/0x10 (unreliable)
        [c0000000fe08ad20] [c0000000000c99b0] .kstrdup+0x44/0x90
        [c0000000fe08adc0] [c00000000017c1cc] .__kernfs_new_node+0x4c/0x130
        [c0000000fe08ae70] [c00000000017d7e4] .kernfs_new_node+0x2c/0x64
        [c0000000fe08aef0] [c00000000017db00] .kernfs_create_dir_ns+0x34/0xc8
        [c0000000fe08af80] [c00000000018067c] .sysfs_create_dir_ns+0x58/0xcc
        [c0000000fe08b010] [c0000000002c711c] .kobject_add_internal+0xc8/0x384
        [c0000000fe08b0b0] [c0000000002c7644] .kobject_add+0x64/0xc8
        [c0000000fe08b140] [c000000000355ebc] .device_add+0x11c/0x654
        [c0000000fe08b200] [c0000000002b5988] .add_disk+0x20c/0x4b4
        [c0000000fe08b2c0] [c0000000003a21d4] .add_mtd_blktrans_dev+0x340/0x514
        [c0000000fe08b350] [c0000000003a3410] .mtdblock_add_mtd+0x74/0xb4
        [c0000000fe08b3e0] [c0000000003a32cc] .blktrans_notify_add+0x64/0x94
        [c0000000fe08b470] [c00000000039b5b4] .add_mtd_device+0x1d4/0x368
        [c0000000fe08b520] [c00000000039b830] .mtd_device_parse_register+0xe8/0x104
        [c0000000fe08b5c0] [c0000000003b8408] .of_flash_probe+0x72c/0x734
        [c0000000fe08b750] [c00000000035ba40] .platform_drv_probe+0x38/0x84
        [c0000000fe08b7d0] [c0000000003599a4] .really_probe+0xa4/0x29c
        [c0000000fe08b870] [c000000000359d3c] .__driver_attach+0x100/0x104
        [c0000000fe08b900] [c00000000035746c] .bus_for_each_dev+0x84/0xe4
        [c0000000fe08b9a0] [c0000000003593c0] .driver_attach+0x24/0x38
        [c0000000fe08ba10] [c000000000358f24] .bus_add_driver+0x1c8/0x2ac
        [c0000000fe08bab0] [c00000000035a3a4] .driver_register+0x8c/0x158
        [c0000000fe08bb30] [c00000000035b9f4] .__platform_driver_register+0x6c/0x80
        [c0000000fe08bba0] [c00000000084e080] .of_flash_driver_init+0x1c/0x30
        [c0000000fe08bc10] [c000000000001864] .do_one_initcall+0xbc/0x238
        [c0000000fe08bd00] [c00000000082cdc0] .kernel_init_freeable+0x188/0x268
        [c0000000fe08bdb0] [c0000000000020a0] .kernel_init+0x1c/0xf7c
        [c0000000fe08be30] [c000000000000884] .ret_from_kernel_thread+0x58/0xd4
        Instruction dump:
        41bd0010 480000c8 4bf04eb5 60000000 e94d0028 e93f0000 7cc95214 e8a60008
        7fc9502a 2fbe0000 419e00c8 e93f0022 <7f7e482a> 39200000 88ed06b2 992d06b2
        ---[ end trace b4c9a94804a42d40 ]---
      
      It seems that the corrupted partition header on my mtd device triggers
      a bug in the ftl. In function build_maps() it will allocate the buffers
      needed by the mtd partition, but if something goes wrong such as kmalloc
      failure, mtd read error or invalid partition header parameter, it will
      free all allocated buffers and then return non-zero. In my case, it
      seems that partition header parameter 'NumTransferUnits' is invalid.
      
      ftl_freepart() is a function which frees all the partition buffers
      allocated by build_maps(). Given that build_maps() is a self-cleaning
      function, there is no need to invoke ftl_freepart() even if
      build_maps() returns an error. Otherwise the buffers will be freed
      twice and then weird things would happen.
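      The ownership rule described above can be sketched in plain C as
      follows. This is an illustration under stated assumptions, not the ftl
      code: struct toy_part and the toy_* functions are hypothetical, and
      resetting the pointers to NULL on failure is an extra safety net that
      the description above does not mention.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, much-reduced stand-in for the FTL partition state. */
struct toy_part {
        int *map_a;
        int *map_b;
};

/*
 * Self-cleaning builder: on any failure it frees what it allocated and
 * returns non-zero, so the caller has nothing left to free.
 */
int toy_build_maps(struct toy_part *p, size_t n, int header_ok)
{
        assert(p != NULL);
        p->map_a = NULL;
        p->map_b = NULL;

        p->map_a = calloc(n, sizeof(*p->map_a));
        if (!p->map_a)
                goto fail;
        p->map_b = calloc(n, sizeof(*p->map_b));
        if (!p->map_b || !header_ok)
                goto fail;
        return 0;
fail:
        free(p->map_b);         /* free(NULL) is a no-op */
        free(p->map_a);
        p->map_a = NULL;
        p->map_b = NULL;
        return -1;
}

/* Release path for a partition that was built successfully. */
void toy_freepart(struct toy_part *p)
{
        free(p->map_a);
        free(p->map_b);
        p->map_a = NULL;
        p->map_b = NULL;
}

/* Caller: never call toy_freepart() after a failed build. */
int toy_add_partition(struct toy_part *p, size_t n, int header_ok)
{
        if (toy_build_maps(p, n, header_ok) != 0)
                return -1;      /* the builder already cleaned up */
        toy_freepart(p);        /* normal teardown, inlined for brevity */
        return 0;
}
```

      The point of the sketch is the failure branch of toy_add_partition():
      it returns immediately instead of calling the release function a
      second time.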
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Brian Norris <computersforpeace@gmail.com>
      
      (cherry picked from commit a152056c)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      cad68b9b
    • Pavel Shilovsky's avatar
      CIFS: Fix wrong restart readdir for SMB1 · 7a6868e4
      Pavel Shilovsky authored
      The existing code calls server->ops->close(), which is not
      right. This causes XFS test generic/310 to fail. Fix this
      by using the server->ops->closedir() function.
      
      Cc: <stable@vger.kernel.org> # v3.7+
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      
      (cherry picked from commit f736906a)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      7a6868e4
    • Pavel Shilovsky's avatar
      CIFS: Fix wrong filename length for SMB2 · f62540d9
      Pavel Shilovsky authored
      The existing code uses the old MAX_NAME constant. This causes
      XFS test generic/013 to fail. Fix it by replacing MAX_NAME with
      PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant
      definition.
      
      Cc: <stable@vger.kernel.org> # v3.7+
      Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      
      (cherry picked from commit 1bbe4997)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      f62540d9
    • Pavel Shilovsky's avatar
      CIFS: Fix directory rename error · cc49c032
      Pavel Shilovsky authored
      CIFS servers process nlink counts differently for files and directories.
      In cifs_rename(), if the request fails on the existing target, we
      try to remove it through cifs_unlink(), but this is not what we want
      to do for directories. As a result, the following sequence of commands
      
      mkdir {1,2}; mv -T 1 2; rmdir {1,2}; mkdir {1,2}; echo foo > 2/bar
      
      and XFS test generic/023 fail with an -ENOENT error. That is because the
      second mkdir reuses the existing inode (target inode of the mv -T
      command) with the S_DEAD flag.
      
      Fix this by checking whether the target is a directory or not and
      calling cifs_rmdir() rather than cifs_unlink() for directories.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      
      (cherry picked from commit a07d3220)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      cc49c032
    • Pavel Shilovsky's avatar
      CIFS: Fix wrong directory attributes after rename · 4f071738
      Pavel Shilovsky authored
      When we request a rename we also need to update the attributes
      of both source and target parent directories. Not doing it
      causes generic/309 xfstest to fail on SMB2 mounts. Fix this
      by marking these directories for forced revalidation.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      
      (cherry picked from commit b46799a8)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      4f071738
    • Pavel Shilovsky's avatar
      CIFS: Fix async reading on reconnects · 90413455
      Pavel Shilovsky authored
      If we get into read_into_pages() from cifs_readv_receive() and then
      lose the network connection, we issue cifs_reconnect, which moves all
      mids to a private list and issues their callbacks. The callback of the
      async read request sets a mid to retry, frees it and wakes up a process
      that waits on the rdata completion.
      
      After the connection is established we return from read_into_pages()
      with a short read, use the mid that was freed before and try to read
      the remaining data from the newly created socket. Both actions are
      not what we want to do. In reconnect cases (-EAGAIN) we should not
      mask off the error with a short read but should return the error
      code instead.
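      The return-value rule stated above can be sketched as a small helper.
      This is an illustration only: toy_read_result() is a hypothetical name
      and the real change lives in the CIFS read path.

```c
#include <assert.h>
#include <errno.h>

/*
 * got: bytes received before the receive path stopped (>= 0).
 * err: 0 or a negative errno reported by the receive path.
 *
 * A short read may hide most errors, but never -EAGAIN: after a
 * reconnect the caller must see the error and retry the whole request.
 */
long toy_read_result(long got, int err)
{
        assert(got >= 0 && err <= 0);
        if (got > 0 && err != -EAGAIN)
                return got;
        return err;
}
```

      With this rule, a partially filled read that was interrupted by a
      reconnect is reported as -EAGAIN rather than as a byte count.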
      Acked-by: Jeff Layton <jlayton@samba.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      
      (cherry picked from commit 038bc961)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      90413455
    • Pavel Shilovsky's avatar
      CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2 · e3af5d83
      Pavel Shilovsky authored
      The existing mapping causes unlink() call to return error after delete
      operation. Changing the mapping to -EACCES makes the client process
      the call like CIFS protocol does - reset dos attributes with ATTR_READONLY
      flag masked off and retry the operation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      
      (cherry picked from commit 21496687)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      e3af5d83
    • Ilya Dryomov's avatar
      libceph: do not hard code max auth ticket len · 0518223c
      Ilya Dryomov authored
      We hard code the cephx auth ticket buffer size to 256 bytes.  This isn't
      enough for any moderate setup and, in case tickets themselves are not
      encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but
      ceph_decode_copy() doesn't - it's just a memcpy() wrapper).  Since the
      buffer is allocated dynamically anyway, allocate it a bit later, at
      the point where we know how much is going to be needed.
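      The idea of sizing the buffer only once the length is known can be
      sketched like this. It is a simplified illustration, not the libceph
      code: toy_decode_blob() and the one-byte length prefix are assumptions
      made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy a length-prefixed blob out of [*p, end) into a buffer that is
 * allocated after the length has been read and bounds-checked, instead
 * of into a fixed-size buffer. Returns NULL on malformed input or
 * allocation failure; the caller frees the result.
 */
void *toy_decode_blob(const unsigned char **p, const unsigned char *end,
                      size_t *out_len)
{
        size_t len;
        void *buf;

        assert(p && *p && end && out_len);
        if (*p >= end)
                return NULL;            /* no room for the length byte */
        len = **p;
        (*p)++;
        if ((size_t)(end - *p) < len)
                return NULL;            /* claimed length overruns input */
        buf = malloc(len ? len : 1);    /* sized from the message itself */
        if (!buf)
                return NULL;
        memcpy(buf, *p, len);
        *p += len;
        *out_len = len;
        return buf;
}
```

      Because the copy length and the allocation size come from the same
      checked value, the memcpy() cannot run past the destination.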
      
      Fixes: http://tracker.ceph.com/issues/8979
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      
      (cherry picked from commit c27a3e4d)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      0518223c
    • Ilya Dryomov's avatar
      libceph: add process_one_ticket() helper · 5c06b980
      Ilya Dryomov authored
      Add a helper for processing individual cephx auth tickets.  Needed for
      the next commit, which deals with allocating ticket buffers.  (Most of
      the diff here is whitespace - view with git diff -b).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      
      (cherry picked from commit 597cda35)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      5c06b980
    • Sage Weil's avatar
      libceph: gracefully handle large reply messages from the mon · 19efa81f
      Sage Weil authored
      We preallocate a few of the message types we get back from the mon.  If we
      get a larger message than we are expecting, fall back to trying to allocate
      a new one instead of blindly using the one we have.
      
      CC: stable@vger.kernel.org
      Signed-off-by: Sage Weil <sage@redhat.com>
      Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      
      (cherry picked from commit 73c3d481)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      19efa81f
    • Chris Mason's avatar
      xfs: don't zero partial page cache pages during O_DIRECT writes · 925f62c7
      Chris Mason authored
      Similar to direct IO reads, direct IO writes are using
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      cc: stable@vger.kernel.org
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      xfs: don't zero partial page cache pages during O_DIRECT writes
      
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems, which
      only invalidate pages during DIO writes.
      
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      
      buffered reads will find an up to date page with zeros instead of
      the data actually on disk.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      [dchinner: catch error and warn if it fails. Comment.]
      
      cc: stable@vger.kernel.org
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      (cherry picked from commit 834ffca6
      85e584da)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      925f62c7
    • Dave Chinner's avatar
      xfs: don't zero partial page cache pages during O_DIRECT writes · fef33816
      Dave Chinner authored
      Similar to direct IO reads, direct IO writes are using
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      cc: stable@vger.kernel.org
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      xfs: don't zero partial page cache pages during O_DIRECT writes
      
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems, which
      only invalidate pages during DIO writes.
      
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      
      buffered reads will find an up to date page with zeros instead of
      the data actually on disk.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      [dchinner: catch error and warn if it fails. Comment.]
      
      cc: stable@vger.kernel.org
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      (cherry picked from commit 834ffca6
      85e584da)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      fef33816
    • Dave Chinner's avatar
      xfs: don't dirty buffers beyond EOF · e52c37f8
      Dave Chinner authored
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      
      where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because its bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
      what?
      
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_buffers_dirty(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      
      Note: the solutions the other filesystems and generic block code use
      of marking the buffers clean in ->writepage does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
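      The rule of not dirtying buffers beyond EOF can be sketched with a toy
      page model using the 1k block size and 4k page size from the example
      above. This is only an illustration of the check; struct toy_page and
      toy_set_page_dirty() are hypothetical and leave out the real buffer
      head locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE     4096u
#define TOY_BLOCK_SIZE    1024u
#define TOY_BUFS_PER_PAGE (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)

/* Hypothetical page with one dirty bit per block-sized buffer. */
struct toy_page {
        unsigned long long offset;      /* byte offset of the page in the file */
        bool dirty[TOY_BUFS_PER_PAGE];
};

/*
 * Mark dirty only the buffers that start before EOF and leave the rest
 * untouched. Returns the number of buffers that were marked.
 */
size_t toy_set_page_dirty(struct toy_page *page, unsigned long long eof)
{
        size_t i, marked = 0;

        assert(page != NULL);
        for (i = 0; i < TOY_BUFS_PER_PAGE; i++) {
                unsigned long long start = page->offset + i * TOY_BLOCK_SIZE;

                if (start < eof) {
                        page->dirty[i] = true;
                        marked++;
                }
        }
        return marked;
}
```

      For the page at 0x5e000 with EOF at 0x5e569, only the first two
      buffers start before EOF, so the third and fourth are never dirtied.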
      
      cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      (cherry picked from commit 22e757a4)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      e52c37f8
    • Dave Chinner's avatar
      xfs: quotacheck leaves dquot buffers without verifiers · a1d44560
      Dave Chinner authored
      When running xfs/305, I noticed that quotacheck was flushing dquot
      buffers that did not have the xfs_dquot_buf_ops verifiers attached:
      
      XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
      ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00  DQ....e.........
      ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
       ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
       ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
      Call Trace:
       [<ffffffff81cf1cca>] dump_stack+0x45/0x56
       [<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0
       [<ffffffff810db520>] ? wake_up_state+0x20/0x20
       [<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0
       [<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0
       [<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0
       [<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220
       [<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90
       [<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90
       [<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0
       [<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0
       [<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0
       [<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340
       [<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0
       [<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170
       [<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20
       [<ffffffff811ca8c9>] mount_fs+0x39/0x1b0
       [<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120
       [<ffffffff811e757e>] do_mount+0x23e/0xad0
       [<ffffffff8117abde>] ? __get_free_pages+0xe/0x50
       [<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150
       [<ffffffff811e8103>] SyS_mount+0x83/0xc0
       [<ffffffff81cfd40b>] tracesys+0xdd/0xe2
      
      This was caused by dquot buffer readahead not attaching a verifier
      structure to the buffer when readahead was issued, resulting in the
      followup read of the buffer finding a valid buffer and so not
      attaching new verifiers to the buffer as part of the read.
      
      Also, when a verifier failure occurs, we then read the buffer
      without verifiers. Attach the verifiers manually after this read so
      that if the buffer is then written it will be verified that the
      corruption has been repaired.
      
      Further, when flushing a dquot we don't ask for a verifier when
      reading in the dquot buffer the dquot belongs to. Most of the time
      this isn't an issue because the buffer is still cached, but when it
      is not cached it will result in writing the dquot buffer without
      having the verifier attached.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      (cherry picked from commit 5fd364fe)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      a1d44560
    • Doug Ledford's avatar
      RDMA/uapi: Include socket.h in rdma_user_cm.h · 22393801
      Doug Ledford authored
      Commit ee7aed45 ("RDMA/ucma: Support querying for AF_IB addresses")
      added struct sockaddr_storage to rdma_user_cm.h without also adding an
      include for linux/socket.h to make sure it is defined.  Systemtap
      needs the header files to build standalone and cannot rely on other
      files to pre-include other headers, so add linux/socket.h to the list
      of includes in this file.
      
      Fixes: ee7aed45 ("RDMA/ucma: Support querying for AF_IB addresses")
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      
      (cherry picked from commit db1044d4)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      22393801
    • Steve Wise's avatar
      RDMA/iwcm: Use a default listen backlog if needed · 0bdc342c
      Steve Wise authored
      If the user creates a listening cm_id with backlog of 0 the IWCM ends
      up not allowing any connection requests at all.  The correct behavior
      is for the IWCM to pick a default value if the user backlog parameter
      is zero.
      
      Lustre from version 1.8.8 onward uses a backlog of 0, which breaks
      iwarp support without this fix.
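      The intended behavior can be sketched as a one-line fallback. The
      helper name and the default value below are assumptions for
      illustration; the text above does not say which default the IWCM
      picks.

```c
#include <assert.h>

/* Hypothetical default, chosen only for this example. */
#define TOY_DEFAULT_BACKLOG 256

/* Treat a backlog of 0 as "use the default" instead of "accept nothing". */
int toy_effective_backlog(int backlog)
{
        assert(backlog >= 0);
        return backlog ? backlog : TOY_DEFAULT_BACKLOG;
}
```

      A caller that passes 0 then gets a usable listen queue, while any
      explicit non-zero value is honored unchanged.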
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      
      (cherry picked from commit 2f0304d2)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      0bdc342c
    • NeilBrown's avatar
      md/raid10: Fix memory leak when raid10 reshape completes. · 21b2d992
      NeilBrown authored
      When a raid10 commences a resync/recovery/reshape it allocates
      some buffer space.
      When a resync/recovery completes the buffer space is freed.  But not
      when the reshape completes.
      This can result in a small memory leak.
      
      There is a subtle side-effect of this bug.  When a RAID10 is reshaped
      to a larger array (more devices), the reshape is immediately followed
      by a "resync" of the new space.  This "resync" will use the buffer
      space which was allocated for "reshape".  This can cause problems
      including a "BUG" in the SCSI layer.  So this is suitable for -stable.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Fixes: 3ea7daa5
      Signed-off-by: NeilBrown <neilb@suse.de>
      
      (cherry picked from commit b3968552)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      21b2d992
    • NeilBrown's avatar
      md/raid10: fix memory leak when reshaping a RAID10. · 670acfbc
      NeilBrown authored
      raid10 reshape clears unwanted bits from a bio->bi_flags using
      a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
      was added.
      Since then it clears that bit but shouldn't.  This results in a
      memory leak.
      
      So change to use the approved method of clearing unwanted bits.
      
      As this causes a memory leak which can consume all of memory
      the fix is suitable for -stable.
      
      Fixes: a38352e0
      Cc: stable@vger.kernel.org (v3.10+)
      Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
      Signed-off-by: NeilBrown <neilb@suse.de>
      
      (cherry picked from commit ce0b0a46)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      670acfbc
    • NeilBrown's avatar
      md/raid6: avoid data corruption during recovery of double-degraded RAID6 · d21eb3fb
      NeilBrown authored
      During recovery of a double-degraded RAID6 it is possible for
      some blocks not to be recovered properly, leading to corruption.
      
      If a write happens to one block in a stripe that would be written to a
      missing device, and at the same time that stripe is recovering data
      to the other missing device, then that recovered data may not be written.
      
      This patch skips, in the double-degraded case, an optimisation that is
      only safe for single-degraded arrays.
      
      Bug was introduced in 2.6.32 and fix is suitable for any kernel since
      then.  In an older kernel with separate handle_stripe5() and
      handle_stripe6() functions the patch must change handle_stripe6().
      
      Cc: stable@vger.kernel.org (2.6.32+)
      Fixes: 6c0069c0
      Cc: Yuri Tikhonov <yur@emcraft.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      
      (cherry picked from commit 9c4bdf69)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      d21eb3fb
    • Vladimir Davydov's avatar
      Bluetooth: never linger on process exit · a4c06307
      Vladimir Davydov authored
      If the current process is exiting, lingering on socket close will make
      it unkillable, so we should avoid it.
      
      Reproducer:
      
        #include <sys/types.h>
        #include <sys/socket.h>
      
        #define BTPROTO_L2CAP   0
        #define BTPROTO_SCO     2
        #define BTPROTO_RFCOMM  3
      
        int main()
        {
                int fd;
                struct linger ling;
      
                fd = socket(PF_BLUETOOTH, SOCK_STREAM, BTPROTO_RFCOMM);
                //or: fd = socket(PF_BLUETOOTH, SOCK_DGRAM, BTPROTO_L2CAP);
                //or: fd = socket(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_SCO);
      
                ling.l_onoff = 1;
                ling.l_linger = 1000000000;
                setsockopt(fd, SOL_SOCKET, SO_LINGER, &ling, sizeof(ling));
      
                return 0;
        }
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Cc: stable@vger.kernel.org
      
      (cherry picked from commit 093facf3)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      a4c06307
    • Eric W. Biederman's avatar
      mnt: Change the default remount atime from relatime to the existing value · 3d55f426
      Eric W. Biederman authored
      Since March 2009, when no MS_...ATIME flags are passed, the kernel
      has defaulted to relatime.
      
      Defaulting to relatime instead of the existing atime state during a
      remount is silly, and causes problems in practice for people who don't
      specify any MS_...ATIME flags and expect to get the default filesystem
      atime setting.  Those users may encounter a permission error because
      the default atime setting does not work.
      
      A default that does not work and causes permission problems is
      ridiculous, so preserve the existing value to have a default
      atime setting that is always guaranteed to work.
      
      Using the default atime setting in this way is particularly
      interesting for applications built to run in restricted userspace
      environments without /proc mounted, as the existing atime mount
      options of a filesystem can not be read from /proc/mounts.
      
      In practice this fixes user space that relies on the default atime
      setting on remount and is broken by the permission checks that keep
      less privileged users from changing more privileged users' atime
      settings.
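
      The intended defaulting rule can be sketched as below. This is a
      simplified user-space model, not the kernel code: the TOY_* flag
      values and toy_atime_flags() are hypothetical, and the model ignores
      how the real atime flags may be combined with each other.

```c
/* Hypothetical flag values for this sketch; the real MS_ and MNT_ values differ. */
#define TOY_REMOUNT     0x01u
#define TOY_NOATIME     0x02u
#define TOY_NODIRATIME  0x04u
#define TOY_RELATIME    0x08u
#define TOY_STRICTATIME 0x10u
#define TOY_ATIME_MASK  (TOY_NOATIME | TOY_NODIRATIME | \
			 TOY_RELATIME | TOY_STRICTATIME)

/*
 * requested - flags passed by the caller (may include TOY_REMOUNT)
 * existing  - atime flags currently set on the mount
 *
 * Returns the atime flags the mount should end up with.
 */
unsigned int toy_atime_flags(unsigned int requested, unsigned int existing)
{
	unsigned int atime = requested & TOY_ATIME_MASK;

	/* The caller asked for a specific atime mode: honour it. */
	if (atime)
		return atime;

	/* Remount without atime flags: keep whatever the mount already has. */
	if (requested & TOY_REMOUNT)
		return existing & TOY_ATIME_MASK;

	/* Fresh mount without atime flags: default to relatime. */
	return TOY_RELATIME;
}
```

      In this model a remount of a noatime mount with no atime flags
      stays noatime, so it never asks for a change that the permission
      checks would refuse.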
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      
      (cherry picked from commit ffbc6f0e)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      3d55f426
    • Eric W. Biederman's avatar
      mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount · f6ac37e1
      Eric W. Biederman authored
      There are no races as locked mount flags are guaranteed to never change.
      
      Moving the test into do_remount makes it more visible, and ensures all
      filesystem remounts pass the MNT_LOCK_READONLY permission check.  This
      second case is not an issue today as filesystem remounts are guarded
      by capable(CAP_SYS_ADMIN) and thus will always fail in less privileged
      mount namespaces, but it could become an issue in the future.
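
      The ordering argued for above can be sketched as follows. This is a
      toy model with hypothetical names (toy_do_remount(), the TOY_MNT_*
      values, the bind_remount and privileged parameters); it only shows
      the locked read-only test running first, for every kind of remount,
      before any path-specific privilege check.

```c
#include <errno.h>

/* Hypothetical flag values for this sketch. */
#define TOY_MNT_READONLY      0x1u
#define TOY_MNT_LOCK_READONLY 0x2u

/*
 * current_flags - flags on the existing mount
 * new_flags     - flags requested by the remount
 * bind_remount  - nonzero if only per-mount flags change
 * privileged    - nonzero if the caller passes the privilege check
 *                 that guards filesystem remounts
 */
int toy_do_remount(unsigned int current_flags, unsigned int new_flags,
		   int bind_remount, int privileged)
{
	/* Checked up front, so both remount paths are covered. */
	if ((current_flags & TOY_MNT_LOCK_READONLY) &&
	    !(new_flags & TOY_MNT_READONLY))
		return -EPERM;

	/* Filesystem remounts additionally require privilege. */
	if (!bind_remount && !privileged)
		return -EPERM;

	return 0;
}
```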
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      
      (cherry picked from commit 07b64558)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      f6ac37e1
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Up rb_iter_peek() loop count to 3 · 0c1377c6
      Steven Rostedt (Red Hat) authored
      After writing a test to try to trigger the bug that caused the
      ring buffer iterator to become corrupted, I hit another bug:
      
       WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
       Modules linked in: ipt_MASQUERADE sunrpc [...]
       CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
        ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
        ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
       Call Trace:
        [<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
        [<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
        [<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
        [<ffffffff810c74a3>] ? s_start+0xd7/0x17b
        [<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
        [<ffffffff8114cf94>] ? seq_read+0x148/0x361
        [<ffffffff81132d98>] ? vfs_read+0x93/0xf1
        [<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
        [<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
      
      Debugging this bug, which triggers when the rb_iter_peek() loops too
      many times (more than 2 times), I discovered there's a case that can
      cause that function to legitimately loop 3 times!
      
      rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
      only deals with the reader page (it's for consuming reads). The
      rb_iter_peek() is for traversing the buffer without consuming it, and as
      such, it can loop for one more reason. That is, if we hit the end of
      the reader page or any page, it will go to the next page and try again.
      
      That is, we have this:
      
       1. iter->head > iter->head_page->page->commit
          (rb_inc_iter() which moves the iter to the next page)
          try again
      
       2. event = rb_iter_head_event()
          event->type_len == RINGBUF_TYPE_TIME_EXTEND
          rb_advance_iter()
          try again
      
       3. read the event.
      
      But we never get to 3, because the count is greater than 2 and we
      cause the WARNING and return NULL.
      
      Up the counter to 3.
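
      The three passes listed above can be reproduced with a small
      user-space model. Everything here is hypothetical (the toy_* types,
      toy_iter_peek() and its max_loops parameter); it mirrors the shape
      of the retry loop, not the real ring buffer structures.

```c
#include <stddef.h>

enum toy_type { TOY_DATA, TOY_TIME_EXTEND };

struct toy_event {
	enum toy_type type;
	int payload;
};

struct toy_page {
	const struct toy_event *events;
	size_t commit;		/* number of committed events on this page */
};

struct toy_iter {
	const struct toy_page *pages;
	size_t nr_pages;
	size_t page;		/* index of the current page */
	size_t head;		/* index of the next event on that page */
};

/* Return the next data event, or NULL once max_loops passes are used up. */
const struct toy_event *toy_iter_peek(struct toy_iter *iter, int max_loops)
{
	const struct toy_page *pg;
	const struct toy_event *ev;
	int nr_loops = 0;

again:
	if (++nr_loops > max_loops)
		return NULL;	/* the kernel would WARN at this point */
	if (iter->page >= iter->nr_pages)
		return NULL;	/* nothing left to read */

	pg = &iter->pages[iter->page];
	if (iter->head >= pg->commit) {
		/* Pass 1: end of page, move to the next page and retry. */
		iter->page++;
		iter->head = 0;
		goto again;
	}

	ev = &pg->events[iter->head];
	if (ev->type == TOY_TIME_EXTEND) {
		/* Pass 2: skip the time extend and retry. */
		iter->head++;
		goto again;
	}

	/* Pass 3: a data event. */
	return ev;
}

/* Iterator parked at the end of page 0; page 1 starts with a time extend. */
int toy_peek_at_page_end(int max_loops)
{
	static const struct toy_event first[] = { { TOY_DATA, 1 } };
	static const struct toy_event second[] = {
		{ TOY_TIME_EXTEND, 0 }, { TOY_DATA, 42 }
	};
	static const struct toy_page pages[] = { { first, 1 }, { second, 2 } };
	struct toy_iter iter = { pages, 2, 0, 1 };
	const struct toy_event *ev = toy_iter_peek(&iter, max_loops);

	return ev ? ev->payload : -1;
}
```

      In this model toy_peek_at_page_end(2) gives up before reaching the
      data event, while toy_peek_at_page_end(3) returns its payload.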
      
      Cc: stable@vger.kernel.org # 2.6.37+
      Fixes: 69d1b839 "ring-buffer: Bind time extend and data events together"
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      
      (cherry picked from commit 021de3d9)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      0c1377c6
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Always reset iterator to reader page · bc630959
      Steven Rostedt (Red Hat) authored
      When performing a consuming read, the ring buffer swaps out a
      page from the ring buffer with an empty page and this page that
      was swapped out becomes the new reader page. The reader page
      is owned by the reader and since it was swapped out of the ring
      buffer, writers do not have access to it (there's an exception
      to that rule, but it's out of scope for this commit).
      
      When reading the "trace" file, it is a non consuming read, which
      means that the data in the ring buffer will not be modified.
      When the trace file is opened, a ring buffer iterator is allocated
      and writes to the ring buffer are disabled, such that the iterator
      will not have issues iterating over the data.
      
      Although the ring buffer disabled writes, it does not disable other
      reads, or even consuming reads. If a consuming read happens, then
      the iterator is reset and starts reading from the beginning again.
      
      My tests would sometimes trigger this bug on my i386 box:
      
      WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
      Modules linked in:
      CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
      Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
       00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
       f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
       ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80
      Call Trace:
       [<c18796b3>] dump_stack+0x4b/0x75
       [<c103a0e3>] warn_slowpath_common+0x7e/0x95
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c103a185>] warn_slowpath_fmt+0x33/0x35
       [<c10bd85a>] __trace_find_cmdline+0x66/0xaa
       [<c10bed04>] trace_find_cmdline+0x40/0x64
       [<c10c3c16>] trace_print_context+0x27/0xec
       [<c10c4360>] ? trace_seq_printf+0x37/0x5b
       [<c10c0b15>] print_trace_line+0x319/0x39b
       [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
       [<c10c13b1>] s_show+0x192/0x1ab
       [<c10bfd9a>] ? s_next+0x5a/0x7c
       [<c112e76e>] seq_read+0x267/0x34c
       [<c1115a25>] vfs_read+0x8c/0xef
       [<c112e507>] ? seq_lseek+0x154/0x154
       [<c1115ba2>] SyS_read+0x54/0x7f
       [<c188488e>] syscall_call+0x7/0xb
      ---[ end trace 3f507febd6b4cc83 ]---
      >>>> ##### CPU 1 buffer started ####
      
      Which was the __trace_find_cmdline() function complaining about the pid
      in the event record being negative.
      
      After adding more test cases, this would trigger more often. Strangely
      enough, it would never trigger on a single test, but instead would trigger
      only when running all the tests. I believe that was the case because it
      required one of the tests to be shutting down via delayed instances while
      a new test started up.
      
      After spending several days debugging this, I found that it was caused by
      the iterator becoming corrupted. Debugging further, I found out why
      the iterator became corrupted. It happened with the rb_iter_reset().
      
      As consuming reads may not read the full reader page, and only part
      of it, there's a "read" field to know where the last read took place.
      The iterator must also start at the read position. In the rb_iter_reset()
      code, if the reader page was disconnected from the ring buffer, the iterator
      would start at the head page within the ring buffer (where writes still
      happen). But the mistake there was that it still used the "read" field
      to start the iterator on the head page, where it should always start
      at zero because readers never read from within the ring buffer where
      writes occur.
      
      I originally wrote a patch to have it set the iter->head to 0 instead
      of iter->head_page->read, but then I questioned why it wasn't always
      setting the iter to point to the reader page, as the reader page is
      still valid.  The list_empty(reader_page->list) just means that it was
      successful in swapping out. But the reader_page may still have data.
      
      There was a bug report a long time ago that was not reproducible that
      had something about trace_pipe (consuming read) not matching trace
      (iterator read). This may explain why that happened.
      
      Anyway, the correct answer to this bug is to always use the reader page
      and not reset the iterator to inside the writable ring buffer.
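
      The difference between the two reset strategies can be sketched with
      a toy model. The toy_* structures and function names are
      hypothetical and only capture the fields discussed above (the
      reader page, the head page inside the ring, and the "read" offset).

```c
#include <stddef.h>

struct toy_page {
	size_t read;	/* offset up to which a consuming reader has read */
};

struct toy_buffer {
	struct toy_page *reader_page;	/* owned by the reader */
	struct toy_page *head_page;	/* first page still inside the ring */
	int reader_detached;		/* reader page was swapped out of the ring */
};

struct toy_iter {
	struct toy_page *page;
	size_t head;
};

/* Behaviour described as buggy: may start inside the writable ring. */
void toy_iter_reset_old(struct toy_iter *iter, const struct toy_buffer *buf)
{
	if (buf->reader_detached) {
		iter->page = buf->head_page;
		iter->head = buf->head_page->read;	/* offset is meaningless here */
	} else {
		iter->page = buf->reader_page;
		iter->head = buf->reader_page->read;
	}
}

/* Behaviour described as the fix: always start on the reader page. */
void toy_iter_reset_fixed(struct toy_iter *iter, const struct toy_buffer *buf)
{
	iter->page = buf->reader_page;
	iter->head = buf->reader_page->read;
}

/* Returns 1 if the reset iterator sits on the reader page at its read offset. */
int toy_reset_lands_on_reader_page(int use_fixed)
{
	struct toy_page reader = { 24 };
	struct toy_page head = { 96 };
	struct toy_buffer buf = { &reader, &head, 1 };
	struct toy_iter iter = { NULL, 0 };

	if (use_fixed)
		toy_iter_reset_fixed(&iter, &buf);
	else
		toy_iter_reset_old(&iter, &buf);

	return iter.page == &reader && iter.head == reader.read;
}
```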
      
      Cc: stable@vger.kernel.org # 2.6.28+
      Fixes: d769041f "ring_buffer: implement new locking"
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      
      (cherry picked from commit 651e22f2)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      bc630959
    • Hans de Goede's avatar
      ACPI / video: Disable native_backlight on HP ENVY 15 Notebook PC · 646d4ee3
      Hans de Goede authored
      Link: https://bugs.freedesktop.org/show_bug.cgi?id=81515
      Reported-and-tested-by: Hohahiu <rakothedin@gmail.com>
      Cc: 3.16+ <stable@vger.kernel.org> # 3.16+
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      
      (cherry picked from commit 84c34858)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      646d4ee3