1. 10 Aug, 2017 18 commits
    • locking: Introduce smp_mb__after_spinlock() · d89e588c
      Peter Zijlstra authored
      Since the inception of smp_mb__before_spinlock(), our understanding of
      ACQUIRE, especially as applied to spinlocks, has changed somewhat. Also,
      I wonder whether, with a simple change, we can make it provide more.
      
      The problem with the comment on smp_mb__before_spinlock() is that the
      STORE done by spin_lock() isn't itself ordered by the ACQUIRE, and
      therefore a later LOAD can pass over it and cross with any prior STORE,
      rendering the default WMB insufficient (pointed out by Alan).
      
      Now, this is only really a problem on PowerPC and ARM64, both of
      which already defined smp_mb__before_spinlock() as a smp_mb().
      
      At the same time, we can get a much stronger construct if we place
      that same barrier _inside_ the spin_lock(). In that case we upgrade
      the RCpc spinlock to an RCsc.  That would make all schedule() calls
      fully transitive against one another.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
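      The construct described in this message can be illustrated with a
      userspace C11 analogy. This is only a sketch under stated assumptions:
      the names toy_spinlock, toy_spin_lock(), toy_mb_after_spinlock() and
      toy_demo() are hypothetical, and the C11 fence merely stands in for the
      kernel barrier; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy lock for illustration; not the kernel's spinlock. */
struct toy_spinlock {
    atomic_bool locked;
};

/* ACQUIRE-only lock: accesses after the lock cannot move before the
 * acquisition, but nothing here orders earlier stores against later loads. */
static void toy_spin_lock(struct toy_spinlock *l)
{
    assert(l != NULL);
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
        ; /* spin until the previous value was 'false' */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Analogue of smp_mb__after_spinlock(): a full fence issued right after
 * taking the lock, ordering accesses before the lock against those after. */
static void toy_mb_after_spinlock(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Store, take the lock, fence, then load; returns the value loaded. */
static int toy_demo(void)
{
    static struct toy_spinlock lock;   /* zero-initialised: unlocked */
    static atomic_int flag;
    int seen;

    atomic_store_explicit(&flag, 1, memory_order_relaxed);    /* prior STORE */
    toy_spin_lock(&lock);
    toy_mb_after_spinlock();           /* full barrier after the ACQUIRE */
    seen = atomic_load_explicit(&flag, memory_order_relaxed); /* later LOAD */
    toy_spin_unlock(&lock);
    return seen;
}
```

      Placing the fence directly after the lock acquisition is the design
      point here: callers that need the stronger ordering pay for it, while
      ordinary lock users do not.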
    • overlayfs, locking: Remove smp_mb__before_spinlock() usage · ff7a5fb0
      Peter Zijlstra authored
      While we could replace the smp_mb__before_spinlock() with the new
      smp_mb__after_spinlock(), the normal pattern is to use
      smp_store_release() to publish an object that is used for
      lockless_dereference() -- and mirrors the regular rcu_assign_pointer()
      / rcu_dereference() patterns.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
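      The publish/dereference pattern named in this message can be sketched in
      userspace C11. The names toy_obj, toy_published, toy_publish() and
      toy_read() are hypothetical; a release store stands in for
      smp_store_release(), and an acquire load is used as a conservative
      stand-in for the dependency ordering of lockless_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object type used only for this illustration. */
struct toy_obj {
    int value;
};

/* Shared pointer; NULL until an object has been published. */
static _Atomic(struct toy_obj *) toy_published;

/* Writer: initialise the object fully, then publish the pointer with
 * RELEASE semantics so the initialisation is visible to readers. */
static void toy_publish(struct toy_obj *obj, int value)
{
    assert(obj != NULL);
    obj->value = value;
    atomic_store_explicit(&toy_published, obj, memory_order_release);
}

/* Reader: lockless load of the pointer, then dereference.
 * Returns -1 if nothing has been published yet. */
static int toy_read(void)
{
    struct toy_obj *obj =
        atomic_load_explicit(&toy_published, memory_order_acquire);

    return obj ? obj->value : -1;
}
```

      A reader that sees a non-NULL pointer is guaranteed to see the fields
      written before the release store, which is why no extra barrier around
      a lock is needed for this use.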
    • mm, locking: Rework {set,clear,mm}_tlb_flush_pending() · 8b1b436d
      Peter Zijlstra authored
      Commit:
      
        af2c1401 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
      
      added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
      can solve the same problem without this barrier.
      
      If instead we mandate that mm_tlb_flush_pending() is used while
      holding the PTL we're guaranteed to observe prior
      set_tlb_flush_pending() instances.
      
      For this to work we need to rework migrate_misplaced_transhuge_page()
      a little and move the test up into do_huge_pmd_numa_page().
      
      NOTE: this relies on flush_tlb_range() to guarantee:
      
         (1) it ensures that prior page table updates are visible to the
             page table walker and
         (2) it ensures that subsequent memory accesses are only made
             visible after the invalidation has completed
      
      This is required for architectures that implement TRANSPARENT_HUGEPAGE
      (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
      mm_tlb_flush_pending() in their page-table operations (arm, arm64,
      x86).
      
      This appears true for:
      
       - arm (DSB ISB before and after),
       - arm64 (DSB ISHST before, and DSB ISH after),
       - powerpc (PTESYNC before and after),
       - s390 and x86 TLB invalidate are serializing instructions
      
      But I failed to understand the situation for:
      
       - arc, mips, sparc
      
      Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
      and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
      inside the PTL. It still needs to guarantee the PTL unlock happens
      _after_ the invalidate completes.
      
      Vineet, Ralf and Dave could you guys please have a look?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Documentation/locking/atomic: Add documents for new atomic_t APIs · 706eeb3e
      Peter Zijlstra authored
      Since we've vastly expanded the atomic_t interface in recent years the
      existing documentation is woefully out of date and people seem to get
      confused a bit.
      
      Start a new document to hopefully better explain the current state of
      affairs.
      
      The old atomic_ops.txt also covers bitmaps and a few more details, so
      this is not a full replacement and we'll therefore keep that document
      around until we've managed to write enough text to cover it in its
      entirety.
      
      Also please, ReST people, go away.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • clocksource/arm_arch_timer: Use static_branch_enable_cpuslocked() · 450f9689
      Marc Zyngier authored
      Use the new static_branch_enable_cpuslocked() function to switch
      the workaround static key on the CPU hotplug path.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-5-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Provide hotplug context variants · 5a40527f
      Marc Zyngier authored
      As using the normal static key API under the hotplug lock is
      pretty much impossible, let's provide variants of some of the
      functions that require the hotplug lock to have already been taken.
      
      These functions are only meant to be used in CPU hotplug callbacks.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-4-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Split out code under the hotplug lock · 8b7b4128
      Marc Zyngier authored
      In order to later introduce an "already locked" version of some
      of the static key functions, let's split the code into the core stuff
      (the *_cpuslocked functions) and the usual helpers, which now
      take/release the hotplug lock and call into the _cpuslocked
      versions.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-3-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
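      The split described in this message (a core *_cpuslocked function plus
      a helper that takes and releases the lock around it) can be sketched in
      userspace C. Everything here is hypothetical: a pthread mutex stands in
      for the CPU hotplug lock, and toy_lock_held is a crude stand-in for a
      "lock is held" assertion.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the CPU hotplug lock. */
static pthread_mutex_t toy_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static int toy_lock_held;   /* debug flag, only written under the lock */
static int toy_key_count;

static void toy_cpus_lock(void)
{
    pthread_mutex_lock(&toy_hotplug_lock);
    toy_lock_held = 1;
}

static void toy_cpus_unlock(void)
{
    toy_lock_held = 0;
    pthread_mutex_unlock(&toy_hotplug_lock);
}

/* Core: the caller must already hold the hotplug lock. */
static void toy_key_inc_cpuslocked(void)
{
    assert(toy_lock_held);
    toy_key_count++;
}

/* Usual helper: take the lock, call the core, release the lock. */
static void toy_key_inc(void)
{
    toy_cpus_lock();
    toy_key_inc_cpuslocked();
    toy_cpus_unlock();
}

/* A hotplug-style callback runs with the lock already held, so it must use
 * the _cpuslocked variant; calling toy_key_inc() here would self-deadlock. */
static void toy_hotplug_callback(void)
{
    toy_key_inc_cpuslocked();
}

/* Exercise both entry points; returns the resulting count. */
static int toy_run(void)
{
    toy_key_inc();
    toy_cpus_lock();
    toy_hotplug_callback();
    toy_cpus_unlock();
    return toy_key_count;
}
```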
    • jump_label: Move CPU hotplug locking · b70cecf4
      Marc Zyngier authored
      As we're about to rework the locking, let's move the taking and
      release of the CPU hotplug lock to locations that will make its
      reworking completely obvious.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-2-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Add RELEASE barrier after text changes · d0646a6f
      Peter Zijlstra authored
      In the unlikely case that text modification does not fully order things,
      add some extra ordering of our own to ensure we only enable the fast
      path after all text is visible.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpuset: Make nr_cpusets private · be040bea
      Paolo Bonzini authored
      Any use of key->enabled (that is static_key_enabled and static_key_count)
      outside jump_label_lock should handle its own serialization.  In the case
      of cpusets_enabled_key, the key is always incremented/decremented under
      cpuset_mutex, and hence the same rule applies to nr_cpusets.  The rule
      *is* respected currently, but the mutex is static so nr_cpusets should
      be static too.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-4-git-send-email-pbonzini@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Do not use unserialized static_key_enabled() · 7a34bcb8
      Paolo Bonzini authored
      Any use of key->enabled (that is static_key_enabled and static_key_count)
      outside jump_label_lock should handle its own serialization.  The only
      two that are not doing so are the UDP encapsulation static keys.  Change
      them to use static_key_enable, which now correctly tests key->enabled under
      the jump label lock.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-3-git-send-email-pbonzini@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Fix concurrent static_key_enable/disable() · 1dbb6704
      Paolo Bonzini authored
      static_key_enable/disable are trying to cap the static key count to
      0/1.  However, their use of key->enabled is outside jump_label_lock
      so they do not really ensure that.
      
      Rewrite them to do a quick check for an already enabled (respectively,
      already disabled) key, and then recheck under the jump label lock.  Unlike
      static_key_slow_inc/dec, a failed check under the jump label lock does
      not modify key->enabled.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-2-git-send-email-pbonzini@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
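      The "quick check, then recheck under the lock" shape described in this
      message can be sketched in userspace C. The names toy_key,
      toy_jump_label_lock, toy_key_enable() and toy_key_disable() are
      hypothetical, a pthread mutex stands in for the jump label lock, and the
      code patching that the real functions perform is omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the jump label lock. */
static pthread_mutex_t toy_jump_label_lock = PTHREAD_MUTEX_INITIALIZER;

struct toy_key {
    atomic_int enabled;   /* kept at 0 or 1 by enable/disable */
};

static void toy_key_enable(struct toy_key *key)
{
    assert(key != NULL);

    /* Quick lockless check: nothing to do if already enabled. */
    if (atomic_load(&key->enabled) > 0)
        return;

    pthread_mutex_lock(&toy_jump_label_lock);
    /* Recheck under the lock; a failed check leaves 'enabled' untouched. */
    if (atomic_load(&key->enabled) == 0)
        atomic_store(&key->enabled, 1);
    pthread_mutex_unlock(&toy_jump_label_lock);
}

static void toy_key_disable(struct toy_key *key)
{
    assert(key != NULL);

    /* Quick lockless check: nothing to do if already disabled. */
    if (atomic_load(&key->enabled) == 0)
        return;

    pthread_mutex_lock(&toy_jump_label_lock);
    /* Recheck under the lock before changing the state. */
    if (atomic_load(&key->enabled) > 0)
        atomic_store(&key->enabled, 0);
    pthread_mutex_unlock(&toy_jump_label_lock);
}
```

      Because the decision to modify the count is only ever taken under the
      lock, repeated or concurrent calls cannot push the value outside 0/1.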
    • locking/rwsem-xadd: Add killable versions of rwsem_down_read_failed() · 83ced169
      Kirill Tkhai authored
      Rename rwsem_down_read_failed() to __rwsem_down_read_failed_common()
      and teach it to abort waiting if a signal is pending and a killable
      state argument is passed.
      
      Note that we shouldn't wake anybody up in the EINTR path, because:
      
      We check for (waiter.task) under the spinlock before we go to the
      out_nolock path. The current task could not be woken up, so there is
      either a writer owning the sem or a writer that is the first waiter.
      In both cases we shouldn't wake anybody. If there is a writer owning
      the sem and we were the only waiter, remove RWSEM_WAITING_BIAS, as
      there are no waiters anymore.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arnd@arndb.de
      Cc: avagin@virtuozzo.com
      Cc: davem@davemloft.net
      Cc: fenghua.yu@intel.com
      Cc: gorcunov@virtuozzo.com
      Cc: heiko.carstens@de.ibm.com
      Cc: hpa@zytor.com
      Cc: ink@jurassic.park.msu.ru
      Cc: mattst88@gmail.com
      Cc: rth@twiddle.net
      Cc: schwidefsky@de.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/149789534632.9059.2901382369609922565.stgit@localhost.localdomain
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem-spinlock: Add killable versions of __down_read() · 0aa1125f
      Kirill Tkhai authored
      Rename __down_read() to __down_read_common() and teach it
      to abort waiting if a signal is pending and a killable
      state argument is passed.
      
      Note that we shouldn't wake anybody up in the EINTR path, because:
      
      We check signal_pending_state() after the (!waiter.task)
      test and under the spinlock. So the current task could not
      have been woken up. This can happen in two cases: a writer owns
      the sem, or a writer is the first waiter of the sem.
      
      If a writer owns the sem, no one else may work
      with it in parallel. The writer will wake somebody when it
      calls up_write() or downgrade_write().
      
      If a writer is the first waiter, it will be woken up
      when the last active reader releases the sem and
      sem->count becomes 0.
      
      Also note that set_current_state() may be moved down
      to schedule() (after the !waiter.task check), as all
      assignments in this type of semaphore (including wake_up)
      occur under the spinlock, so we can't miss anything.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arnd@arndb.de
      Cc: avagin@virtuozzo.com
      Cc: davem@davemloft.net
      Cc: fenghua.yu@intel.com
      Cc: gorcunov@virtuozzo.com
      Cc: heiko.carstens@de.ibm.com
      Cc: hpa@zytor.com
      Cc: ink@jurassic.park.msu.ru
      Cc: mattst88@gmail.com
      Cc: rth@twiddle.net
      Cc: schwidefsky@de.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/149789533283.9059.9829416940494747182.stgit@localhost.localdomain
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/osq_lock: Fix osq_lock queue corruption · 50972fe7
      Prateek Sood authored
      Fix ordering of link creation between node->prev and prev->next in
      osq_lock(). Consider a case in which the optimistic spin queue is
      CPU6->CPU2 and CPU6 has acquired the lock.
      
              tail
                v
        ,-. <- ,-.
        |6|    |2|
        `-' -> `-'
      
      At this point if CPU0 comes in to acquire osq_lock, it will update the
      tail count.
      
        CPU2			CPU0
        ----------------------------------
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-' -> `-'    `-'
      
      After the tail count update, if CPU2 starts to unqueue itself from the
      optimistic spin queue, it will find the tail count updated by CPU0 and
      set CPU2's node->next to NULL in osq_wait_next().
      
        unqueue-A
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-B
      
        ->tail != curr && !node->next
      
      If the following stores are reordered, then prev->next, where prev is
      CPU2, would be updated to point to the CPU0 node:
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-'    `-' -> `-'
      
        osq_wait_next()
          node->next <- 0
          xchg(node->next, NULL)
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-C
      
      At this point, if the next instruction
      	WRITE_ONCE(next->prev, prev);
      in the CPU2 path is committed before CPU0's node->prev = prev update,
      then CPU0's node->prev will point to the CPU6 node.
      
      	       tail
          v----------. v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
           `----------^
      
      If CPU0's node->prev = prev store is then committed, CPU0's prev
      changes back to the CPU2 node, while CPU2's node->next is currently
      NULL,
      
      				       tail
      			                 v
      			  ,-. <- ,-. <- ,-.
      			  |6|    |2|    |0|
      			  `-'    `-'    `-'
      			     `----------^
      
      so if CPU0 enters the unqueue path of osq_lock() it will spin forever,
      as the condition prev->next == node will never become true.
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      [ Added pictures, rewrote comments. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
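      The ordering requirement between the node->prev and prev->next stores
      can be illustrated with a userspace C11 sketch. This is an assumption
      about the shape of such an ordering, not the patch itself: toy_node,
      toy_link() and toy_is_linked() are hypothetical, and a release store
      is used here as one way to order the two writes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical queue node with the two links discussed above. */
struct toy_node {
    _Atomic(struct toy_node *) next;
    _Atomic(struct toy_node *) prev;
};

/* Link 'node' behind 'prev'. node->prev is written first; the RELEASE
 * store to prev->next then publishes the node, so a thread that observes
 * the new prev->next with an acquire load also observes node->prev. */
static void toy_link(struct toy_node *prev, struct toy_node *node)
{
    assert(prev != NULL && node != NULL);
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&node->prev, prev, memory_order_relaxed);
    atomic_store_explicit(&prev->next, node, memory_order_release);
}

/* The condition the unqueue path waits for: prev->next == node. */
static int toy_is_linked(struct toy_node *node)
{
    struct toy_node *prev;

    assert(node != NULL);
    prev = atomic_load_explicit(&node->prev, memory_order_acquire);
    return prev != NULL &&
           atomic_load_explicit(&prev->next, memory_order_acquire) == node;
}
```

      With the stores ordered this way, another CPU cannot act on the new
      prev->next while the node's own prev assignment is still pending, which
      is the window the diagrams above walk through.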
    • locking/atomic: Fix atomic_set_release() for 'funny' architectures · 9d664c0a
      Peter Zijlstra authored
      Those architectures that have a special atomic_set implementation also
      need a special atomic_set_release(), because for the very same reason
      WRITE_ONCE() is broken for them, smp_store_release() is too.
      
      The vast majority are architectures with spinlock-hash-based atomic
      implementations, except hexagon, which seems to have a hardware 'feature'.
      
      The spinlock based atomics should be SC, that is, none of them appear to
      place extra barriers in atomic_cmpxchg() or any of the other SC atomic
      primitives and therefore seem to rely on their spinlock implementation
      being SC (I did not fully validate all that).
      
      Therefore, the normal atomic_set() is SC and can be used as
      atomic_set_release().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: davem@davemloft.net
      Cc: james.hogan@imgtec.com
      Cc: jejb@parisc-linux.org
      Cc: rkuo@codeaurora.org
      Cc: vgupta@synopsys.com
      Link: http://lkml.kernel.org/r/20170609110506.yod47flaav3wgoj5@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
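      Why a lock-hashed atomic implementation needs a special atomic_set(),
      and why atomic_set_release() can then simply reuse it, can be sketched
      in userspace C. All names (toy_atomic_t, toy_locks, toy_lock_for() and
      the toy_atomic_* helpers) are hypothetical, and pthread mutexes stand
      in for the hashed spinlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 4

typedef struct { int counter; } toy_atomic_t;

/* Hypothetical hashed lock table; every atomic op on a given variable
 * takes the lock its address hashes to. */
static pthread_mutex_t toy_locks[TOY_HASH_SIZE] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

static pthread_mutex_t *toy_lock_for(const toy_atomic_t *v)
{
    assert(v != NULL);
    return &toy_locks[((uintptr_t)v / sizeof(*v)) % TOY_HASH_SIZE];
}

/* Read-modify-write under the hashed lock. */
static int toy_atomic_add_return(int i, toy_atomic_t *v)
{
    pthread_mutex_t *l = toy_lock_for(v);
    int ret;

    pthread_mutex_lock(l);
    v->counter += i;
    ret = v->counter;
    pthread_mutex_unlock(l);
    return ret;
}

/* A plain store could land in the middle of a locked read-modify-write
 * and be overwritten, so the set operation takes the same lock. */
static void toy_atomic_set(toy_atomic_t *v, int i)
{
    pthread_mutex_t *l = toy_lock_for(v);

    pthread_mutex_lock(l);
    v->counter = i;
    pthread_mutex_unlock(l);
}

/* Following the message's reasoning, the locked set is reused as the
 * release variant; this toy simply aliases the two. */
#define toy_atomic_set_release(v, i) toy_atomic_set((v), (i))
```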
    • sched/wait: Remove the lockless swait_active() check in swake_up*() · 35a2897c
      Boqun Feng authored
      Steven Rostedt reported a potential race in RCU core because of
      swake_up():
      
              CPU0                            CPU1
              ----                            ----
                                      __call_rcu_core() {
      
                                       spin_lock(rnp_root)
                                       need_wake = __rcu_start_gp() {
                                        rcu_start_gp_advanced() {
                                         gp_flags = FLAG_INIT
                                        }
                                       }
      
       rcu_gp_kthread() {
         swait_event_interruptible(wq,
              gp_flags & FLAG_INIT) {
         spin_lock(q->lock)
      
                                      *fetch wq->task_list here! *
      
         list_add(wq->task_list, q->task_list)
         spin_unlock(q->lock);
      
         *fetch old value of gp_flags here *
      
                                       spin_unlock(rnp_root)
      
                                       rcu_gp_kthread_wake() {
                                        swake_up(wq) {
                                         swait_active(wq) {
                                          list_empty(wq->task_list)
      
                                         } * return false *
      
        if (condition) * false *
          schedule();
      
      In this case, a wakeup is missed, which could cause the rcu_gp_kthread
      to wait for a long time.
      
      The reason for this is that we do a lockless swait_active() check in
      swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
      before swait_active() to provide the proper order or 2) simply remove
      the swait_active() in swake_up().
      
      Solution 2 not only fixes this problem but also keeps the swait and
      wait APIs as close as possible, as wake_up() doesn't provide a full
      barrier and doesn't do a lockless check of the wait queue either.
      Moreover, there are users already using swait_active() to do their quick
      checks for the wait queues, so it makes less sense for swake_up() and
      swake_up_all() to do this on their own.
      
      This patch then removes the lockless swait_active() check in swake_up()
      and swake_up_all().
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Krister Johansen <kjlx@templeofstupid.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
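      The approach chosen in this message (no lockless emptiness check in
      the wake-up path) can be sketched with a toy wait queue in userspace C.
      The names toy_waiter, toy_swait_queue, toy_prepare_to_wait() and
      toy_swake_up() are hypothetical, and a pthread mutex stands in for
      q->lock; real sleeping and scheduling are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical waiter and queue types for illustration only. */
struct toy_waiter {
    struct toy_waiter *next;
    int woken;
};

struct toy_swait_queue {
    pthread_mutex_t lock;
    struct toy_waiter *head;
};

/* Waiter side: enqueue under the queue lock before testing the condition. */
static void toy_prepare_to_wait(struct toy_swait_queue *q, struct toy_waiter *w)
{
    assert(q != NULL && w != NULL);
    pthread_mutex_lock(&q->lock);
    w->woken = 0;
    w->next = q->head;
    q->head = w;
    pthread_mutex_unlock(&q->lock);
}

/* Waker side: the list is inspected only while holding the lock, so a
 * waiter that finished enqueueing before the lock was taken cannot be
 * missed. Returns 1 if a waiter was woken, 0 if the queue was empty. */
static int toy_swake_up(struct toy_swait_queue *q)
{
    struct toy_waiter *w;

    assert(q != NULL);
    pthread_mutex_lock(&q->lock);
    w = q->head;
    if (w != NULL) {
        q->head = w->next;
        w->woken = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return w != NULL;
}
```

      Callers that still want a cheap early exit can do their own emptiness
      check with whatever barrier their condition requires, which keeps the
      ordering decision with the code that knows about it.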
    • Ingo Molnar's avatar
      388f8e12
  2. 09 Aug, 2017 14 commits
    • Merge tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 8d31f80e
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
       "These are the pin control fixes I have gathered since the return from
        my vacation. They boiled in -next a while so let's get them in.
      
        Apart from the documentation build it is purely driver fixes. Which is
        nice. The Intel fixes seem kind of important.
      
         - Fix the documentation build as the docs were moved
      
         - Correct the UART pin list on the Intel Merrifield
      
         - Fix pin assignment and number of pins on the Marvell Armada 37xx
           pin controller
      
         - Cover the Setzer models in the Chromebook DMI quirk in the Intel
           cherryview driver so they start working
      
         - Add the missing "sim" function to the sunxi driver
      
         - Fix USB pin definitions on Uniphier Pro4
      
         - Smatch fix for invalid reference in the zx pin control driver"
      
      * tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: generic: update references to Documentation/pinctrl.txt
        pinctrl: intel: merrifield: Correct UART pin lists
        pinctrl: armada-37xx: Fix number of pin in south bridge
        pinctrl: armada-37xx: Fix the pin 23 on south bridge
        pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirk
        pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver
        pinctrl: uniphier: fix USB3 pin assignment for Pro4
        pinctrl: zte: fix dereference of 'data' in zx_set_mux()
    • futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Mel Gorman authored
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption was missed and
      the first warning assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              /* Return the raw syscall result so callers can see failures. */
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 358f8c26
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "The main thing is to allow empty id_tables for ACPI to make some
        drivers get probed again. It looks a bit bigger than usual because it
        needs some internal renaming, too.
      
        Other than that, there is a fix for broken DSTDs, a super simple
        enablement for ARM MPS, and two documentation fixes which I'd like to
        see in v4.13 already"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: rephrase explanation of I2C_CLASS_DEPRECATED
        i2c: allow i2c-versatile for ARM MPS platforms
        i2c: designware: Some broken DSTDs use 1MiHz instead of 1MHz
        i2c: designware: Print clock freq on invalid clock freq error
        i2c: core: Allow empty id_table in ACPI case as well
        i2c: mux: pinctrl: mention correct module name in Kconfig help text
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 31cf92f3
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Three patches that should go into this release.
      
        Two of them are from Paolo and fix up some corner cases with BFQ, and
        the last patch is from Ming and fixes up a potential usage count
        imbalance regression due to the recent NOWAIT work"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed
        block, bfq: consider also in_service_entity to state whether an entity is active
        block, bfq: reset in_service_entity if it becomes idle
    • Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · d555eb6b
      Linus Torvalds authored
      Pull crypto fixes from Herbert Xu:
       "Fix two regressions in the inside-secure driver with respect to
        hmac(sha1)"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: inside-secure - fix the sha state length in hmac_sha1_setkey
        crypto: inside-secure - fix invalidation check in hmac_sha1_setkey
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 4530cca1
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "The pull requests are getting smaller, that's progress I suppose :-)
      
         1) Fix infinite loop in CIPSO option parsing, from Yujuan Qi.
      
         2) Fix remote checksum handling in VXLAN and GUE tunneling drivers,
            from Koichiro Den.
      
         3) Missing u64_stats_init() calls in several drivers, from Florian
            Fainelli.
      
         4) TCP can set the congestion window to an invalid ssthresh value
            after congestion window reductions, from Yuchung Cheng.
      
         5) Fix BPF jit branch generation on s390, from Daniel Borkmann.
      
         6) Correct MIPS ebpf JIT merge, from David Daney.
      
         7) Correct byte order test in BPF test_verifier.c, from Daniel
            Borkmann.
      
         8) Fix various crashes and leaks in ASIX driver, from Dean Jenkins.
      
         9) Handle SCTP checksums properly in mlx4 driver, from Davide
            Caratti.
      
        10) We can potentially enter tcp_connect() with a cached route
            already, due to fastopen, so we have to explicitly invalidate it.
      
        11) skb_warn_bad_offload() can bark in legitimate situations, fix from
            Willem de Bruijn"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
        net: avoid skb_warn_bad_offload false positives on UFO
        qmi_wwan: fix NULL deref on disconnect
        ppp: fix xmit recursion detection on ppp channels
        rds: Reintroduce statistics counting
        tcp: fastopen: tcp_connect() must refresh the route
        net: sched: set xt_tgchk_param par.net properly in ipt_init_target
        net: dsa: mediatek: add adjust link support for user ports
        net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
        qed: Fix a memory allocation failure test in 'qed_mcp_cmd_init()'
        hysdn: fix to a race condition in put_log_buffer
        s390/qeth: fix L3 next-hop in xmit qeth hdr
        asix: Fix small memory leak in ax88772_unbind()
        asix: Ensure asix_rx_fixup_info members are all reset
        asix: Add rx->ax_skb = NULL after usbnet_skb_return()
        bpf: fix selftest/bpf/test_pkt_md_access on s390x
        netvsc: fix race on sub channel creation
        bpf: fix byte order test in test_verifier
        xgene: Always get clk source, but ignore if it's missing for SGMII ports
        MIPS: Add missing file for eBPF JIT.
        bpf, s390: fix build for libbpf and selftest suite
        ...
    • net: avoid skb_warn_bad_offload false positives on UFO · 8d63bee6
      Willem de Bruijn authored
      skb_warn_bad_offload triggers a warning when an skb enters the GSO
      stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
      checksum offload set.
      
      Commit b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      observed that SKB_GSO_DODGY producers can trigger the check and
      that passing those packets through the GSO handlers will fix it
      up. However, the software UFO handler sets ip_summed to
      CHECKSUM_NONE.
      
      When __skb_gso_segment is called from the receive path, this
      triggers the warning again.
      
      Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
      Tx these two are equivalent. On Rx, this better matches the
      skb state (checksum computed), as CHECKSUM_NONE here means no
      checksum computed.
      
      See also this thread for context:
      http://patchwork.ozlabs.org/patch/799015/
      
      Fixes: b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qmi_wwan: fix NULL deref on disconnect · bbae08e5
      Bjørn Mork authored
      qmi_wwan_disconnect is called twice when disconnecting devices with
      separate control and data interfaces.  The first invocation will set
      the interface data to NULL for both interfaces to flag that the
      disconnect has been handled.  But the matching NULL check was left
      out when qmi_wwan_disconnect was added, resulting in this oops:
      
        usb 2-1.4: USB disconnect, device number 4
        qmi_wwan 2-1.4:1.6 wwp0s29u1u4i6: unregister 'qmi_wwan' usb-0000:00:1d.0-1.4, WWAN/QMI device
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
        IP: qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan]
        PGD 0
        P4D 0
        Oops: 0000 [#1] SMP
        Modules linked in: <stripped irrelevant module list>
        CPU: 2 PID: 33 Comm: kworker/2:1 Tainted: G            E   4.12.3-nr44-normandy-r1500619820+ #1
        Hardware name: LENOVO 4291LR7/4291LR7, BIOS CBET4000 4.6-810-g50522254fb 07/21/2017
        Workqueue: usb_hub_wq hub_event [usbcore]
        task: ffff8c882b716040 task.stack: ffffb8e800d84000
        RIP: 0010:qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan]
        RSP: 0018:ffffb8e800d87b38 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff8c8824f3f1d0 RDI: ffff8c8824ef6400
        RBP: ffff8c8824ef6400 R08: 0000000000000000 R09: 0000000000000000
        R10: ffffb8e800d87780 R11: 0000000000000011 R12: ffffffffc07ea0e8
        R13: ffff8c8824e2e000 R14: ffff8c8824e2e098 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff8c8835300000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000e0 CR3: 0000000229ca5000 CR4: 00000000000406e0
        Call Trace:
         ? usb_unbind_interface+0x71/0x270 [usbcore]
         ? device_release_driver_internal+0x154/0x210
         ? qmi_wwan_unbind+0x6d/0xc0 [qmi_wwan]
         ? usbnet_disconnect+0x6c/0xf0 [usbnet]
         ? qmi_wwan_disconnect+0x87/0xc0 [qmi_wwan]
         ? usb_unbind_interface+0x71/0x270 [usbcore]
         ? device_release_driver_internal+0x154/0x210
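
The guard that was missing can be sketched in plain userspace C. This is only an illustration of the pattern described above, assuming a handler that is called once per interface and a first call that clears the private data of both interfaces; `fake_intf`, `fake_disconnect` and `teardown_count` are hypothetical names, not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a USB interface carrying driver-private data. */
struct fake_intf {
    void *drvdata;   /* set to NULL once the disconnect has been handled */
};

static int teardown_count;   /* counts how many real teardowns ran */

/*
 * Disconnect handler that may be invoked once per interface. The first
 * call clears the private data of both interfaces; a later call sees
 * NULL and returns early instead of dereferencing it.
 */
static void fake_disconnect(struct fake_intf *intf, struct fake_intf *peer)
{
    assert(intf != NULL);

    if (intf->drvdata == NULL)
        return;              /* already handled by the earlier invocation */

    teardown_count++;
    intf->drvdata = NULL;
    if (peer != NULL)
        peer->drvdata = NULL;
}
```

Calling `fake_disconnect()` on the control interface and then on the data interface runs the teardown once; the second call returns at the NULL check.
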
      Reported-and-tested-by: Nathaniel Roach <nroach44@gmail.com>
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Cc: Daniele Palmas <dnlplm@gmail.com>
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ppp: fix xmit recursion detection on ppp channels · 0a0e1a85
      Guillaume Nault authored
      Commit e5dadc65 ("ppp: Fix false xmit recursion detect with two ppp
      devices") dropped the xmit_recursion counter incrementation in
      ppp_channel_push() and relied on ppp_xmit_process() for this task.
      But __ppp_channel_push() can also send packets directly (using the
      .start_xmit() channel callback), in which case the xmit_recursion
      counter isn't incremented anymore. If such packets get routed back to
      the parent ppp unit, ppp_xmit_process() won't notice the recursion and
      will call ppp_channel_push() on the same channel, effectively creating
      the deadlock situation that the xmit_recursion mechanism was supposed
      to prevent.
      
      This patch re-introduces the xmit_recursion counter incrementation in
      ppp_channel_push(). Since the xmit_recursion variable is now part of
      the parent ppp unit, incrementation is skipped if the channel doesn't
      have any. This is fine because only packets routed through the parent
      unit may enter the channel recursively.
      
      Finally, we have to ensure that pch->ppp is not going to be modified
      while executing ppp_channel_push(). Instead of taking the ->upl lock
      only while calling ppp_xmit_process(), we now have to hold it for the
      full ppp_channel_push() execution. This respects the ppp lock ordering,
      which requires locking ->upl before ->downl.
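
The recursion guard described above can be sketched as a small userspace model. It assumes a single-threaded counter owned by the parent unit and omits all locking and per-CPU handling; `fake_unit`, `fake_channel`, `fake_unit_xmit` and `fake_channel_push` are hypothetical names, not the ppp driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parent unit that owns the counter. */
struct fake_unit {
    int xmit_recursion;
};

struct fake_channel {
    struct fake_unit *unit;  /* NULL when the channel has no parent unit */
    bool loops_back;         /* simulate a packet routed back to the unit */
    int sent;                /* packets handed to the lower layer */
};

static bool fake_channel_push(struct fake_channel *ch);

/* Unit transmit path: refuse to re-enter while a transmit is in flight. */
static bool fake_unit_xmit(struct fake_unit *unit, struct fake_channel *ch)
{
    bool ok;

    if (unit->xmit_recursion > 0)
        return false;        /* recursion detected: bail out */

    unit->xmit_recursion++;
    ok = fake_channel_push(ch);
    unit->xmit_recursion--;
    return ok;
}

/* Channel push path: it can send directly, so it also bumps the counter. */
static bool fake_channel_push(struct fake_channel *ch)
{
    bool ok = true;

    if (ch->unit != NULL)
        ch->unit->xmit_recursion++;   /* skipped if there is no parent */

    ch->sent++;                       /* direct send to the lower layer */
    if (ch->loops_back && ch->unit != NULL)
        ok = fake_unit_xmit(ch->unit, ch);

    if (ch->unit != NULL) {
        assert(ch->unit->xmit_recursion > 0);
        ch->unit->xmit_recursion--;
    }
    return ok;
}
```

In this model, a packet sent directly from `fake_channel_push()` and routed back to the unit is rejected by `fake_unit_xmit()`, because the counter is already non-zero. Without the increment in the push path, the two functions would call each other without bound.
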
      
      Fixes: e5dadc65 ("ppp: Fix false xmit recursion detect with two ppp devices")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: Reintroduce statistics counting · 05bfd7db
      Håkon Bugge authored
      In commit 7e3f2952 ("rds: don't let RDS shutdown a connection
      while senders are present"), refilling the receive queue was removed
      from rds_ib_recv(), along with the increment of
      s_ib_rx_refill_from_thread.
      
      Commit 73ce4317 ("RDS: make sure we post recv buffers")
      re-introduces filling the receive queue from rds_ib_recv(), but does
      not add the statistics counter. rds_ib_recv() was later renamed to
      rds_ib_recv_path().
      
      This commit reintroduces the statistics counting of
      s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq.
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Reviewed-by: Knut Omang <knut.omang@oracle.com>
      Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
      Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fastopen: tcp_connect() must refresh the route · 8ba60924
      Eric Dumazet authored
      With the new TCP_FASTOPEN_CONNECT socket option, tcp_connect() can be
      called while the socket's sk_dst_cache is either NULL or invalid.
      
       +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
       +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
       +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
       +0 connect(4, ..., ...) = 0
      
      << sk->sk_dst_cache becomes obsolete, or even set to NULL >>
      
       +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
      
      We need to refresh the route otherwise bad things can happen,
      especially when syzkaller is running on the host :/
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: set xt_tgchk_param par.net properly in ipt_init_target · ec0acb09
      Xin Long authored
      The xt_tgchk_param par in ipt_init_target is a local variable, and
      par.net is not initialized there. When xt_check_target later calls
      the target's checkentry, which may access par.net, this causes a
      kernel panic.
      
      Jaroslav found this panic when running:
      
        # ip link add TestIface type dummy
        # tc qd add dev TestIface ingress handle ffff:
        # tc filter add dev TestIface parent ffff: u32 match u32 0 0 \
          action xt -j CONNMARK --set-mark 4
      
      This patch passes the net parameter into ipt_init_target and uses
      it to set par.net there.

      v1->v2:
        As Wang Cong pointed out, I missed that ipt_net_id != xt_net_id;
        fix it by also passing net_id to __tcf_ipt_init.
      v2->v3:
        Add the missing Fixes tag.
      
      Fixes: ecb2421b ("netfilter: add and use nf_ct_netns_get/put")
      Reported-by: Jaroslav Aster <jaster@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mediatek: add adjust link support for user ports · 8e6f1521
      John Crispin authored
      Manually adjust the port settings of user ports once PHY polling has
      completed. This patch extends the adjust_link callback to configure the
      per port PMCR register, applying the proper values polled from the PHY.
      Without this patch, flow control was not always set up properly.
      Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
      Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets · e718fe45
      Davide Caratti authored
      If the NIC fails to validate the checksum on TCP/UDP, and validation of
      the IP checksum is successful, the driver subtracts the pseudo-header
      checksum from the value obtained by the hardware and sets
      CHECKSUM_COMPLETE. Don't do that if the protocol is IPPROTO_SCTP;
      otherwise CRC32c validation fails.
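
The decision described above can be sketched as a small standalone function. This is a simplified model, not the mlx4 driver's code: `fake_csum` and `fake_rx_csum` are hypothetical names, the pseudo-header subtraction is left out, and only the protocol number (132 is the IANA value for SCTP) is taken from the real world.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative result values; the kernel uses its own CHECKSUM_* constants. */
enum fake_csum { FAKE_CSUM_NONE, FAKE_CSUM_COMPLETE };

#define FAKE_PROTO_TCP   6     /* IANA protocol numbers */
#define FAKE_PROTO_UDP   17
#define FAKE_PROTO_SCTP  132

/*
 * Decide how to report the receive checksum when the hardware could not
 * validate the L4 checksum itself but did validate the IP header.
 */
static enum fake_csum fake_rx_csum(bool ip_csum_ok, unsigned char l4_proto)
{
    if (!ip_csum_ok)
        return FAKE_CSUM_NONE;

    /* SCTP uses CRC32c, not the Internet checksum: leave it to software. */
    if (l4_proto == FAKE_PROTO_SCTP)
        return FAKE_CSUM_NONE;

    return FAKE_CSUM_COMPLETE;
}
```

Reporting "none" for SCTP in this sketch stands for letting the stack verify the CRC32c itself rather than trusting a ones-complement sum that does not apply to it.
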
      
      V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set
      Reported-by: Shuang Li <shuali@redhat.com>
      Fixes: f8c6455b ("net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 08 Aug, 2017 8 commits