1. 23 Jan, 2017 1 commit
    • userns: Make ucounts lock irq-safe · 880a3854
      Nikolay Borisov authored
      The ucounts_lock is being used to protect various ucounts lifecycle
      management functionalities. However, those services can also be invoked
      when a pidns is being freed in an RCU callback (e.g. softirq context).
      This can lead to deadlocks. There were already efforts trying to
      prevent similar deadlocks in add7c65c ("pid: fix lockdep deadlock
      warning due to ucount_lock"), however they just moved the context
      from hardirq to softirq. Fix this issue once and for all by explicitly
      making the lock disable irqs altogether.
      
      Dmitry Vyukov <dvyukov@google.com> reported:
      
      > I've got the following deadlock report while running syzkaller fuzzer
      > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
      > device if it matters):
      >
      > =================================
      > [ INFO: inconsistent lock state ]
      > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
      > ---------------------------------
      > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      >  (ucounts_lock){+.?...}, at: [<     inline     >] spin_lock
      > ./include/linux/spinlock.h:302
      >  (ucounts_lock){+.?...}, at: [<ffff2000081678c8>]
      > put_ucounts+0x60/0x138 kernel/ucount.c:162
      > {SOFTIRQ-ON-W} state was registered at:
      > [<ffff2000081c82d8>] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2941
      > [<ffff2000081c97a8>] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<     inline     >] get_ucounts kernel/ucount.c:131
      > [<ffff200008167c28>] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
      > [<     inline     >] inc_mnt_namespaces fs/namespace.c:2818
      > [<ffff200008481850>] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
      > [<ffff200008487298>] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
      > [<     inline     >] init_mount_tree fs/namespace.c:3199
      > [<ffff200009bd6674>] mnt_init+0x258/0x384 fs/namespace.c:3251
      > [<ffff200009bd60bc>] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
      > [<ffff200009bb1114>] start_kernel+0x414/0x460 init/main.c:648
      > [<ffff200009bb01e8>] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
      > irq event stamp: 2316924
      > hardirqs last  enabled at (2316924): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2911
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last  enabled at (2316924): [<ffff200008210414>]
      > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
      > hardirqs last disabled at (2316923): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2900
      > hardirqs last disabled at (2316923): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last disabled at (2316923): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last disabled at (2316923): [<ffff20000820fe80>]
      > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
      > softirqs last  enabled at (2316912): [<ffff20000811b4c4>]
      > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
      > softirqs last disabled at (2316913): [<     inline     >]
      > do_softirq_own_stack ./include/linux/interrupt.h:488
      > softirqs last disabled at (2316913): [<     inline     >]
      > invoke_softirq kernel/softirq.c:371
      > softirqs last disabled at (2316913): [<ffff20000811c994>]
      > irq_exit+0x264/0x308 kernel/softirq.c:405
      >
      > other info that might help us debug this:
      >  Possible unsafe locking scenario:
      >
      >        CPU0
      >        ----
      >   lock(ucounts_lock);
      >   <Interrupt>
      >     lock(ucounts_lock);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by swapper/2/0:
      >  #0:  (rcu_callback){......}, at: [<     inline     >] __rcu_reclaim
      > kernel/rcu/rcu.h:108
      >  #0:  (rcu_callback){......}, at: [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2919
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      >  #0:  (rcu_callback){......}, at: [<ffff200008210390>]
      > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
      >
      > stack backtrace:
      > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
      > Hardware name: Hardkernel ODROID-C2 (DT)
      > Call trace:
      > [<ffff20000808fa60>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
      > [<ffff20000808fec0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
      > [<ffff2000088a99e0>] dump_stack+0x110/0x168
      > [<ffff2000082fa2b4>] print_usage_bug.part.27+0x49c/0x4bc
      > kernel/locking/lockdep.c:2387
      > [<     inline     >] print_usage_bug kernel/locking/lockdep.c:2357
      > [<     inline     >] valid_state kernel/locking/lockdep.c:2400
      > [<     inline     >] mark_lock_irq kernel/locking/lockdep.c:2617
      > [<ffff2000081c89ec>] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2923
      > [<ffff2000081c9a60>] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<ffff2000081678c8>] put_ucounts+0x60/0x138 kernel/ucount.c:162
      > [<ffff200008168364>] dec_ucount+0xf4/0x158 kernel/ucount.c:214
      > [<     inline     >] dec_pid_namespaces kernel/pid_namespace.c:89
      > [<ffff200008293dc8>] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
      > [<     inline     >] __rcu_reclaim kernel/rcu/rcu.h:118
      > [<     inline     >] rcu_do_batch kernel/rcu/tree.c:2919
      > [<     inline     >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > [<     inline     >] __rcu_process_callbacks kernel/rcu/tree.c:3149
      > [<ffff2000082103d8>] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
      > [<ffff2000080821dc>] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
      > [<     inline     >] do_softirq_own_stack ./include/linux/interrupt.h:488
      > [<     inline     >] invoke_softirq kernel/softirq.c:371
      > [<ffff20000811c994>] irq_exit+0x264/0x308 kernel/softirq.c:405
      > [<ffff2000081ecc28>] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
      > [<ffff200008081c80>] gic_handle_irq+0x68/0xd8
      > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
      > 7dc0:                                   ffff8000648d4b3c 0000000000000007
      > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
      > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
      > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
      > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
      > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
      > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
      > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
      > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
      > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
      > [<ffff2000080837f8>] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
      > [<     inline     >] arch_local_irq_restore
      > ./arch/arm64/include/asm/irqflags.h:81
      > [<ffff20000820a47c>] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
      > [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:200
      > [<ffff2000081bcbfc>] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
      > [<ffff2000081bd1cc>] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
      > [<ffff200008099f8c>] secondary_start_kernel+0x2cc/0x358
      > arch/arm64/kernel/smp.c:276
      > [<000000000279f1a4>] 0x279f1a4
      
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: add7c65c ("pid: fix lockdep deadlock warning due to ucount_lock")
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Link: https://www.spinics.net/lists/kernel/msg2426637.html
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  2. 10 Jan, 2017 4 commits
    • sysctl: Drop reference added by grab_header in proc_sys_readdir · 93362fa4
      Zhou Chengming authored
      Fixes CVE-2016-9191: proc_sys_readdir doesn't drop the reference
      added by grab_header when it returns from the !dir_emit_dots path.
      As a result, any path that calls unregister_sysctl_table can
      wait forever.
      
      The calltrace of CVE-2016-9191:
      
      [ 5535.960522] Call Trace:
      [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
      [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
      [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
      [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
      [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
      [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
      [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
      [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
      [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
      [ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
      [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
      [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
      [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
      [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
      [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
      [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
      [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
      [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
      [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
      [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
      [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
      [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
      [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
      [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
      [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
      [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
      [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
      [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
      
      One cgroup maintainer mentioned that "cgroup is trying to offline
      a cpuset css, which takes place under cgroup_mutex.  The offlining
      ends up trying to drain active usages of a sysctl table which apparently
      is not happening."
      The real reason is that proc_sys_readdir doesn't drop the reference added
      by grab_header when it returns from the !dir_emit_dots path, so this
      cpuset offline path waits here forever.
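
      The leak pattern described above can be modeled in userspace. This is a
      minimal sketch, not the kernel code: struct header, grab(), finish(), and
      both readdir functions are hypothetical stand-ins for the sysctl header
      use count and for proc_sys_readdir.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a sysctl table header with a use count. */
struct header {
    int used;
};

static void grab(struct header *h)   { h->used++; }  /* models grab_header */
static void finish(struct header *h) { h->used--; }  /* models dropping the reference */

/* Buggy shape: the early return skips finish(), so the count never drops. */
static int readdir_leaky(struct header *h, bool dots_ok)
{
    grab(h);
    if (!dots_ok)
        return 0;           /* reference leaked on this path */
    /* ... entries would be emitted here ... */
    finish(h);
    return 0;
}

/* Fixed shape: every exit funnels through one label that drops the reference. */
static int readdir_balanced(struct header *h, bool dots_ok)
{
    grab(h);
    if (!dots_ok)
        goto out;
    /* ... entries would be emitted here ... */
out:
    finish(h);
    return 0;
}
```

      In this model, readdir_leaky(&h, false) leaves h.used at 1, the state in
      which a waiter for the count to reach zero would never be woken, while
      readdir_balanced() returns the count to 0 on both paths.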
      
      See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13
      
      Fixes: f0c3b509 ("[readdir] convert procfs")
      Cc: stable@vger.kernel.org
      Reported-by: CAI Qian <caiqian@redhat.com>
      Tested-by: Yang Shukui <yangshukui@huawei.com>
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pid: fix lockdep deadlock warning due to ucount_lock · add7c65c
      Andrei Vagin authored
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
      ---------------------------------------------------------
      swapper/1/0 just changed the state of lock:
       (&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (ucounts_lock){+.+...}
      and interrupts could create inverse lock ordering between them.
      other info that might help us debug this:
      Chain exists of:                 &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(ucounts_lock);
                                     local_irq_disable();
                                     lock(&(&sighand->siglock)->rlock);
                                     lock(&(&tty->ctrl_lock)->rlock);
        <Interrupt>
          lock(&(&sighand->siglock)->rlock);
      
       *** DEADLOCK ***
      
      This patch removes a dependency between rlock and ucount_lock.
      
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount · 75422726
      Eric W. Biederman authored
      Add MS_KERNMOUNT to the flags that are passed.
      Use sget_userns and force &init_user_ns instead of calling sget so that
      even if called from a weird context the internal filesystem will be
      considered to be in the initial user namespace.
      
      Luis Ressel reported that the failure to pass MS_KERNMOUNT into
      mount_pseudo broke his in-development graphics driver that uses the
      generic drm infrastructure.  I am not certain the driver was bug
      free in its usage of that infrastructure, but since
      mount_pseudo_xattr can never be triggered by userspace it is clearer,
      less error prone, and less problematic for the code to be explicit.
      
      Reported-by: Luis Ressel <aranea@aixah.de>
      Tested-by: Luis Ressel <aranea@aixah.de>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman authored
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire, at which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen at the same time.
      
      Krister Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisoned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch, moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and replaced
      new_mountpoint with get_mountpoint, a function that does the hash table
      lookup and addition under the mount_lock.  The introduction of get_mountpoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: Krister Johansen <kjlx@templeofstupid.com>
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  3. 01 Jan, 2017 2 commits
    • Linux 4.10-rc2 · 0c744ea4
      Linus Torvalds authored
    • Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 4759d386
      Linus Torvalds authored
      Pull DAX updates from Dan Williams:
       "The completion of Jan's DAX work for 4.10.
      
        As I mentioned in the libnvdimm-for-4.10 pull request, these are some
        final fixes for the DAX dirty-cacheline-tracking invalidation work
        that was merged through the -mm, ext4, and xfs trees in -rc1. These
        patches were prepared prior to the merge window, but we waited for
        4.10-rc1 to have a stable merge base after all the prerequisites were
        merged.
      
        Quoting Jan on the overall changes in these patches:
      
           "So I'd like all these 6 patches to go for rc2. The first three
            patches fix invalidation of exceptional DAX entries (a bug which
            is there for a long time) - without these patches data loss can
            occur on power failure even though user called fsync(2). The other
            three patches change locking of DAX faults so that ->iomap_begin()
            is called in a more relaxed locking context and we are safe to
            start a transaction there for ext4"
      
        These have received a build success notification from the kbuild
        robot, and pass the latest libnvdimm unit tests. There have not been
        any -next releases since -rc1, so they have not appeared there"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        ext4: Simplify DAX fault path
        dax: Call ->iomap_begin without entry lock during dax fault
        dax: Finish fault completely when loading holes
        dax: Avoid page invalidation races and unnecessary radix tree traversals
        mm: Invalidate DAX radix tree entries only if appropriate
        ext2: Return BH_New buffers for zeroed blocks
  4. 30 Dec, 2016 2 commits
  5. 29 Dec, 2016 2 commits
    • mm/filemap: fix parameters to test_bit() · 98473f9f
      Olof Johansson authored
       mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
        mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
          return test_bit(PG_waiters);
               ^~~~~~~~
      
      Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()')
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Brown-paper-bag-by: Linus Torvalds <dummy@duh.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Linus Torvalds authored
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
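
      The semantics of the primitive can be sketched in portable C11. This is a
      userspace illustration under stated assumptions, not the kernel's x86
      implementation: LOCK_BIT at position 0 and both function names are
      hypothetical, and a compiler is free to implement the value-returning
      atomic_fetch_and_explicit() differently from the single "and" plus
      sign-flag check described above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_BIT    0u  /* assumed position of the lock bit */
#define WAITERS_BIT 7u  /* waiters bit placed in the sign bit of the low byte */

/* Generic fallback shape: clear the lock bit, then read the word a second
 * time to test bit 7. The second access is the one the commit wants to avoid. */
static bool clear_then_test(atomic_ulong *word)
{
    atomic_fetch_and_explicit(word, ~(1ul << LOCK_BIT), memory_order_release);
    return (atomic_load_explicit(word, memory_order_relaxed) >> WAITERS_BIT) & 1ul;
}

/* Single-operation shape: one atomic AND clears the lock bit, and the
 * waiters bit is derived from that same operation's result. */
static bool clear_and_test_once(atomic_ulong *word)
{
    unsigned long mask = ~(1ul << LOCK_BIT);
    unsigned long old = atomic_fetch_and_explicit(word, mask, memory_order_release);

    return ((old & mask) >> WAITERS_BIT) & 1ul;
}
```

      Both functions return true when a waiter is flagged after the unlock; the
      second one touches the shared word only once.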
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, just the unlock_page() costs were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 28 Dec, 2016 2 commits
    • Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 2d706e79
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "This fixes a hash corruption bug in the marvell driver"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: marvell - Copy IVDIG before launching partial DMA ahash requests
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 8f18e4d0
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.
      
          The most important is to not assume the packet is RX just because
          the destination address matches that of the device. Such an
          assumption causes problems when an interface is put into loopback
          mode.
      
       2) If we retry when creating a new tc entry (because we dropped the
          RTNL mutex in order to load a module, for example) we end up with
          -EAGAIN and then loop trying to replay the request. But we didn't
          reset some state when looping back to the top like this, and if
          another thread meanwhile inserted the same tc entry we were trying
          to, we re-link it creating an endless loop in the tc chain. Fix from
          Daniel Borkmann.
      
       3) There are two different WRITE bits in the MDIO address register for
          the stmmac chip, depending upon the chip variant. Due to a bug we
          could set them both, fix from Hock Leong Kweh.
      
       4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: stmmac: fix incorrect bit set in gmac4 mdio addr register
        r8169: add support for RTL8168 series add-on card.
        net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
        openvswitch: upcall: Fix vlan handling.
        ipv4: Namespaceify tcp_tw_reuse knob
        net: korina: Fix NAPI versus resources freeing
        net, sched: fix soft lockup in tc_classify
        net/mlx4_en: Fix user prio field in XDP forward
        tipc: don't send FIN message from connectionless socket
        ipvlan: fix multicast processing
        ipvlan: fix various issues in ipvlan_process_multicast()
  7. 27 Dec, 2016 17 commits
  8. 26 Dec, 2016 5 commits
    • arm64: don't pull uaccess.h into *.S · b4b8664d
      Al Viro authored
      Split asm-only parts of arm64 uaccess.h into a new header and use that
      from *.S.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • net: korina: Fix NAPI versus resources freeing · e6afb1ad
      Florian Fainelli authored
      Commit beb0babf ("korina: disable napi on close and restart")
      introduced calls to napi_disable() that were missing before.
      Unfortunately, this leaves a small window during which NAPI has a chance
      to run, yet we just freed resources since korina_free_ring() has been
      called:
      
      Fix this by disabling NAPI first and then freeing resources, and make sure
      that we also cancel the restart task before doing the resource freeing.
      
      Fixes: beb0babf ("korina: disable napi on close and restart")
      Reported-by: Alexandros C. Couloumbis <alex@ozo.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, sched: fix soft lockup in tc_classify · 628185cf
      Daniel Borkmann authored
      Shahar reported a soft lockup in tc_classify(), where we run into an
      endless loop when walking the classifier chain due to tp->next == tp
      which is a state we should never run into. The issue only seems to
      trigger under load in the tc control path.
      
      What happens is that in tc_ctl_tfilter(), thread A allocates a new
      tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
      with it. In that classifier callback we had to unlock/lock the rtnl
      mutex and returned with -EAGAIN. One reason why we need to drop there
      is, for example, that we need to request an action module to be loaded.
      
      This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
      after we loaded and found the requested action, we need to redo the
      whole request so we don't race against others. While we had to unlock
      rtnl in that time, thread B's request was processed next on that CPU.
      Thread B added a new tp instance successfully to the classifier chain.
      When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
      and destroying its tp instance which never got linked, we goto replay
      and redo A's request.
      
      This time when walking the classifier chain in tc_ctl_tfilter() for
      checking for existing tp instances we had a priority match and found
      the tp instance that was created and linked by thread B. Now calling
      again into tp->ops->change() with that tp was successful and returned
      without error.
      
      tp_created was never cleared in the second round, thus the kernel thinks
      that we need to link it into the classifier chain (once again). tp and
      *back point to the same object due to the match we had earlier on. Thus
      for thread B's already public tp, we reset tp->next to tp itself and
      link it into the chain, which eventually causes the mentioned endless
      loop in tc_classify() once a packet hits the data path.
      
      Fix is to clear tp_created at the beginning of each request, also when
      we replay it. On the paths that can cause -EAGAIN we already destroy
      the original tp instance we had and on replay we really need to start
      from scratch. It seems that this issue was first introduced in commit
      12186be7 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
      and avoid kernel panic when we use cls_cgroup").
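
      The replay sequence above can be reproduced in a small userspace model.
      This is a simplified sketch under stated assumptions, not the kernel
      function: struct tp, change(), ctl_tfilter(), model_reset(), and the
      clear_on_replay switch are hypothetical names, and thread B's insertion
      is simulated inside the first change() call, which stands in for the
      window where the rtnl mutex was dropped.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct tp {
    int prio;
    struct tp *next;
};

static struct tp tp_a, tp_b;  /* filters of thread A and thread B */
static int change_calls;

/* Stand-in for tp->ops->change(). The first call models the unlocked
 * window: thread B links tp_b into the chain, then A gets -EAGAIN. */
static int change(struct tp **chain, int prio)
{
    if (change_calls++ == 0) {
        tp_b.prio = prio;
        tp_b.next = *chain;
        *chain = &tp_b;
        return -EAGAIN;
    }
    return 0;
}

/* Reset the model between runs. */
static void model_reset(struct tp **chain)
{
    *chain = NULL;
    change_calls = 0;
    tp_a.next = NULL;
    tp_b.next = NULL;
}

/* Model of thread A's request. clear_on_replay selects the fixed behavior. */
static void ctl_tfilter(struct tp **chain, int prio, bool clear_on_replay)
{
    struct tp **back, *tp;
    int tp_created = 0;

replay:
    if (clear_on_replay)
        tp_created = 0;  /* the fix: start every round from scratch */

    /* Look for an existing filter with this priority. */
    for (back = chain; (tp = *back) != NULL; back = &tp->next)
        if (tp->prio == prio)
            break;

    if (tp == NULL) {
        tp = &tp_a;
        tp->prio = prio;
        tp->next = NULL;
        tp_created = 1;
    }

    if (change(chain, prio) == -EAGAIN)
        goto replay;  /* the unlinked tp is dropped, the request restarts */

    if (tp_created) {  /* a stale flag links an already linked tp again */
        tp->next = *back;
        *back = tp;
    }
}
```

      With clear_on_replay set to false the model ends with tp_b.next pointing
      at tp_b itself, the self-loop that a chain walk would never leave; with
      true, tp_b stays linked exactly once.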
      
      Fixes: 12186be7 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
      Reported-by: Shahar Klein <shahark@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Shahar Klein <shahark@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Linux 4.10-rc1 · 7ce7d89f
      Linus Torvalds authored
    • powerpc: Fix build warning on 32-bit PPC · 8ae679c4
      Larry Finger authored
      I am getting the following warning when I build kernel 4.9-git on my
      PowerBook G4 with a 32-bit PPC processor:
      
          AS      arch/powerpc/kernel/misc_32.o
        arch/powerpc/kernel/misc_32.S:299:7: warning: "CONFIG_FSL_BOOKE" is not defined [-Wundef]
      
      This problem is evident after commit 989cea5c ("kbuild: prevent
      lib-ksyms.o rebuilds"); however, this change in kbuild only exposes an
      error that has been in the code since 2005 when this source file was
      created.  That was with commit 9994a338 ("powerpc: Introduce
      entry_{32,64}.S, misc_{32,64}.S, systbl.S").
      
      The offending line does not make a lot of sense.  This error does not
      seem to cause any problems in the executable, thus I am not recommending
      that this fix be applied to any stable versions.
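
The message does not quote the offending line, so the following is only a generic illustration of what -Wundef complains about, not the change made to misc_32.S. CONFIG_EXAMPLE_FEATURE, FEATURE_ENABLED and feature_enabled() are hypothetical names.

```c
/* CONFIG_EXAMPLE_FEATURE is a hypothetical option, deliberately left undefined. */

#if defined(CONFIG_EXAMPLE_FEATURE)   /* quiet: only asks whether the macro exists */
#define FEATURE_ENABLED 1
#else
#define FEATURE_ENABLED 0
#endif

/*
 * By contrast, "#if CONFIG_EXAMPLE_FEATURE" evaluates the undefined name as 0.
 * The result is the same, but -Wundef reports it.
 */
static int feature_enabled(void)
{
    return FEATURE_ENABLED;
}
```

Because an undefined identifier evaluates to 0 in a preprocessor conditional, both spellings select the same branch, which is consistent with the observation that the generated executable is unaffected.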
      
      Thanks to Nicholas Piggin for suggesting this solution.
      
      Fixes: 9994a338 ("powerpc: Introduce entry_{32,64}.S, misc_{32,64}.S, systbl.S")
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 25 Dec, 2016 5 commits
    • avoid spurious "may be used uninitialized" warning · d33d5a6c
      Linus Torvalds authored
      The timer type simplifications caused a new gcc warning:
      
        drivers/base/power/domain.c: In function ‘genpd_runtime_suspend’:
        drivers/base/power/domain.c:562:14: warning: ‘time_start’ may be used uninitialized in this function [-Wmaybe-uninitialized]
           elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), time_start));
      
      despite the actual use of "time_start" not having changed in any way.
      It appears that simply changing the type of ktime_t from a union to a
      plain scalar type made gcc check the use.
      
      The variable wasn't actually used uninitialized, but gcc apparently
      failed to notice that the conditional around the use was exactly the
      same as the conditional around the initialization of that variable.
      
      Add an unnecessary initialization just to shut up the compiler.
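
The pattern that trips the warning can be sketched as follows. This is a minimal stand-alone model, not the code in drivers/base/power/domain.c: fake_now_ns() and timed_work() are hypothetical, and int64_t stands in for the scalar ktime_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clock: returns a monotonically increasing fake timestamp. */
static int64_t fake_now_ns(void)
{
    static int64_t t;
    return t += 100;
}

/* Returns elapsed nanoseconds when measuring is enabled, otherwise 0. */
static int64_t timed_work(bool measure)
{
    int64_t time_start = 0;   /* redundant, but silences -Wmaybe-uninitialized */
    int64_t elapsed_ns = 0;

    if (measure)
        time_start = fake_now_ns();

    /* ... the work being timed would go here ... */

    if (measure)              /* same condition as the initialization above */
        elapsed_ns = fake_now_ns() - time_start;

    return elapsed_ns;
}
```

The read of time_start is guarded by the same condition as its assignment, so the extra "= 0" never changes the result. It only removes the path on which the compiler believes the variable might be read before being written.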

      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3ddc76df
      Linus Torvalds authored
      Pull timer type cleanups from Thomas Gleixner:
       "This series does a tree wide cleanup of types related to
        timers/timekeeping.
      
         - Get rid of cycles_t and use a plain u64. The type is not really
           helpful and caused more confusion than clarity
      
         - Get rid of the ktime union. The union has become useless as we use
           the scalar nanoseconds storage unconditionally now. The 32bit
           timespec alike storage got removed due to the Y2038 limitations
           some time ago.
      
           That leaves the odd union access around for no reason. Clean it up.
      
        Both changes have been done with coccinelle and a small amount of
        manual mopping up"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ktime: Get rid of ktime_equal()
        ktime: Cleanup ktime_set() usage
        ktime: Get rid of the union
        clocksource: Use a plain u64 instead of cycle_t
    • Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b272f732
      Linus Torvalds authored
      Pull SMP hotplug notifier removal from Thomas Gleixner:
       "This is the final cleanup of the hotplug notifier infrastructure. The
        series has been reintegrated in the last two days because there came a
        new driver using the old infrastructure via the SCSI tree.
      
        Summary:
      
         - convert the last leftover drivers utilizing notifiers
      
         - fixup for a completely broken hotplug user
      
         - prevent setup of already used states
      
         - removal of the notifiers
      
         - treewide cleanup of hotplug state names
      
         - consolidation of state space
      
        There is a sphinx based documentation pending, but that needs review
        from the documentation folks"
      
      * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/armada-xp: Consolidate hotplug state space
        irqchip/gic: Consolidate hotplug state space
        coresight/etm3/4x: Consolidate hotplug state space
        cpu/hotplug: Cleanup state names
        cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
        staging/lustre/libcfs: Convert to hotplug state machine
        scsi/bnx2i: Convert to hotplug state machine
        scsi/bnx2fc: Convert to hotplug state machine
        cpu/hotplug: Prevent overwriting of callbacks
        x86/msr: Remove bogus cleanup from the error path
        bus: arm-ccn: Prevent hotplug callback leak
        perf/x86/intel/cstate: Prevent hotplug callback leak
        ARM/imx/mmcd: Fix broken cpu hotplug handling
        scsi: qedi: Convert to hotplug state machine
    • Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux · 10bbe759
      Linus Torvalds authored
      Pull turbostat updates from Len Brown.
      
      * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
        tools/power turbostat: remove obsolete -M, -m, -C, -c options
        tools/power turbostat: Make extensible via the --add parameter
        tools/power turbostat: Denverton uses a 25 MHz crystal, not 19.2 MHz
        tools/power turbostat: line up headers when -M is used
        tools/power turbostat: fix SKX PKG_CSTATE_LIMIT decoding
        tools/power turbostat: Support Knights Mill (KNM)
        tools/power turbostat: Display HWP OOB status
        tools/power turbostat: fix Denverton BCLK
        tools/power turbostat: use intel-family.h model strings
        tools/power/turbostat: Add Denverton RAPL support
        tools/power/turbostat: Add Denverton support
        tools/power/turbostat: split core MSR support into status + limit
        tools/power turbostat: fix error case overflow read of slm_freq_table[]
        tools/power turbostat: Allocate correct amount of fd and irq entries
        tools/power turbostat: switch to tab delimited output
        tools/power turbostat: Gracefully handle ACPI S3
        tools/power turbostat: tidy up output on Joule counter overflow
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Nicholas Piggin authored
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
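
The idea of keeping the lock bit and the waiters bit in one word can be sketched in user-space C. This is a sketch under stated assumptions, not the kernel implementation: C11 atomics stand in for the kernel's bit operations, the bit positions and the names page_sketch, mark_waiter(), trylock() and unlock_needs_wakeup() are hypothetical, and the waitqueue itself (including clearing the waiters bit under its lock) is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LOCKED  (1u << 0)   /* hypothetical bit positions */
#define PG_WAITERS (1u << 1)

/* Hypothetical stand-in for a page's flags word. */
struct page_sketch {
    atomic_uint flags;
};

/* A sleeper marks itself; in the kernel this happens under the waitqueue lock. */
static void mark_waiter(struct page_sketch *p)
{
    atomic_fetch_or(&p->flags, PG_WAITERS);
}

/* Try to take the page lock, returning true on success. */
static bool trylock(struct page_sketch *p)
{
    return !(atomic_fetch_or(&p->flags, PG_LOCKED) & PG_LOCKED);
}

/*
 * Clear the lock bit and report whether the waitqueue must be consulted.
 * Both bits live in one word, so a single atomic operation answers that
 * question without loading the waitqueue's cache line.
 */
static bool unlock_needs_wakeup(struct page_sketch *p)
{
    unsigned int old = atomic_fetch_and(&p->flags, ~PG_LOCKED);
    return old & PG_WAITERS;
}
```

In the common uncontended case unlock_needs_wakeup() returns false and the unlock path never touches the waitqueue, which is where the saved cache-line load comes from.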

      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>