1. 16 May, 2018 2 commits
    • Sagi Grimberg's avatar
      IB/device: Convert ib-comp-wq to be CPU-bound · 1899f679
      Sagi Grimberg authored
      commit b7363e67 upstream.
      
      This workqueue is used by our storage target mode ULPs
      via the new CQ API. Recent observations when working
      with very high-end flash storage devices reveal that
      UNBOUND workqueue threads can migrate between cpu cores
      and even numa nodes (although some numa locality is accounted
      for).
      
      While this attribute can be useful in some workloads,
      it does not fit in very nicely with the normal
      run-to-completion model we usually use in our target-mode
      ULPs and the block-mq irq<->cpu affinity facilities.
      
      The whole block-mq concept is that the completion will
      land on the same cpu where the submission was performed.
      The fact that our submitter thread is migrating cpus
      can break this locality.
      
      We assume that as a target mode ULP, we will serve multiple
      initiators/clients and we can spread the load enough without
      having to use unbound kworkers.
      
      Also, while we're at it, expose this workqueue via sysfs which
      is harmless and can be useful for debug.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>--
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Raju  Rangoju <rajur@chelsio.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1899f679
    • Julian Anastasov's avatar
      ipvs: fix rtnl_lock lockups caused by start_sync_thread · 83797a77
      Julian Anastasov authored
      commit 5c64576a upstream.
      
      syzkaller reports for wrong rtnl_lock usage in sync code [1] and [2]
      
      We have 2 problems in start_sync_thread if error path is
      taken, eg. on memory allocation error or failure to configure
      sockets for mcast group or addr/port binding:
      
      1. recursive locking: holding rtnl_lock while calling sock_release
      which in turn calls again rtnl_lock in ip_mc_drop_socket to leave
      the mcast group, as noticed by Florian Westphal. Additionally,
      sock_release can not be called while holding sync_mutex (ABBA
      deadlock).
      
      2. task hung: holding rtnl_lock while calling kthread_stop to
      stop the running kthreads. As the kthreads do the same to leave
      the mcast group (sock_release -> ip_mc_drop_socket -> rtnl_lock)
      they hang.
      
      Fix the problems by calling rtnl_unlock early in the error path,
      now sock_release is called after unlocking both mutexes.
      
      Problem 3 (task hung reported by syzkaller [2]) is variant of
      problem 2: use _trylock to prevent one user to call rtnl_lock and
      then while waiting for sync_mutex to block kthreads that execute
      sock_release when they are stopped by stop_sync_thread.
      
      [1]
      IPVS: stopping backup sync thread 4500 ...
      WARNING: possible recursive locking detected
      4.16.0-rc7+ #3 Not tainted
      --------------------------------------------
      syzkaller688027/4497 is trying to acquire lock:
        (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      but task is already holding lock:
      IPVS: stopping backup sync thread 4495 ...
        (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(rtnl_mutex);
         lock(rtnl_mutex);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
      2 locks held by syzkaller688027/4497:
        #0:  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
        #1:  (ipvs->sync_mutex){+.+.}, at: [<00000000703f78e3>]
      do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388
      
      stack backtrace:
      CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x194/0x24d lib/dump_stack.c:53
        print_deadlock_bug kernel/locking/lockdep.c:1761 [inline]
        check_deadlock kernel/locking/lockdep.c:1805 [inline]
        validate_chain kernel/locking/lockdep.c:2401 [inline]
        __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431
        lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
        __mutex_lock_common kernel/locking/mutex.c:756 [inline]
        __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
        rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
        ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
        inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
        sock_release+0x8d/0x1e0 net/socket.c:595
        start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
        do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389
        nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
        nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
        ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261
        udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406
        sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975
        SYSC_setsockopt net/socket.c:1849 [inline]
        SyS_setsockopt+0x189/0x360 net/socket.c:1828
        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x42/0xb7
      RIP: 0033:0x446a69
      RSP: 002b:00007fa1c3a64da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000446a69
      RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 00000000006e29fc R08: 0000000000000018 R09: 0000000000000000
      R10: 00000000200000c0 R11: 0000000000000246 R12: 00000000006e29f8
      R13: 00676e697279656b R14: 00007fa1c3a659c0 R15: 00000000006e2b60
      
      [2]
      IPVS: sync thread started: state = BACKUP, mcast_ifn = syz_tun, syncid = 4,
      id = 0
      IPVS: stopping backup sync thread 25415 ...
      INFO: task syz-executor7:25421 blocked for more than 120 seconds.
             Not tainted 4.16.0-rc6+ #284
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor7   D23688 25421   4408 0x00000004
      Call Trace:
        context_switch kernel/sched/core.c:2862 [inline]
        __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440
        schedule+0xf5/0x430 kernel/sched/core.c:3499
        schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777
        do_wait_for_common kernel/sched/completion.c:86 [inline]
        __wait_for_common kernel/sched/completion.c:107 [inline]
        wait_for_common kernel/sched/completion.c:118 [inline]
        wait_for_completion+0x415/0x770 kernel/sched/completion.c:139
        kthread_stop+0x14a/0x7a0 kernel/kthread.c:530
        stop_sync_thread+0x3d9/0x740 net/netfilter/ipvs/ip_vs_sync.c:1996
        do_ip_vs_set_ctl+0x2b1/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2394
        nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
        nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
        ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1253
        sctp_setsockopt+0x2ca/0x63e0 net/sctp/socket.c:4154
        sock_common_setsockopt+0x95/0xd0 net/core/sock.c:3039
        SYSC_setsockopt net/socket.c:1850 [inline]
        SyS_setsockopt+0x189/0x360 net/socket.c:1829
        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x42/0xb7
      RIP: 0033:0x454889
      RSP: 002b:00007fc927626c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00007fc9276276d4 RCX: 0000000000454889
      RDX: 000000000000048c RSI: 0000000000000000 RDI: 0000000000000017
      RBP: 000000000072bf58 R08: 0000000000000018 R09: 0000000000000000
      R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 000000000000051c R14: 00000000006f9b40 R15: 0000000000000001
      
      Showing all locks held in the system:
      2 locks held by khungtaskd/868:
        #0:  (rcu_read_lock){....}, at: [<00000000a1a8f002>]
      check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
        #0:  (rcu_read_lock){....}, at: [<00000000a1a8f002>] watchdog+0x1c5/0xd60
      kernel/hung_task.c:249
        #1:  (tasklist_lock){.+.+}, at: [<0000000037c2f8f9>]
      debug_show_all_locks+0xd3/0x3d0 kernel/locking/lockdep.c:4470
      1 lock held by rsyslogd/4247:
        #0:  (&f->f_pos_lock){+.+.}, at: [<000000000d8d6983>]
      __fdget_pos+0x12b/0x190 fs/file.c:765
      2 locks held by getty/4338:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      2 locks held by getty/4339:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      2 locks held by getty/4340:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      2 locks held by getty/4341:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      2 locks held by getty/4342:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      2 locks held by getty/4343:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      2 locks held by getty/4344:
        #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
      ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
        #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
      n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
      3 locks held by kworker/0:5/6494:
        #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
      [<00000000a062b18e>] work_static include/linux/workqueue.h:198 [inline]
        #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
      [<00000000a062b18e>] set_work_data kernel/workqueue.c:619 [inline]
        #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
      [<00000000a062b18e>] set_work_pool_and_clear_pending kernel/workqueue.c:646
      [inline]
        #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
      [<00000000a062b18e>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
        #1:  ((addr_chk_work).work){+.+.}, at: [<00000000278427d5>]
      process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
        #2:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      1 lock held by syz-executor7/25421:
        #0:  (ipvs->sync_mutex){+.+.}, at: [<00000000d414a689>]
      do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393
      2 locks held by syz-executor7/25427:
        #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
        #1:  (ipvs->sync_mutex){+.+.}, at: [<00000000e6d48489>]
      do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388
      1 lock held by syz-executor7/25435:
        #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      1 lock held by ipvs-b:2:0/25415:
        #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      Reported-and-tested-by: syzbot+a46d6abf9d56b1365a72@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+5fe074c01b2032ce9618@syzkaller.appspotmail.com
      Fixes: e0b26cc9 ("ipvs: call rtnl_lock early")
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Cc: Zubin Mithra <zsm@chromium.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      83797a77
  2. 09 May, 2018 33 commits
  3. 01 May, 2018 5 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.9.98 · eff40cb1
      Greg Kroah-Hartman authored
      eff40cb1
    • Michael Neuling's avatar
      powerpc/eeh: Fix race with driver un/bind · 80bb480f
      Michael Neuling authored
      commit f0295e04 upstream.
      
      The current EEH callbacks can race with a driver unbind. This can
      result in a backtraces like this:
      
        EEH: Frozen PHB#0-PE#1fc detected
        EEH: PE location: S000009, PHB location: N/A
        CPU: 2 PID: 2312 Comm: kworker/u258:3 Not tainted 4.15.6-openpower1 #2
        Workqueue: nvme-wq nvme_reset_work [nvme]
        Call Trace:
          dump_stack+0x9c/0xd0 (unreliable)
          eeh_dev_check_failure+0x420/0x470
          eeh_check_failure+0xa0/0xa4
          nvme_reset_work+0x138/0x1414 [nvme]
          process_one_work+0x1ec/0x328
          worker_thread+0x2e4/0x3a8
          kthread+0x14c/0x154
          ret_from_kernel_thread+0x5c/0xc8
        nvme nvme1: Removing after probe failure status: -19
        <snip>
        cpu 0x23: Vector: 300 (Data Access) at [c000000ff50f3800]
            pc: c0080000089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
            lr: c000000000026564: eeh_report_error+0xe0/0x110
            sp: c000000ff50f3a80
           msr: 9000000000009033
           dar: 400
         dsisr: 40000000
          current = 0xc000000ff507c000
          paca    = 0xc00000000fdc9d80   softe: 0        irq_happened: 0x01
            pid   = 782, comm = eehd
        Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 2017.11.2-00008-g4b6188e)) #2 SM                                             P Tue Feb 27 12:33:27 PST 2018
        enter ? for help
          eeh_report_error+0xe0/0x110
          eeh_pe_dev_traverse+0xc0/0xdc
          eeh_handle_normal_event+0x184/0x4c4
          eeh_handle_event+0x30/0x288
          eeh_event_handler+0x124/0x170
          kthread+0x14c/0x154
          ret_from_kernel_thread+0x5c/0xc8
      
      The first part is an EEH (on boot), the second half is the resulting
      crash. nvme probe starts the nvme_reset_work() worker thread. This
      worker thread starts touching the device which see a device error
      (EEH) and hence queues up an event in the powerpc EEH worker
      thread. nvme_reset_work() then continues and runs
      nvme_remove_dead_ctrl_work() which results in unbinding the driver
      from the device and hence releases all resources. At the same time,
      the EEH worker thread starts doing the EEH .error_detected() driver
      callback, which no longer works since the resources have been freed.
      
      This fixes the problem in the same way the generic PCIe AER code (in
      drivers/pci/pcie/aer/aerdrv_core.c) does. It makes the EEH code hold
      the device_lock() while performing the driver EEH callbacks and
      associated code. This ensures either the callbacks are no longer
      register, or if they are registered the driver will not be removed
      from underneath us.
      
      This has been broken forever. The EEH call backs were first introduced
      in 2005 (in 77bd7415) but it's not clear if a lock was needed back
      then.
      
      Fixes: 77bd7415 ("[PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines")
      Cc: stable@vger.kernel.org # v2.6.16+
      Signed-off-by: default avatarMichael Neuling <mikey@neuling.org>
      Reviewed-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      80bb480f
    • Borislav Petkov's avatar
      x86/microcode/intel: Save microcode patch unconditionally · c11a6ed5
      Borislav Petkov authored
      commit 84749d83 upstream.
      
      save_mc_for_early() was a no-op on !CONFIG_HOTPLUG_CPU but the
      generic_load_microcode() path saves the microcode patches it has found into
      the cache of patches which is used for late loading too. Regardless of
      whether CPU hotplug is used or not.
      
      Make the saving unconditional so that late loading can find the proper
      patch.
      Reported-by: default avatarVitezslav Samel <vitezslav@samel.cz>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarVitezslav Samel <vitezslav@samel.cz>
      Tested-by: default avatarAshok Raj <ashok.raj@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz
      Link: https://lkml.kernel.org/r/20180421081930.15741-1-bp@alien8.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c11a6ed5
    • Yazen Ghannam's avatar
      x86/smpboot: Don't use mwait_play_dead() on AMD systems · 09e43b9b
      Yazen Ghannam authored
      commit da6fa7ef upstream.
      
      Recent AMD systems support using MWAIT for C1 state. However, MWAIT will
      not allow deeper cstates than C1 on current systems.
      
      play_dead() expects to use the deepest state available.  The deepest state
      available on AMD systems is reached through SystemIO or HALT. If MWAIT is
      available, it is preferred over the other methods, so the CPU never reaches
      the deepest possible state.
      
      Don't try to use MWAIT to play_dead() on AMD systems. Instead, use CPUIDLE
      to enter the deepest state advertised by firmware. If CPUIDLE is not
      available then fallback to HALT.
      Signed-off-by: default avatarYazen Ghannam <yazen.ghannam@amd.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Link: https://lkml.kernel.org/r/20180403140228.58540-1-Yazen.Ghannam@amd.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      09e43b9b
    • Arnd Bergmann's avatar
      x86/ipc: Fix x32 version of shmid64_ds and msqid64_ds · 41b0aa3c
      Arnd Bergmann authored
      commit 1a512c08 upstream.
      
      A bugfix broke the x32 shmid64_ds and msqid64_ds data structure layout
      (as seen from user space)  a few years ago: Originally, __BITS_PER_LONG
      was defined as 64 on x32, so we did not have padding after the 64-bit
      __kernel_time_t fields, After __BITS_PER_LONG got changed to 32,
      applications would observe extra padding.
      
      In other parts of the uapi headers we seem to have a mix of those
      expecting either 32 or 64 on x32 applications, so we can't easily revert
      the path that broke these two structures.
      
      Instead, this patch decouples x32 from the other architectures and moves
      it back into arch specific headers, partially reverting the even older
      commit 73a2d096 ("x86: remove all now-duplicate header files").
      
      It's not clear whether this ever made any difference, since at least
      glibc carries its own (correct) copy of both of these header files,
      so possibly no application has ever observed the definitions here.
      
      Based on a suggestion from H.J. Lu, I tried out the tool from
      https://github.com/hjl-tools/linux-header to find other such
      bugs, which pointed out the same bug in statfs(), which also has
      a separate (correct) copy in glibc.
      
      Fixes: f4b4aae1 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: "H . J . Lu" <hjl.tools@gmail.com>
      Cc: Jeffrey Walton <noloader@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180424212013.3967461-1-arnd@arndb.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41b0aa3c