1. 08 Aug, 2019 9 commits
    • Peter Zijlstra's avatar
      sched/fair: Expose newidle_balance() · 5ba553ef
      Peter Zijlstra authored
      For pick_next_task_fair() it is the newidle balance that requires
      dropping the rq->lock; provided we do put_prev_task() early, we can
      also detect the condition for doing newidle early.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com
      5ba553ef
    • Peter Zijlstra's avatar
      sched: Add task_struct pointer to sched_class::set_curr_task · 03b7fad1
      Peter Zijlstra authored
      In preparation of further separating pick_next_task() and
      set_curr_task() we have to pass the actual task into it, while there,
      rename the thing to better pair with put_prev_task().
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
      03b7fad1
    • Peter Zijlstra's avatar
      sched: Rework CPU hotplug task selection · 10e7071b
      Peter Zijlstra authored
      The CPU hotplug task selection is the only place where we used
      put_prev_task() on a task that is not current. While looking at that,
      it occured to me that we can simplify all that by by using a custom
      pick loop.
      
      Since we don't need to put current, we can do away with the fake task
      too.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      10e7071b
    • Peter Zijlstra's avatar
      sched/{rt,deadline}: Fix set_next_task vs pick_next_task · f95d4eae
      Peter Zijlstra authored
      Because pick_next_task() implies set_curr_task() and some of the
      details haven't mattered too much, some of what _should_ be in
      set_curr_task() ended up in pick_next_task, correct this.
      
      This prepares the way for a pick_next_task() variant that does not
      affect the current state; allowing remote picking.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
      f95d4eae
    • Peter Zijlstra's avatar
      sched: Fix kerneldoc comment for ia64_set_curr_task · 5feeb783
      Peter Zijlstra authored
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com
      5feeb783
    • Peter Zijlstra's avatar
      stop_machine: Fix stop_cpus_in_progress ordering · 99d84bf8
      Peter Zijlstra authored
      Make sure the entire for loop has stop_cpus_in_progress set.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
      99d84bf8
    • Dave Chiluk's avatar
      sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · de53fd7a
      Dave Chiluk authored
      It has been observed, that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
      This has been root caused to cpu-local run queue being allocated per cpu
      bandwidth slices, and then not fully using that slice within the period.
      At which point the slice and quota expires. This expiration of unused
      slice results in applications not being able to utilize the quota for
      which they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. However, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow would be bounded by the
      remaining quota left on each per-cpu runqueueu. This is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: default avatarDave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarPhil Auld <pauld@redhat.com>
      Reviewed-by: default avatarBen Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      de53fd7a
    • Peter Zijlstra's avatar
      sched: Clean up active_mm reference counting · 139d025c
      Peter Zijlstra authored
      The current active_mm reference counting is confusing and sub-optimal.
      
      Rewrite the code to explicitly consider the 4 separate cases:
      
          user -> user
      
      	When switching between two user tasks, all we need to consider
      	is switch_mm().
      
          user -> kernel
      
      	When switching from a user task to a kernel task (which
      	doesn't have an associated mm) we retain the last mm in our
      	active_mm. Increment a reference count on active_mm.
      
        kernel -> kernel
      
      	When switching between kernel threads, all we need to do is
      	pass along the active_mm reference.
      
        kernel -> user
      
      	When switching between a kernel and user task, we must switch
      	from the last active_mm to the next mm, hoping of course that
      	these are the same. Decrement a reference on the active_mm.
      
      The code keeps a different order, because as you'll note, both 'to
      user' cases require switch_mm().
      
      And where the old code would increment/decrement for the 'kernel ->
      kernel' case, the new code observes this is a neutral operation and
      avoids touching the reference count.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Reviewed-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: luto@kernel.org
      139d025c
    • Peter Zijlstra's avatar
      rcu/tree: Fix SCHED_FIFO params · 130d9c33
      Peter Zijlstra authored
      A rather embarrasing mistake had us call sched_setscheduler() before
      initializing the parameters passed to it.
      
      Fixes: 1a763fd7 ("rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region")
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      130d9c33
  2. 25 Jul, 2019 23 commits
  3. 22 Jul, 2019 8 commits
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7b5cf701
      Linus Torvalds authored
      Pull preemption Kconfig fix from Thomas Gleixner:
       "The PREEMPT_RT stub config renamed PREEMPT to PREEMPT_LL and defined
        PREEMPT outside of the menu and made it selectable by both PREEMPT_LL
        and PREEMPT_RT.
      
        Stupid me missed that 114 defconfigs select CONFIG_PREEMPT which
        obviously can't work anymore. oldconfig builds are affected as well,
        but it's more obvious as the user gets asked. [old]defconfig silently
        fixes it up and selects PREEMPT_NONE.
      
        Unbreak it by undoing the rename and adding a intermediate config
        symbol which is selected by both PREEMPT and PREEMPT_RT. That requires
        to chase down a few #ifdefs, but it's better than tweaking 114
        defconfigs and annoying users"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y
      7b5cf701
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20190722' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 44b912cd
      Linus Torvalds authored
      Pull pidfd polling fix from Christian Brauner:
       "A fix for pidfd polling. It ensures that the task's exit state is
        visible to all waiters"
      
      * tag 'for-linus-20190722' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        pidfd: fix a poll race when setting exit_state
      44b912cd
    • Linus Torvalds's avatar
      Merge tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 21c730d7
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fixes for leaks caused by recently merged patches
      
       - one build fix
      
       - a fix to prevent mixing of incompatible features
      
      * tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: don't leak extent_map in btrfs_get_io_geometry()
        btrfs: free checksum hash on in close_ctree
        btrfs: Fix build error while LIBCRC32C is module
        btrfs: inode: Don't compress if NODATASUM or NODATACOW set
      21c730d7
    • Thomas Gleixner's avatar
      sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y · b8d33498
      Thomas Gleixner authored
      The merge of the CONFIG_PREEMPT_RT stub renamed CONFIG_PREEMPT to
      CONFIG_PREEMPT_LL which causes all defconfigs which have CONFIG_PREEMPT=y
      set to fall back to CONFIG_PREEMPT_NONE because CONFIG_PREEMPT depends on
      the preemption mode choice wich defaults to NONE. This also affects
      oldconfig builds.
      
      So rather than changing 114 defconfig files and being an annoyance to
      users, revert the rename and select a new config symbol PREEMPTION. That
      keeps everything working smoothly and the revelant ifdef's are going to be
      fixed up step by step.
      Reported-by: default avatarMark Rutland <mark.rutland@arm.com>
      Fixes: a50a3f4b ("sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT")
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      b8d33498
    • Linus Torvalds's avatar
      Merge tag 'media/v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · c92f0380
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "For two regressions in media core:
      
         - v4l2-subdev: fix regression in check_pad()
      
         - videodev2.h: change V4L2_PIX_FMT_BGRA444 define: fourcc was already
           in use"
      
      * tag 'media/v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: videodev2.h: change V4L2_PIX_FMT_BGRA444 define: fourcc was already in use
        media: v4l2-subdev: fix regression in check_pad()
      c92f0380
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 83768245
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Several netfilter fixes including a nfnetlink deadlock fix from
          Florian Westphal and fix for dropping VRF packets from Miaohe Lin.
      
       2) Flow offload fixes from Pablo Neira Ayuso including a fix to restore
          proper block sharing.
      
       3) Fix r8169 PHY init from Thomas Voegtle.
      
       4) Fix memory leak in mac80211, from Lorenzo Bianconi.
      
       5) Missing NULL check on object allocation in cxgb4, from Navid
          Emamdoost.
      
       6) Fix scaling of RX power in sfp phy driver, from Andrew Lunn.
      
       7) Check that there is actually an ip header to access in skb->data in
          VRF, from Peter Kosyh.
      
       8) Remove spurious rcu unlock in hv_netvsc, from Haiyang Zhang.
      
       9) One more tweak the the TCP fragmentation memory limit changes, to be
          less harmful to applications setting small SO_SNDBUF values. From
          Eric Dumazet.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
        tcp: be more careful in tcp_fragment()
        hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
        vrf: make sure skb->data contains ip header to make routing
        connector: remove redundant input callback from cn_dev
        qed: Prefer pcie_capability_read_word()
        igc: Prefer pcie_capability_read_word()
        cxgb4: Prefer pcie_capability_read_word()
        be2net: Synchronize be_update_queues with dev_watchdog
        bnx2x: Prevent load reordering in tx completion processing
        net: phy: sfp: hwmon: Fix scaling of RX power
        net: sched: verify that q!=NULL before setting q->flags
        chelsio: Fix a typo in a function name
        allocate_flower_entry: should check for null deref
        net: hns3: typo in the name of a constant
        kbuild: add net/netfilter/nf_tables_offload.h to header-test blacklist.
        tipc: Fix a typo
        mac80211: don't warn about CW params when not using them
        mac80211: fix possible memory leak in ieee80211_assign_beacon
        nl80211: fix NL80211_HE_MAX_CAPABILITY_LEN
        nl80211: fix VENDOR_CMD_RAW_DATA
        ...
      83768245
    • Suren Baghdasaryan's avatar
      pidfd: fix a poll race when setting exit_state · b191d649
      Suren Baghdasaryan authored
      There is a race between reading task->exit_state in pidfd_poll and
      writing it after do_notify_parent calls do_notify_pidfd. Expected
      sequence of events is:
      
      CPU 0                            CPU 1
      ------------------------------------------------
      exit_notify
        do_notify_parent
          do_notify_pidfd
        tsk->exit_state = EXIT_DEAD
                                        pidfd_poll
                                           if (tsk->exit_state)
      
      However nothing prevents the following sequence:
      
      CPU 0                            CPU 1
      ------------------------------------------------
      exit_notify
        do_notify_parent
          do_notify_pidfd
                                         pidfd_poll
                                            if (tsk->exit_state)
        tsk->exit_state = EXIT_DEAD
      
      This causes a polling task to wait forever, since poll blocks because
      exit_state is 0 and the waiting task is not notified again. A stress
      test continuously doing pidfd poll and process exits uncovered this bug.
      
      To fix it, we make sure that the task's exit_state is always set before
      calling do_notify_pidfd.
      
      Fixes: b53b0b9d ("pidfd: add polling support")
      Cc: kernel-team@android.com
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Signed-off-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
      [christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
      Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
      b191d649
    • Eric Dumazet's avatar
      tcp: be more careful in tcp_fragment() · b617158d
      Eric Dumazet authored
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds the some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15
      
      Note for < 4.15 backports :
       tcp_rtx_queue_tail() will probably look like :
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2a ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAndrew Prout <aprout@ll.mit.edu>
      Tested-by: default avatarAndrew Prout <aprout@ll.mit.edu>
      Tested-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarChristoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b617158d