1. 13 Jun, 2022 10 commits
    • sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle · f3dd3f67
      Tianchen Ding authored
      Wakelist can help avoid cache bouncing and offload the overhead of waker
      cpu. So far, using wakelist within the same llc only happens on
      WF_ON_CPU, and this limitation could be removed to further improve
      wakeup performance.
      
      The commit 518cd623 ("sched: Only queue remote wakeups when
      crossing cache boundaries") disabled queuing tasks on wakelist when
      the cpus share llc. This is because, at that time, the scheduler must
      send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
      supports TIF_POLLING, so this is not a problem now when the wakee cpu is
      in idle polling.
      
      Benefits:
        Queuing the task on an idle cpu can help improve performance on the
        waker cpu and utilization on the wakee cpu, and further improves
        locality because the wakee cpu can handle its own rq. This patch helps
        improve response time (rt) on our real Java workloads, where wakeups
        happen frequently.
      
        Consider the normal condition (CPU0 and CPU1 share the same llc)
        Before this patch:
      
               CPU0                                       CPU1
      
          select_task_rq()                                idle
          rq_lock(CPU1->rq)
          enqueue_task(CPU1->rq)
          notify CPU1 (by sending IPI or CPU1 polling)
      
                                                          resched()
      
        After this patch:
      
               CPU0                                       CPU1
      
          select_task_rq()                                idle
          add to wakelist of CPU1
          notify CPU1 (by sending IPI or CPU1 polling)
      
                                                          rq_lock(CPU1->rq)
                                                          enqueue_task(CPU1->rq)
                                                          resched()
      
        CPU0 can finish its work earlier: it only needs to put the task on the
        wakelist and return. CPU1 is idle anyway, so it can handle its own
        runqueue data.
      
      This patch does not change IPI behaviour.
        This patch only takes effect when the wakee cpu is:
        1) idle polling
        2) idle not polling
      
        For 1), there will be no IPI with or without this patch.
      
        For 2), there will always be an IPI before or after this patch.
        Before this patch: waker cpu will enqueue task and check preempt. Since
        "idle" will be sure to be preempted, waker cpu must send a resched IPI.
        After this patch: waker cpu will put the task to the wakelist of wakee
        cpu, and send an IPI.
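
The before/after decision described above can be sketched as a small standalone C model. This is an illustration under stated assumptions, not the kernel code: struct wakee_cpu_model and both function names are hypothetical, "idle" is approximated as an empty runqueue, and the "before" rule for CPUs sharing an LLC follows the WF_ON_CPU limitation described in this message.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the wakee CPU; not a kernel structure. */
struct wakee_cpu_model {
	bool shares_llc_with_waker;	/* waker and wakee share a last-level cache */
	unsigned int nr_running;	/* runnable tasks on the wakee runqueue */
};

/* Before: inside one LLC, the wakelist was used only for WF_ON_CPU wakeups. */
bool use_wakelist_before(const struct wakee_cpu_model *c, bool wf_on_cpu)
{
	if (!c->shares_llc_with_waker)
		return true;		/* crossing cache boundaries: always queue */
	return wf_on_cpu && c->nr_running == 0;
}

/* After: an idle wakee in the same LLC qualifies regardless of WF_ON_CPU. */
bool use_wakelist_after(const struct wakee_cpu_model *c)
{
	if (!c->shares_llc_with_waker)
		return true;
	return c->nr_running == 0;
}
```

In this model the only case that changes is an idle wakee in the same LLC that is not descheduling the woken task: it is rejected by use_wakelist_before() and accepted by use_wakelist_after().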
      
      Benchmark:
      We've tested schbench, unixbench, and hackbench on both x86 and arm64.
      
      On x86 (Intel Xeon Platinum 8269CY):
        schbench -m 2 -t 8
      
          Latency percentiles (usec)              before        after
              50.0000th:                             8            6
              75.0000th:                            10            7
              90.0000th:                            11            8
              95.0000th:                            12            8
              *99.0000th:                           13           10
              99.5000th:                            15           11
              99.9000th:                            18           14
      
        Unixbench with full threads (104)
                                                  before        after
          Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
          Double-Precision Whetstone              617119.3      617298.5   0.03%
          Execl Throughput                         27667.3       27627.3  -0.14%
          File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
          File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
          File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
          Pipe Throughput                      145535622.8   145323033.2  -0.15%
          Pipe-based Context Switching           3221686.4     3583975.4  11.25%
          Process Creation                        101347.1      103345.4   1.97%
          Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
          Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
          System Call Overhead                   5300604.8     5312213.6   0.22%
      
        hackbench -g 1 -l 100000
                                                  before        after
          Time                                     3.246        2.251
      
      On arm64 (Ampere Altra):
        schbench -m 2 -t 8
      
          Latency percentiles (usec)              before        after
              50.0000th:                            14           10
              75.0000th:                            19           14
              90.0000th:                            22           16
              95.0000th:                            23           16
              *99.0000th:                           24           17
              99.5000th:                            24           17
              99.9000th:                            28           25
      
        Unixbench with full threads (80)
                                                  before        after
          Dhrystone 2 using register variables  3536194249    3537019613   0.02%
          Double-Precision Whetstone              629383.6      629431.6   0.01%
          Execl Throughput                         65920.5       65846.2  -0.11%
          File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
          File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
          File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
          Pipe Throughput                      133542875.3   131619389.8  -1.44%
          Pipe-based Context Switching           3215356.1     3576945.1  11.25%
          Process Creation                        108520.5      120184.6  10.75%
          Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
          Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
          System Call Overhead                   4429998.9     44350061.7   0.11%
      
        hackbench -g 1 -l 100000
                                                  before        after
          Time                                     4.217        2.916
      
      The patch improves schbench, hackbench and the Pipe-based Context
      Switching test of unixbench when idle cpus exist, with no obvious
      regression on the other unixbench tests. This can help improve response
      time (rt) in scenarios where wakeups happen frequently.
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
    • sched: Fix the check of nr_running at queue wakelist · 28156108
      Tianchen Ding authored
      The commit 2ebb1771 ("sched/core: Offload wakee task activation if it
      the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
      stacking when WF_ON_CPU.
      
      Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
      (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
      the deactivate_task() in __schedule(), thus p has been accounted out of
      rq->nr_running. As such, the task being the only runnable task on the rq
      implies reading rq->nr_running == 0 at that point.
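
The accounting argument above can be replayed with a small standalone C model. It is only a sketch: the struct and function names are hypothetical, begin_deschedule() stands in for the deactivate_task() step in __schedule(), and memory ordering is not modelled at all.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the runqueue and the task being woken. */
struct rq_model { unsigned int nr_running; };
struct task_model { bool on_rq; bool on_cpu; };

/*
 * Models the deactivate step: p leaves the runqueue accounting while it is
 * still marked as running on the CPU (on_cpu stays true until later).
 */
void begin_deschedule(struct rq_model *rq, struct task_model *p)
{
	p->on_rq = false;
	rq->nr_running--;
}

/* Old condition: also true when one *other* task is still runnable. */
bool check_nr_running_old(const struct rq_model *rq)
{
	return rq->nr_running <= 1;
}

/* Fixed condition: true only if p was the sole runnable task. */
bool check_nr_running_new(const struct rq_model *rq)
{
	return rq->nr_running == 0;
}
```

With two runnable tasks before the deschedule, nr_running reads 1 once p has been accounted out; the old condition would still pass and stack the wakee behind the other task, while the fixed condition does not.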
      
      The benchmark result is in [1].
      
      [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

      Suggested-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
    • sched: Allow newidle balancing to bail out of load_balance · 792b9f65
      Josh Don authored
      While doing newidle load balancing, it is possible for new tasks to
      arrive, such as with pending wakeups. newidle_balance() already accounts
      for this by exiting the sched_domain load_balance() iteration if it
      detects these cases. This is very important for minimizing wakeup
      latency.
      
      However, if we are already in load_balance(), we may stay there for a
      while before returning to newidle_balance(). This is most
      exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
      very straightforward workaround to this is to adjust should_we_balance()
      to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
      detected.
      
      This was tested with the following reproduction:
      - two threads that take turns sleeping and waking each other up are
        affined to two cores
      - a large number of threads with 100% utilization are pinned to all
        other cores
      
      Without this patch, wakeup latency was ~120us for the pair of threads,
      almost entirely spent in load_balance(). With this patch, wakeup latency
      is ~6us.
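
The bail-out condition can be sketched as a standalone C model. The names below are hypothetical stand-ins (the nr_running and ttwu_pending fields model "tasks already present" and "pending wakeups"), and the other checks that should_we_balance() performs are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical mirror of the balance types relevant here. */
enum idle_type_model {
	MODEL_CPU_IDLE,
	MODEL_CPU_NOT_IDLE,
	MODEL_CPU_NEWLY_IDLE
};

/* Hypothetical view of the runqueue that would receive pulled tasks. */
struct dst_rq_model {
	unsigned int nr_running;	/* tasks that have already arrived */
	bool ttwu_pending;		/* a wakeup is queued for this CPU */
};

/*
 * Returns false when a newly idle balance should be abandoned because
 * new work has shown up; all other balance types are passed through.
 */
bool newidle_should_continue(enum idle_type_model idle,
			     const struct dst_rq_model *rq)
{
	if (idle != MODEL_CPU_NEWLY_IDLE)
		return true;	/* decided by checks not modelled here */
	return rq->nr_running == 0 && !rq->ttwu_pending;
}
```

Evaluating this on every pass through load_balance(), including each 'goto redo' iteration, is what would let the CPU leave early instead of finishing the whole balance attempt.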
      Signed-off-by: Josh Don <joshdon@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
    • sched/deadline: Use proc_douintvec_minmax() limit minimum value · 2ed81e76
      Yajun Deng authored
      sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned
      integers, but proc_dointvec() does not return an error even if they are
      set to a negative number.

      Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for
      sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min.

      This only makes the data type and the proc_handler in struct ctl_table
      match. The 'if (period < min || period > max)' check in
      __checkparam_dl() works fine even without this patch.
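
A standalone C sketch of why the handler type matters. Both functions are hypothetical models, not the kernel's proc handlers: the first stores a signed parse result into unsigned storage, the second rejects values outside a [min, max] range before storing.

```c
#include <stdbool.h>

/* Signed-handler model: the parsed int lands in unsigned storage as-is. */
unsigned int store_via_signed_handler(int parsed)
{
	return (unsigned int)parsed;	/* -1 silently becomes a huge value */
}

/* Unsigned min/max model: reject anything outside [min, max]. */
bool store_via_uint_minmax(long long parsed, unsigned int min,
			   unsigned int max, unsigned int *out)
{
	if (parsed < 0)
		return false;
	if ((unsigned long long)parsed < min ||
	    (unsigned long long)parsed > max)
		return false;
	*out = (unsigned int)parsed;	/* only written on success */
	return true;
}
```

In this model, writing -1 through the signed path yields a very large unsigned period, while the range-checked path returns false and leaves the stored value untouched.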
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev
    • sched/fair: Optimize and simplify rq leaf_cfs_rq_list · 51bf903b
      Chengming Zhou authored
      While backporting bugfixes and profiling some tests, we noticed two
      problems with the rq leaf_cfs_rq_list.

      1. cfs_rqs under a throttled subtree can be added to the list, which
         also puts their fully decayed ancestors on the list, even though
         neither is needed.

      2. #1 also makes leaf_cfs_rq_list management complex and error prone.
         These are the related bugfixes so far:
      
         commit 31bc6aea ("sched/fair: Optimize update_blocked_averages()")
         commit fe61468b ("sched/fair: Fix enqueue_task_fair warning")
         commit b34cb07d ("sched/fair: Fix enqueue_task_fair() warning some more")
         commit 39f23ce0 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
         commit 0258bdfa ("sched/fair: Fix unfairness caused by missing load decay")
         commit a7b359fc ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
         commit fdaba61e ("sched/fair: Ensure that the CFS parent is added after unthrottling")
         commit 2630cde2 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq")
      
      commit 31bc6aea ("sched/fair: Optimize update_blocked_averages()")
      deleted every cfs_rq under a throttled subtree from rq->leaf_cfs_rq_list
      and removed the throttled_hierarchy() test in update_blocked_averages(),
      which optimized update_blocked_averages().

      But the later bugfixes added cfs_rqs under a throttled subtree back to
      rq->leaf_cfs_rq_list, together with their fully decayed ancestors, to
      preserve the integrity of rq->leaf_cfs_rq_list.

      This patch takes another approach: skip all cfs_rqs under a throttled
      hierarchy at list_add_leaf_cfs_rq() time, so that cfs_rqs under a
      throttled subtree stay off the leaf_cfs_rq_list completely.

      As a result, enqueue_entity(), unthrottle_cfs_rq() and
      enqueue_task_fair() no longer need to consider throttling, which
      simplifies the code a lot. It also optimizes update_blocked_averages(),
      since cfs_rqs under a throttled hierarchy and their ancestors won't be
      on the leaf_cfs_rq_list.
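
The skip rule can be illustrated with a tiny standalone C model. All names are hypothetical, the list itself is reduced to an on_list flag, and placing the check inside the add helper (rather than at its call sites) is an assumption made for brevity.

```c
#include <stdbool.h>

/* Hypothetical, reduced cfs_rq: only what the skip rule needs. */
struct cfs_rq_model {
	int throttle_count;	/* > 0 while any ancestor (or self) is throttled */
	bool on_list;		/* stands in for leaf_cfs_rq_list membership */
};

bool throttled_hierarchy_model(const struct cfs_rq_model *c)
{
	return c->throttle_count > 0;
}

/* Returns true if the cfs_rq was put on the list, false if it was skipped. */
bool list_add_leaf_model(struct cfs_rq_model *c)
{
	if (throttled_hierarchy_model(c))
		return false;	/* throttled subtree: stay off the list */
	c->on_list = true;
	return true;
}
```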
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com
    • sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group() · f5b2eeb4
      K Prateek Nayak authored
      In the case of systems containing multiple LLCs per socket, like
      AMD Zen systems, users want to spread bandwidth hungry applications
      across multiple LLCs. Stream is one such representative workload where
      the best performance is obtained by limiting one stream thread per LLC.
      To ensure this, users are known to pin the tasks to a specific subset
      of the CPUs consisting of one CPU per LLC while running such bandwidth
      hungry tasks.
      
      Suppose we kickstart a multi-threaded task like stream with 8 threads
      using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
      server where each socket contains 128 CPUs
      (0-63,128-191 in one socket, 64-127,192-255 in another socket)
      
      Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
      
      Here each CPU in the list is from a different LLC and 4 of those LLCs
      are on one socket, while the other 4 are on another socket.
      
      Ideally we would prefer that each stream thread runs on a different
      CPU from the allowed list of CPUs. However, the current heuristics in
      find_idlest_group() do not allow this during the initial placement.
      
      Suppose the first socket (0-63,128-191) is our local group from which
      we are kickstarting the stream tasks. The first four stream threads
      will be placed in this socket. When it comes to placing the 5th
      thread, all the allowed CPUs from the local group (0,16,32,48)
      will already have been taken.
      
      However, the current scheduler code simply checks if the number of
      tasks in the local group is fewer than the allowed numa-imbalance
      threshold. This threshold was previously 25% of the NUMA domain span
      (in this case threshold = 32) but after the v6 of Mel's patchset
      "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
      Commit: e496132e ("sched/fair: Adjust the allowed NUMA imbalance
      when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
      in the NUMA domain, for processors with multiple LLCs.
      (in this case threshold = 8).
      
      For this example, the number of tasks will always be within threshold
      and thus all the 8 stream threads will be woken up on the first socket
      thereby resulting in sub-optimal performance.
      
      The following sched_wakeup_new tracepoint output shows the initial
      placement of tasks in the current tip/sched/core on the Zen3 machine:
      
      stream-5313    [016] d..2.   627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
      stream-5313    [016] d..2.   627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
      stream-5313    [016] d..2.   627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
      stream-5313    [016] d..2.   627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
      stream-5313    [016] d..2.   627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
      stream-5313    [016] d..2.   627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
      stream-5313    [016] d..2.   627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016
      
      Once the first four threads are distributed among the allowed CPUs of
      socket one, the rest of the threads start piling on these same CPUs
      when clearly there are CPUs on the second socket that can be used.
      
      Following the initial pile up on a small number of CPUs, though the
      load-balancer eventually kicks in, it takes a while to get to {4}{4}
      and even {4}{4} isn't stable as we observe a bunch of ping ponging
      between {4}{4} to {5}{3} and back before a stable state is reached
      much later (1 Stream thread per allowed CPU) and no more migration is
      required.
      
      We can detect this piling and avoid it by checking if the number of
      allowed CPUs in the local group are fewer than the number of tasks
      running in the local group and use this information to spread the
      5th task out into the next socket (after all, the goal in this
      slowpath is to find the idlest group and the idlest CPU during the
      initial placement!).
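
One way to express this check is to clamp the imbalance threshold to the number of allowed CPUs in the local group. The standalone C sketch below does that for the example topology from this message; the function names are hypothetical and the comparison is a simplified model of the placement decision, not the code in find_idlest_group().

```c
#include <stdbool.h>
#include <stddef.h>

/* Socket 0 of the example machine holds CPUs 0-63 and 128-191. */
bool cpu_in_local_socket(int cpu)
{
	return (cpu >= 0 && cpu <= 63) || (cpu >= 128 && cpu <= 191);
}

/* Number of CPUs from the task's affinity list that sit in the local group. */
unsigned int allowed_in_local(const int *allowed, size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++)
		if (cpu_in_local_socket(allowed[i]))
			count++;
	return count;
}

/*
 * True if the next task may stay in the local group. The threshold is the
 * smaller of the NUMA imbalance limit and the allowed local CPU count.
 */
bool keep_local(unsigned int local_running, unsigned int imb_numa_nr,
		unsigned int allowed_local)
{
	unsigned int limit = allowed_local < imb_numa_nr ?
			     allowed_local : imb_numa_nr;

	return local_running + 1 <= limit;
}
```

For the affinity list 0,16,32,48,64,80,96,112 and a threshold of 8, allowed_in_local() gives 4, so keep_local() accepts the first four threads and sends the 5th to the other socket. Passing 8 as allowed_local models the behaviour without the affinity clamp, where the 5th thread stays local.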
      
      The following sched_wakeup_new tracepoint output shows the initial
      placement of tasks after adding this fix on the Zen3 machine:
      
      stream-4485    [016] d..2.   230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
      stream-4485    [016] d..2.   230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
      stream-4485    [016] d..2.   230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
      stream-4485    [016] d..2.   230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
      stream-4485    [016] d..2.   230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
      stream-4485    [016] d..2.   230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
      stream-4485    [016] d..2.   230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064
      
      We see that threads are using all of the allowed CPUs and there is
      no pileup.
      
      No output is generated for tracepoint sched_migrate_task with this
      patch due to a perfect initial placement which removes the need
      for balancing later on - both across NUMA boundaries and within
      NUMA boundaries for stream.
      
      Following are the results from running 8 Stream threads with and
      without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
      
      During the testing of this patch, the tip sched/core was at
      commit: 089c02ae "ftrace: Use preemption model accessors for trace
      header printout"
      
      Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
      
                         5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
                     tip sched/core           tip sched/core            tip sched/core
                       (no pinning)                + pinning              + this-patch
                                                                             + pinning
      
       Copy:   109364.74 (0.00 pct)     94220.50 (-13.84 pct)    158301.28 (44.74 pct)
      Scale:   109670.26 (0.00 pct)     90210.59 (-17.74 pct)    149525.64 (36.34 pct)
        Add:   129029.01 (0.00 pct)    101906.00 (-21.02 pct)    186658.17 (44.66 pct)
      Triad:   127260.05 (0.00 pct)    106051.36 (-16.66 pct)    184327.30 (44.84 pct)
      
      Pinning currently hurts the performance compared to unbound case on
      tip/sched/core. With the addition of this patch, we are able to
      outperform tip/sched/core by a good margin with pinning.
      
      Following are the results from running 16 Stream threads with and
      without pinning on a dual socket IceLake Machine (2 x 32C/64T):
      
      NUMA Topology of Intel Icelake machine:
      Node 1: 0,2,4,6 ... 126 (Even numbers)
      Node 2: 1,3,5,7 ... 127 (Odd numbers)
      
      Pinning is done using: numactl -C 0-15 ./stream16
      
                         5.18.0-rc1               5.18.0-rc1                5.18.0-rc1
                     tip sched/core           tip sched/core            tip sched/core
                       (no pinning)                + pinning              + this-patch
                                                                             + pinning
      
       Copy:    85815.31 (0.00 pct)     149819.21 (74.58 pct)    156807.48 (82.72 pct)
      Scale:    64795.60 (0.00 pct)      97595.07 (50.61 pct)     99871.96 (54.13 pct)
        Add:    71340.68 (0.00 pct)     111549.10 (56.36 pct)    114598.33 (60.63 pct)
      Triad:    68890.97 (0.00 pct)     111635.16 (62.04 pct)    114589.24 (66.33 pct)
      
      On the Icelake machine, with a single LLC per socket, pinning across
      the two sockets reduces cache contention, which shows a great
      improvement in the pinned case that this patch improves further.
      Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
    • sched/numa: Adjust imb_numa_nr to a better approximation of memory channels · 026b98a9
      Mel Gorman authored
      For a single LLC per node, a NUMA imbalance is allowed up until 25%
      of CPUs sharing a node could be active. One intent of the cut-off is
      to avoid an imbalance of memory channels but there is no topological
      information based on active memory channels. Furthermore, there can
      be differences between nodes depending on the number of populated
      DIMMs.
      
      A cut-off of 25% was arbitrary but generally worked. It does have a
      severe corner case, though, when a parallel workload using 25% of all
      available CPUs over-saturates memory channels. This can happen due to the initial
      forking of tasks that get pulled more to one node after early wakeups
      (e.g. a barrier synchronisation) that is not quickly corrected by the
      load balancer. The LB may fail to act quickly as the parallel tasks are
      considered to be poor migrate candidates due to locality or cache hotness.
      
      On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
      assuming all memory channels are populated and is used as the new cut-off
      point. A minimum of 1 is specified to allow a communicating pair to
      remain local even for CPUs with low numbers of cores. Modern AMD CPUs
      have multiple LLCs and are not affected.
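
The arithmetic of the new cut-off can be written out as a short standalone C sketch. The function name and parameters are hypothetical; the single-LLC branch follows the 12.5% rule with a minimum of 1 described here, and the multi-LLC branch assumes the threshold equals the number of LLCs.

```c
/*
 * span_weight: number of CPUs in the NUMA domain.
 * nr_llcs:     number of last-level caches in that domain.
 */
unsigned int imb_numa_nr_model(unsigned int span_weight, unsigned int nr_llcs)
{
	unsigned int imb;

	if (nr_llcs == 1)
		imb = span_weight >> 3;	/* 12.5% of the CPUs (was 25%, >> 2) */
	else
		imb = nr_llcs;		/* multiple LLCs: not affected here */

	return imb > 0 ? imb : 1;	/* keep a communicating pair local */
}
```

For a 64-CPU node with one LLC this gives 8 where the 25% rule gave 16, and a 4-CPU node still gets the minimum of 1.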
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
    • sched/numa: Apply imbalance limitations consistently · cb29a5c1
      Mel Gorman authored
      The imbalance limitations are applied inconsistently at fork time
      and at runtime. At fork, a new task can remain local until there are
      too many running tasks even if the degree of imbalance is larger than
      NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance
      figure used during load balancing is different to the one used at NUMA
      placement. Load balancing uses the number of tasks that must move to
      restore balance, whereas NUMA balancing uses the total imbalance.
      
      In combination, it is possible for a parallel workload that uses a small
      number of CPUs without applying scheduler policies to have very variable
      run-to-run performance.
      
      [lkp@intel.com: Fix build breakage for arc-allyesconfig]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net
    • sched/numa: Do not swap tasks between nodes when spare capacity is available · 13ede331
      Mel Gorman authored
      If a destination node has spare capacity but there is an imbalance then
      two tasks are selected for swapping. If the tasks have no numa group
      or are within the same NUMA group, it's simply shuffling tasks around
      without having any impact on the compute imbalance. Instead, it's just
      punishing one task to help another.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net
    • sched/numa: Initialise numa_migrate_retry · 70ce3ea9
      Mel Gorman authored
      On clone, numa_migrate_retry is inherited from the parent which means
      that the first NUMA placement of a task is non-deterministic. This
      affects when load balancing recognises numa tasks and whether to
      migrate "regular", "remote" or "all" tasks between NUMA scheduler
      domains.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
  2. 12 Jun, 2022 10 commits
  3. 11 Jun, 2022 9 commits
    • Merge tag 'gpio-fixes-for-v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 7a68065e
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
       "A set of fixes. Most address the new warning we emit at build time
        when irq chips are not immutable with some additional tweaks to
        gpio-crystalcove from Andy and a small tweak to gpio-dwapb.
      
         - make irq_chip structs immutable in several Diolan and Intel drivers
           to get rid of the new warning we emit when fiddling with irq chips
      
         - don't print error messages on probe deferral in gpio-dwapb"
      
      * tag 'gpio-fixes-for-v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpio: dwapb: Don't print error on -EPROBE_DEFER
        gpio: dln2: make irq_chip immutable
        gpio: sch: make irq_chip immutable
        gpio: merrifield: make irq_chip immutable
        gpio: wcove: make irq_chip immutable
        gpio: crystalcove: Join function declarations and long lines
        gpio: crystalcove: Use specific type and API for IRQ number
        gpio: crystalcove: make irq_chip immutable
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · cecb3540
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Driver fixes and one core patch.
      
        Nine of the driver patches are minor fixes and reworks to lpfc and the
        rest are trivial and minor fixes elsewhere"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: pmcraid: Fix missing resource cleanup in error case
        scsi: ipr: Fix missing/incorrect resource cleanup in error case
        scsi: mpt3sas: Fix out-of-bounds compiler warning
        scsi: lpfc: Update lpfc version to 14.2.0.4
        scsi: lpfc: Allow reduced polling rate for nvme_admin_async_event cmd completion
        scsi: lpfc: Add more logging of cmd and cqe information for aborted NVMe cmds
        scsi: lpfc: Fix port stuck in bypassed state after LIP in PT2PT topology
        scsi: lpfc: Resolve NULL ptr dereference after an ELS LOGO is aborted
        scsi: lpfc: Address NULL pointer dereference after starget_to_rport()
        scsi: lpfc: Resolve some cleanup issues following SLI path refactoring
        scsi: lpfc: Resolve some cleanup issues following abort path refactoring
        scsi: lpfc: Correct BDE type for XMIT_SEQ64_WQE in lpfc_ct_reject_event()
        scsi: vmw_pvscsi: Expand vcpuHint to 16 bits
        scsi: sd: Fix interpretation of VPD B9h length
    • Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · abe71eb3
      Linus Torvalds authored
      Pull virtio fixes from Michael Tsirkin:
       "Fixes all over the place, most notably fixes for latent bugs in
        drivers that got exposed by suppressing interrupts before DRIVER_OK,
        which in turn has been done by 8b4ec69d ("virtio: harden vring
        IRQ")"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        um: virt-pci: set device ready in probe()
        vdpa: make get_vq_group and set_group_asid optional
        virtio: Fix all occurences of the "the the" typo
        vduse: Fix NULL pointer dereference on sysfs access
        vringh: Fix loop descriptors check in the indirect cases
        vdpa/mlx5: clean up indenting in handle_ctrl_vlan()
        vdpa/mlx5: fix error code for deleting vlan
        virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed
        vdpa/mlx5: Fix syntax errors in comments
        virtio-rng: make device ready before making request
    • Merge tag 'loongarch-fixes-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson · 0678afa6
      Linus Torvalds authored
      
      Pull LoongArch fixes from Huacai Chen:
       "Fix build errors and a stale comment"
      
      * tag 'loongarch-fixes-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: Remove MIPS comment about cycle counter
        LoongArch: Fix copy_thread() build errors
        LoongArch: Fix the !CONFIG_SMP build
    • iov_iter: fix build issue due to possible type mis-match · 1c27f1fc
      Linus Torvalds authored
      Commit 6c776766 ("iov_iter: Fix iter_xarray_get_pages{,_alloc}()")
      introduced a problem on some 32-bit architectures (at least arm, xtensa,
      csky, sparc and mips) that have a 'size_t' that is 'unsigned int'.
      
      The reason is that we now do
      
          min(nr * PAGE_SIZE - offset, maxsize);
      
      where 'nr' and 'offset' are both 'unsigned int', and PAGE_SIZE is
      'unsigned long'.  As a result, the normal C type rules mean that the
      first argument to 'min()' ends up being 'unsigned long'.
      
      In contrast, 'maxsize' is of type 'size_t'.
      
      Now, 'size_t' and 'unsigned long' are always the same physical type in
      the kernel, so you'd think this doesn't matter, and from an actual
      arithmetic standpoint it doesn't.
      
      But on 32-bit architectures 'size_t' is commonly 'unsigned int', even if
      it could also be 'unsigned long'.  In that situation, both are unsigned
      32-bit types, but they are not the *same* type.
      
      And as a result 'min()' will complain about the distinct types (ignore
      the "pointer types" part of the error message: that's an artifact of the
      way we have made 'min()' check types for being the same):
      
        lib/iov_iter.c: In function 'iter_xarray_get_pages':
        include/linux/minmax.h:20:35: error: comparison of distinct pointer types lacks a cast [-Werror]
           20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
              |                                   ^~
        lib/iov_iter.c:1464:16: note: in expansion of macro 'min'
         1464 |         return min(nr * PAGE_SIZE - offset, maxsize);
              |                ^~~
      
      This was not visible on 64-bit architectures (where we always define
      'size_t' to be 'unsigned long').
      
      Force these cases to use 'min_t(size_t, x, y)' to make the type explicit
      and avoid the issue.
      
      [ Nit-picky note: technically 'size_t' doesn't have to match 'unsigned
        long' arithmetically. We've certainly historically seen environments
        with 16-bit address spaces and 32-bit 'unsigned long'.
      
        Similarly, even in 64-bit modern environments, 'size_t' could be its
        own type distinct from 'unsigned long', even if it were arithmetically
        identical.
      
        So the above type commentary is only really descriptive of the kernel
        environment, not some kind of universal truth for the kinds of wild
        and crazy situations that are allowed by the C standard ]
      Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Link: https://lore.kernel.org/all/YqRyL2sIqQNDfky2@debian/
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c27f1fc
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use maximum cpu features and allow rng seeding · 17b0128a
      Jason A. Donenfeld authored
      By forcing the maximum CPU that QEMU has available, we expose additional
      capabilities, such as the RNDR instruction, which increases test
      coverage. This then allows the CI to skip the fake seeding step in some
      cases. Also enable STRICT_KERNEL_RWX to catch issues related to early
      jump labels when the RNG is initialized at boot.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      17b0128a
    • Kuan-Ying Lee's avatar
      scripts/gdb: change kernel config dumping method · 1f7a6cf6
      Kuan-Ying Lee authored
      MAGIC_START("IKCFG_ST") and MAGIC_END("IKCFG_ED") are moved out
      from the kernel_config_data variable.
      
      Thus, we parse kernel_config_data directly instead of considering
      the offsets of MAGIC_START and MAGIC_END.
      
      Fixes: 13610aa9 ("kernel/configs: use .incbin directive to embed config_data.gz")
      Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      1f7a6cf6
    • Vincent Whitchurch's avatar
      um: virt-pci: set device ready in probe() · eacea844
      Vincent Whitchurch authored
      Call virtio_device_ready() to make this driver work after commit
      8b4ec69d7e09 ("virtio: harden vring IRQ"), since the driver uses the
      virtqueues in the probe function.  (The virtio core sets the device
      ready when probe returns.)
      
      Fixes: 8b4ec69d ("virtio: harden vring IRQ")
      Fixes: 68f5d3f3 ("um: add PCI over virtio emulation driver")
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Message-Id: <20220610151203.3492541-1-vincent.whitchurch@axis.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Tested-by: Johannes Berg <johannes@sipsolutions.net>
      eacea844
    • Linus Torvalds's avatar
      Merge tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · 0885eacd
      Linus Torvalds authored
      Pull nfsd fixes from Chuck Lever:
       "Notable changes:
      
         - There is now a backup maintainer for NFSD
      
        Notable fixes:
      
         - Prevent array overruns in svc_rdma_build_writes()
      
         - Prevent buffer overruns when encoding NFSv3 READDIR results
      
         - Fix a potential UAF in nfsd_file_put()"
      
      * tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer()
        SUNRPC: Clean up xdr_get_next_encode_buffer()
        SUNRPC: Clean up xdr_commit_encode()
        SUNRPC: Optimize xdr_reserve_space()
        SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer()
        SUNRPC: Trap RDMA segment overflows
        NFSD: Fix potential use-after-free in nfsd_file_put()
        MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
      0885eacd
  4. 10 Jun, 2022 11 commits