1. 08 Sep, 2023 27 commits
    • bpf: task_group_seq_get_next: simplify the "next tid" logic · 780aa8df
      Oleg Nesterov authored
      Kill saved_tid. It looks ugly to update *tid and then restore the
      previous value if __task_pid_nr_ns() returns 0. Change this code
      to update *tid and common->pid_visiting once before return.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230905154656.GA24950@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: task_group_seq_get_next: kill next_task · 0ee9808b
      Oleg Nesterov authored
      It only adds unnecessary confusion and complicates the "retry" code.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230905154654.GA24945@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: task_group_seq_get_next: fix the skip_if_dup_files check · 87abbf7a
      Oleg Nesterov authored
      Unless I am totally confused it is wrong. We are going to return or
      skip next_task so we need to check next_task->files, not task->files.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230905154651.GA24940@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: task_group_seq_get_next: cleanup the usage of get/put_task_struct · 49819213
      Oleg Nesterov authored
      get_pid_task() makes no sense, the code does put_task_struct() soon after.
      Use find_task_by_pid_ns() instead of find_pid_ns + get_pid_task and kill
      put_task_struct(), this allows us to do get_task_struct() only once before
      return.

      While at it, kill the unnecessary "if (!pid)" check in the "if (!*tid)"
      block, this matches the next usage of find_pid_ns() + get_pid_task() in
      this function.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230905154649.GA24935@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: task_group_seq_get_next: cleanup the usage of next_thread() · 1a00ef57
      Oleg Nesterov authored
      1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
         can safely iterate the task->thread_group list. Even if this task exits
         right after get_pid_task() (or goto retry) and pid_alive() returns 0.
      
         Kill the unnecessary pid_alive() check.
      
      2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
         check.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230905154646.GA24928@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-enable-irq-after-irq_work_raise-completes' · 35897c3c
      Alexei Starovoitov authored
      Hou Tao says:
      
      ====================
      bpf: Enable IRQ after irq_work_raise() completes
      
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      The patchset aims to fix the problem that bpf_mem_alloc() may return
      NULL unexpectedly when multiple bpf_mem_alloc() are invoked concurrently
      under process context and there is still free memory available. The
      problem was found when doing stress test for qp-trie but the same
      problem also exists for bpf_obj_new() as demonstrated in patch #3.
      
      As pointed out by Alexei, the patchset can only fix ENOMEM problem for
      normal process context and can not fix the problem for irq-disabled
      context or RT-enabled kernel.
      
      Patch #1 fixes the race between unit_alloc() and unit_alloc(). Patch #2
      fixes the race between unit_alloc() and unit_free(). And patch #3 adds
      a selftest for the problem. The major change compared with v1 is using
      the local_irq_{save,restore}() pair to disable and enable preemption
      instead of the preempt_{disable,enable}_notrace() pair. The main reason
      is to prevent potential overhead from __preempt_schedule_notrace(). I
      also ran the htab_mem benchmark and hash_map_perf on an 8-CPU KVM VM to
      compare the performance between local_irq_{save,restore}() and
      preempt_{disable,enable}_notrace(), and the results are similar, as
      shown below:
      
      (1) use preempt_{disable,enable}_notrace()
      
      [root@hello bpf]# ./map_perf_test 4 8
      0:hash_map_perf kmalloc 652179 events per sec
      1:hash_map_perf kmalloc 651880 events per sec
      2:hash_map_perf kmalloc 651382 events per sec
      3:hash_map_perf kmalloc 650791 events per sec
      5:hash_map_perf kmalloc 650140 events per sec
      6:hash_map_perf kmalloc 652773 events per sec
      7:hash_map_perf kmalloc 652751 events per sec
      4:hash_map_perf kmalloc 648199 events per sec
      
      [root@hello bpf]# ./benchs/run_bench_htab_mem.sh
      normal bpf ma
      =============
      overwrite            per-prod-op: 110.82 ± 0.02k/s, avg mem: 2.00 ± 0.00MiB, peak mem: 2.73MiB
      batch_add_batch_del  per-prod-op: 89.79 ± 0.75k/s, avg mem: 1.68 ± 0.38MiB, peak mem: 2.73MiB
      add_del_on_diff_cpu  per-prod-op: 17.83 ± 0.07k/s, avg mem: 25.68 ± 2.92MiB, peak mem: 35.10MiB
      
      (2) use local_irq_{save,restore}
      
      [root@hello bpf]# ./map_perf_test 4 8
      0:hash_map_perf kmalloc 656299 events per sec
      1:hash_map_perf kmalloc 656397 events per sec
      2:hash_map_perf kmalloc 656046 events per sec
      3:hash_map_perf kmalloc 655723 events per sec
      5:hash_map_perf kmalloc 655221 events per sec
      4:hash_map_perf kmalloc 654617 events per sec
      6:hash_map_perf kmalloc 650269 events per sec
      7:hash_map_perf kmalloc 653665 events per sec
      
      [root@hello bpf]# ./benchs/run_bench_htab_mem.sh
      normal bpf ma
      =============
      overwrite            per-prod-op: 116.10 ± 0.02k/s, avg mem: 2.00 ± 0.00MiB, peak mem: 2.74MiB
      batch_add_batch_del  per-prod-op: 88.76 ± 0.61k/s, avg mem: 1.94 ± 0.33MiB, peak mem: 2.74MiB
      add_del_on_diff_cpu  per-prod-op: 18.12 ± 0.08k/s, avg mem: 25.10 ± 2.70MiB, peak mem: 34.78MiB
      
      As usual, comments are always welcome.
      
      Change Log:
      v2:
        * Use local_irq_save to disable preemption instead of using
          preempt_{disable,enable}_notrace pair to prevent potential overhead
      
      v1: https://lore.kernel.org/bpf/20230822133807.3198625-1-houtao@huaweicloud.com/
      ====================
      
      Link: https://lore.kernel.org/r/20230901111954.1804721-1-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test preemption between bpf_obj_new() and bpf_obj_drop() · 29c11aa8
      Hou Tao authored
      The test case creates 4 threads and then pins these 4 threads to CPU 0.
      These 4 threads will run different bpf programs through
      bpf_prog_test_run_opts() and these bpf programs will use bpf_obj_new()
      and bpf_obj_drop() to allocate and free local kptrs concurrently.

      Under a preemptible kernel, bpf_obj_new() and bpf_obj_drop() may preempt
      each other, so bpf_obj_new() may return NULL and the test will fail
      before these fixes are applied, as shown below:
      
        test_preempted_bpf_ma_op:PASS:open_and_load 0 nsec
        test_preempted_bpf_ma_op:PASS:attach 0 nsec
        test_preempted_bpf_ma_op:PASS:no test prog 0 nsec
        test_preempted_bpf_ma_op:PASS:no test prog 0 nsec
        test_preempted_bpf_ma_op:PASS:no test prog 0 nsec
        test_preempted_bpf_ma_op:PASS:no test prog 0 nsec
        test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec
        test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec
        test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec
        test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec
        test_preempted_bpf_ma_op:PASS:run prog err 0 nsec
        test_preempted_bpf_ma_op:PASS:run prog err 0 nsec
        test_preempted_bpf_ma_op:PASS:run prog err 0 nsec
        test_preempted_bpf_ma_op:PASS:run prog err 0 nsec
        test_preempted_bpf_ma_op:FAIL:ENOMEM unexpected ENOMEM: got TRUE
        #168     preempted_bpf_ma_op:FAIL
        Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20230901111954.1804721-4-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Enable IRQ after irq_work_raise() completes in unit_free{_rcu}() · 62cf51cb
      Hou Tao authored
      Both unit_free() and unit_free_rcu() invoke irq_work_raise() to free
      freed objects back to slab and the invocation may also be preempted by
      unit_alloc() and unit_alloc() may return NULL unexpectedly as shown in
      the following case:
      
      task A         task B
      
      unit_free()
        // high_watermark = 48
        // free_cnt = 49 after free
        irq_work_raise()
          // mark irq work as IRQ_WORK_PENDING
          irq_work_claim()
      
                     // task B preempts task A
                     unit_alloc()
                       // free_cnt = 48 after alloc
      
                     // does unit_alloc() 32-times
                     ......
                     // free_cnt = 16

                     unit_alloc()
                       // free_cnt = 15 after alloc
                       // irq work is already PENDING,
                       // so just return
                       irq_work_raise()

                     // does unit_alloc() 15-times
                     ......
                     // free_cnt = 0
      
                     unit_alloc()
                       // free_cnt = 0 before alloc
                       return NULL
      
      Fix it by enabling IRQ after irq_work_raise() completes.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20230901111954.1804721-3-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-add-support-for-local-percpu-kptr' · 1e4a6d97
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      bpf: Add support for local percpu kptr
      
      Patch set [1] implemented cgroup local storage BPF_MAP_TYPE_CGRP_STORAGE
      similar to sk/task/inode local storage and old BPF_MAP_TYPE_CGROUP_STORAGE
      map is marked as deprecated since old BPF_MAP_TYPE_CGROUP_STORAGE map can
      only work with current cgroup.
      
      Similarly, the existing BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE map
      is a percpu version of BPF_MAP_TYPE_CGROUP_STORAGE and only works
      with current cgroup. But there is no replacement which can work
      with arbitrary cgroup.
      
      This patch set solves this problem by adding support for local
      percpu kptr. The map value can have a percpu kptr field which holds
      percpu data allocated by a bpf prog. Below is an example:

        struct percpu_val_t {
          ... fields ...
        };

        struct map_value_t {
          struct percpu_val_t __percpu_kptr *percpu_data_ptr;
        };
      
      In the above, 'map_value_t' is the map value type for a
      BPF_MAP_TYPE_CGRP_STORAGE map. User can access 'percpu_data_ptr'
      and then read/write percpu data. This covers BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE
      and more. So BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE map type
      is marked as deprecated.
      
      In addition, local percpu kptr supports the same map types
      as other kptrs, including hash, lru_hash, array, sk/inode/task/cgrp
      local storage. Currently, percpu data structures do not support
      non-scalars or special fields (e.g., bpf_spin_lock, bpf_rb_root, etc.).
      They can be supported in the future if use cases arise.

      Please see individual patches for details.
      
        [1] https://lore.kernel.org/all/20221026042835.672317-1-yhs@fb.com/
      
      Changelog:
        v2 -> v3:
          - fix libbpf_str test failure.
        v1 -> v2:
          - does not support special fields in percpu data structure.
          - rename __percpu attr to __percpu_kptr attr.
          - rename BPF_KPTR_PERCPU_REF to BPF_KPTR_PERCPU.
          - better code to handle bpf_{this,per}_cpu_ptr() helpers.
          - add more negative tests.
          - fix a bpftool related test failure.
      ====================
      
      Link: https://lore.kernel.org/r/20230827152729.1995219-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Enable IRQ after irq_work_raise() completes in unit_alloc() · 566f6de3
      Hou Tao authored
      When doing a stress test for qp-trie, bpf_mem_alloc() returned NULL
      unexpectedly: all qp-trie operations were initiated from bpf syscalls
      and there was still free memory available. bpf_obj_new() has the same
      problem, as shown by the following selftest.

      The failure is due to preemption. irq_work_raise() will invoke
      irq_work_claim() first to mark the irq work as pending and then invoke
      __irq_work_queue_local() to raise an IPI. So when the current task
      which is invoking irq_work_raise() is preempted by another task,
      unit_alloc() may return NULL for the preempting task as shown below:
      
      task A         task B
      
      unit_alloc()
        // low_watermark = 32
        // free_cnt = 31 after alloc
        irq_work_raise()
          // mark irq work as IRQ_WORK_PENDING
          irq_work_claim()
      
                     // task B preempts task A
                     unit_alloc()
                       // free_cnt = 30 after alloc
                       // irq work is already PENDING,
                       // so just return
                       irq_work_raise()
                     // does unit_alloc() 30-times
                     ......
                     unit_alloc()
                       // free_cnt = 0 before alloc
                       return NULL
      
      Fix it by enabling IRQ after irq_work_raise() completes. An alternative
      fix is using preempt_{disable|enable}_notrace() pair, but it may have
      extra overhead. Another feasible fix is to only disable preemption or
      IRQ before invoking irq_work_queue() and enable preemption or IRQ after
      the invocation completes, but it can't handle the case when
      c->low_watermark is 1.
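
      The interleaving above can be replayed in a small userspace model. This
      is only a sketch under stated assumptions: cache_model, claim_work(),
      run_refill(), model_alloc() and failed_allocs_in_burst() are
      hypothetical names, the refill batch of 32 is an assumed value, and the
      refill_runs_now flag stands in for whether the task that claimed the
      irq work gets to queue it before being preempted. It is not the
      kernel's unit_alloc().

```c
#include <stdbool.h>

/* Toy model of one per-CPU cache. Illustration only, not kernel code. */
struct cache_model {
	int free_cnt;        /* objects currently cached */
	int low_watermark;   /* a refill is requested below this level */
	int refill_batch;    /* assumed number of objects added per refill */
	bool work_pending;   /* stands in for IRQ_WORK_PENDING */
};

/* Stands in for irq_work_claim(): only one claimer wins while work is pending. */
static bool claim_work(struct cache_model *c)
{
	if (c->work_pending)
		return false;
	c->work_pending = true;
	return true;
}

/* Stands in for the irq work callback that refills the cache. */
static void run_refill(struct cache_model *c)
{
	c->free_cnt += c->refill_batch;
	c->work_pending = false;
}

/*
 * One allocation attempt. If this caller claims the work but
 * refill_runs_now is false, the work stays pending and nobody refills,
 * which models preemption between the claim and the queueing step.
 */
static bool model_alloc(struct cache_model *c, bool refill_runs_now)
{
	bool ok = false;

	if (c->free_cnt > 0) {
		c->free_cnt--;
		ok = true;
	}
	if (c->free_cnt < c->low_watermark && claim_work(c) && refill_runs_now)
		run_refill(c);
	return ok;
}

/* Task A allocates once, then task B allocates 'burst' times. */
static int failed_allocs_in_burst(bool claimer_preempted, int burst)
{
	struct cache_model c = {
		.free_cnt = 32, .low_watermark = 32,
		.refill_batch = 32, .work_pending = false,
	};
	int failures = 0;

	model_alloc(&c, !claimer_preempted);   /* task A */
	for (int i = 0; i < burst; i++)        /* task B */
		if (!model_alloc(&c, true))
			failures++;
	return failures;
}
```

      In this model, failed_allocs_in_burst(true, 40) counts 9 failed
      allocations because the pending flag blocks every later refill request,
      while failed_allocs_in_burst(false, 40) counts none.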
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated · 9bc95a95
      Yonghong Song authored
      Now 'BPF_MAP_TYPE_CGRP_STORAGE + local percpu ptr'
      can cover all BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE functionality
      and more. So mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated.
      Also make changes in selftests/bpf/test_bpftool_synctypes.py
      and selftest libbpf_str to fix test errors that would otherwise occur.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152837.2003563-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add some negative tests · 1bd79317
      Yonghong Song authored
      Add a few negative tests for common mistakes with using percpu kptr
      including:
        - store to percpu kptr.
        - type mismatch in bpf_kptr_xchg arguments.
        - sleepable prog with untrusted arg for bpf_this_cpu_ptr().
        - bpf_percpu_obj_new && bpf_obj_drop, and bpf_obj_new && bpf_percpu_obj_drop
        - struct with ptr for bpf_percpu_obj_new
        - struct with special field (e.g., bpf_spin_lock) for bpf_percpu_obj_new
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152832.2002421-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add tests for cgrp_local_storage with local percpu kptr · dfae1eee
      Yonghong Song authored
      Add a non-sleepable cgrp_local_storage test with percpu kptr. The
      test does allocation of percpu data, assigning values to percpu
      data and retrieval of percpu data. The de-allocation of percpu
      data is done when the map is freed.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152827.2001784-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Remove unnecessary direct read of local percpu kptr · 46200d6d
      Yonghong Song authored
      For the second argument of bpf_kptr_xchg(), if the reg type contains
      MEM_ALLOC and MEM_PERCPU, which means a percpu allocation,
      after bpf_kptr_xchg(), the argument is marked as MEM_RCU and MEM_PERCPU
      if in rcu critical section. This way, re-reading from the map value
      is not needed. Remove it from the percpu_alloc_array.c selftest.
      
      Without previous kernel change, the test will fail like below:
      
        0: R1=ctx(off=0,imm=0) R10=fp0
        ; int BPF_PROG(test_array_map_10, int a)
        0: (b4) w1 = 0                        ; R1_w=0
        ; int i, index = 0;
        1: (63) *(u32 *)(r10 -4) = r1         ; R1_w=0 R10=fp0 fp-8=0000????
        2: (bf) r2 = r10                      ; R2_w=fp0 R10=fp0
        ;
        3: (07) r2 += -4                      ; R2_w=fp-4
        ; e = bpf_map_lookup_elem(&array, &index);
        4: (18) r1 = 0xffff88810e771800       ; R1_w=map_ptr(off=0,ks=4,vs=16,imm=0)
        6: (85) call bpf_map_lookup_elem#1    ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0)
        7: (bf) r6 = r0                       ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) R6_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0)
        ; if (!e)
        8: (15) if r6 == 0x0 goto pc+81       ; R6_w=map_value(off=0,ks=4,vs=16,imm=0)
        ; bpf_rcu_read_lock();
        9: (85) call bpf_rcu_read_lock#87892          ;
        ; p = e->pc;
        10: (bf) r7 = r6                      ; R6=map_value(off=0,ks=4,vs=16,imm=0) R7_w=map_value(off=0,ks=4,vs=16,imm=0)
        11: (07) r7 += 8                      ; R7_w=map_value(off=8,ks=4,vs=16,imm=0)
        12: (79) r6 = *(u64 *)(r6 +8)         ; R6_w=percpu_rcu_ptr_or_null_val_t(id=2,off=0,imm=0)
        ; if (!p) {
        13: (55) if r6 != 0x0 goto pc+13      ; R6_w=0
        ; p = bpf_percpu_obj_new(struct val_t);
        14: (18) r1 = 0x12                    ; R1_w=18
        16: (b7) r2 = 0                       ; R2_w=0
        17: (85) call bpf_percpu_obj_new_impl#87883   ; R0_w=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) refs=4
        18: (bf) r6 = r0                      ; R0=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) R6=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) refs=4
        ; if (!p)
        19: (15) if r6 == 0x0 goto pc+69      ; R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) refs=4
        ; p1 = bpf_kptr_xchg(&e->pc, p);
        20: (bf) r1 = r7                      ; R1_w=map_value(off=8,ks=4,vs=16,imm=0) R7=map_value(off=8,ks=4,vs=16,imm=0) refs=4
        21: (bf) r2 = r6                      ; R2_w=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) refs=4
        22: (85) call bpf_kptr_xchg#194       ; R0_w=percpu_ptr_or_null_val_t(id=6,ref_obj_id=6,off=0,imm=0) refs=6
        ; if (p1) {
        23: (15) if r0 == 0x0 goto pc+3       ; R0_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) refs=6
        ; bpf_percpu_obj_drop(p1);
        24: (bf) r1 = r0                      ; R0_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) R1_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) refs=6
        25: (b7) r2 = 0                       ; R2_w=0 refs=6
        26: (85) call bpf_percpu_obj_drop_impl#87882          ;
        ; v = bpf_this_cpu_ptr(p);
        27: (bf) r1 = r6                      ; R1_w=scalar(id=7) R6=scalar(id=7)
        28: (85) call bpf_this_cpu_ptr#154
        R1 type=scalar expected=percpu_ptr_, percpu_rcu_ptr_, percpu_trusted_ptr_
      
      The R1 which gets its value from R6 is a scalar. But before insn 22, R6 is
        R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0)
      Its type is changed to a scalar at insn 22 without previous patch.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152821.2001129-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Mark OBJ_RELEASE argument as MEM_RCU when possible · 5b221ecb
      Yonghong Song authored
      In previous selftests/bpf patch, we have
        p = bpf_percpu_obj_new(struct val_t);
        if (!p)
                goto out;
      
        p1 = bpf_kptr_xchg(&e->pc, p);
        if (p1) {
                /* race condition */
                bpf_percpu_obj_drop(p1);
        }
      
        p = e->pc;
        if (!p)
                goto out;
      
      After bpf_kptr_xchg(), we need to re-read e->pc into 'p'.
      This is because the second argument of bpf_kptr_xchg() is marked
      OBJ_RELEASE and it will be marked as invalid after the call.
      So after bpf_kptr_xchg(), 'p' is an unknown scalar,
      and the bpf program needs to re-read it from the map value.
      
      This patch checks if the 'p' has type MEM_ALLOC and MEM_PERCPU,
      and if 'p' is RCU protected. If this is the case, 'p' can be marked
      as MEM_RCU. MEM_ALLOC needs to be removed since 'p' is not
      an owning reference any more. Such a change makes re-read
      from the map value unnecessary.
      
      Note that re-reading 'e->pc' after bpf_kptr_xchg() might get
      a different value from 'p' if immediately before 'p = e->pc',
      another cpu may do another bpf_kptr_xchg() and swap in another value
      into 'e->pc'. If this is the case, then 'p = e->pc' may
      get either 'p' or another value, and race condition already exists.
      So removing direct re-reading seems fine too.
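
      The type transition described above can be written down as a tiny
      userspace model. This is a sketch only: the TOY_* flag values and
      toy_type_after_xchg() are hypothetical and are not the verifier's
      real definitions. It shows just the rule stated in this message: a
      MEM_ALLOC | MEM_PERCPU argument inside an RCU critical section loses
      MEM_ALLOC and gains MEM_RCU, and anything else is invalidated.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
enum toy_type_flag {
	TOY_MEM_ALLOC  = 1 << 0,   /* owning reference to an allocated object */
	TOY_MEM_PERCPU = 1 << 1,   /* object is a percpu allocation */
	TOY_MEM_RCU    = 1 << 2,   /* non-owning, valid under RCU read lock */
	TOY_SCALAR     = 1 << 3,   /* invalidated: treated as unknown scalar */
};

/*
 * Models what happens to the released (second) argument of
 * bpf_kptr_xchg() after the call, per the rule in the commit message.
 */
static unsigned int toy_type_after_xchg(unsigned int flags, bool in_rcu_cs)
{
	const unsigned int percpu_alloc = TOY_MEM_ALLOC | TOY_MEM_PERCPU;

	if (in_rcu_cs && (flags & percpu_alloc) == percpu_alloc)
		/* No longer an owning reference, but still usable under RCU. */
		return (flags & ~(unsigned int)TOY_MEM_ALLOC) | TOY_MEM_RCU;

	/* Old behaviour: the released argument becomes an unknown scalar. */
	return TOY_SCALAR;
}
```

      With this model, a percpu allocation released inside an RCU critical
      section keeps a usable pointer type, which is why the re-read of
      'e->pc' becomes unnecessary.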
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152816.2000760-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add tests for array map with local percpu kptr · 6adf82a4
      Yonghong Song authored
      Add non-sleepable and sleepable tests with percpu kptr. For
      the non-sleepable test, four programs are executed in the order of:
        1. allocate percpu data.
        2. assign values to percpu data.
        3. retrieve percpu data.
        4. de-allocate percpu data.

      The sleepable prog tries to exercise all 4 steps above in a
      single prog. Also for the sleepable prog, rcu_read_lock is needed
      to protect direct percpu ptr access (from map value) and
      the following bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152811.2000125-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add bpf_percpu_obj_{new,drop}() macro in bpf_experimental.h · 968c76cb
      Yonghong Song authored
      The new macros bpf_percpu_obj_{new,drop}() are very similar to
      bpf_obj_{new,drop}() as they both take a type as the argument.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152805.1999417-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: Add __percpu_kptr macro definition · ed5285a1
      Yonghong Song authored
      Add __percpu_kptr macro definition in bpf_helpers.h.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152800.1998492-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: Add basic BTF sanity validation · 3903802b
      Andrii Nakryiko authored
      Implement a simple and straightforward BTF sanity check when parsing BTF
      data. Right now it's very basic and just validates that all the string
      offsets and type IDs are within valid range. For FUNC we also check that
      it points to FUNC_PROTO kinds.
      
      Even with such simple checks it fixes a bunch of crashes found by OSS
      fuzzer ([0]-[5]) and will allow fuzzer to make further progress.
      
      Some other invariants will be checked in follow-up patches (like
      ensuring there are no infinite type loops), but this seems like a good
      start already.
      
      Adding FUNC -> FUNC_PROTO check revealed that one of selftests has
      a problem with FUNC pointing to VAR instead, so fix it up in the same
      commit.
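
      The kind of check described above can be sketched on a simplified
      type table. Everything here is an assumption made for illustration:
      toy_kind, toy_type, toy_btf and toy_btf_sanity_check() are
      hypothetical and do not reflect libbpf's actual data structures or
      function names. The sketch only covers the three rules named in this
      message: string offsets in range, type IDs in range, and FUNC
      pointing to FUNC_PROTO.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for BTF kinds. */
enum toy_kind { TOY_KIND_INT, TOY_KIND_PTR, TOY_KIND_FUNC_PROTO, TOY_KIND_FUNC };

struct toy_type {
	uint32_t name_off;   /* offset into the string section */
	enum toy_kind kind;
	uint32_t ref_id;     /* referenced type ID; unused for TOY_KIND_INT */
};

struct toy_btf {
	const struct toy_type *types;  /* types[i] has ID i + 1; ID 0 is void */
	uint32_t nr_types;
	uint32_t str_len;              /* size of the string section in bytes */
};

/* A type ID is in range if it is void (0) or names an existing type. */
static bool toy_id_valid(const struct toy_btf *btf, uint32_t id)
{
	return id <= btf->nr_types;
}

static bool toy_btf_sanity_check(const struct toy_btf *btf)
{
	for (uint32_t i = 0; i < btf->nr_types; i++) {
		const struct toy_type *t = &btf->types[i];

		/* Every name must start inside the string section. */
		if (t->name_off >= btf->str_len)
			return false;

		/* INT refers to no other type in this model. */
		if (t->kind == TOY_KIND_INT)
			continue;

		if (!toy_id_valid(btf, t->ref_id))
			return false;

		/* FUNC must point at a FUNC_PROTO, never at void or another kind. */
		if (t->kind == TOY_KIND_FUNC &&
		    (t->ref_id == 0 ||
		     btf->types[t->ref_id - 1].kind != TOY_KIND_FUNC_PROTO))
			return false;
	}
	return true;
}
```

      Checking ranges before dereferencing is the point of the exercise:
      the FUNC rule indexes the type table only after the ID has been
      validated, so malformed input is rejected rather than followed.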
      
        [0] https://github.com/libbpf/libbpf/issues/482
        [1] https://github.com/libbpf/libbpf/issues/483
        [2] https://github.com/libbpf/libbpf/issues/485
        [3] https://github.com/libbpf/libbpf/issues/613
        [4] https://github.com/libbpf/libbpf/issues/618
        [5] https://github.com/libbpf/libbpf/issues/619
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
      Reviewed-by: Song Liu <song@kernel.org>
      Closes: https://github.com/libbpf/libbpf/issues/617
      Link: https://lore.kernel.org/bpf/20230825202152.1813394-1-andrii@kernel.org
    • selftests/bpf: Update error message in negative linked_list test · 96fc99d3
      Yonghong Song authored
      Some error messages are changed due to the addition of
      percpu kptr support. Fix linked_list test with changed
      error messages.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152754.1997769-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add bpf_this_cpu_ptr/bpf_per_cpu_ptr support for allocated percpu obj · 01cc55af
      Yonghong Song authored
      The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed
      for allocated percpu objects. For an allocated percpu obj,
      the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'.

      The return type for these two re-purposed helpers is
      'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'.
      The MEM_ALLOC flag allows the per-cpu data to be read and written.

      Since the memory allocator bpf_mem_alloc() returns
      a ptr to a percpu ptr for percpu data, the first argument
      of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched
      with a dereference before passing to the helper func.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add alloc/xchg/direct_access support for local percpu kptr · 36d8bdf7
      Yonghong Song authored
      Add two new kfuncs, bpf_percpu_obj_new_impl() and
      bpf_percpu_obj_drop_impl(), to allocate and free a percpu obj.
      The two functions are very similar to bpf_obj_new_impl()
      and bpf_obj_drop_impl(). The major difference is related
      to percpu handling.
      
          bpf_rcu_read_lock()
          struct val_t __percpu_kptr *v = map_val->percpu_data;
          ...
          bpf_rcu_read_unlock()
      
      For a percpu data map_val like above 'v', the reg->type
      is set as
      	PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU
      if inside rcu critical section.
      
      MEM_RCU marking here is similar to NON_OWN_REF as 'v'
      is not an owning reference. But NON_OWN_REF is
      trusted and typically inside the spinlock while
      MEM_RCU is under rcu read lock. RCU is preferred here
      since percpu data structures mean potential concurrent
      access into their contents.
      
      Also, bpf_percpu_obj_new_impl() is restricted such that
      no pointers or special fields are allowed. Therefore,
      the bpf_list_head and bpf_rb_root will not be supported
      in this patch set to avoid potential memory leak issue
      due to racing between bpf_obj_free_fields() and another
      bpf_kptr_xchg() moving an allocated object to
      bpf_list_head and bpf_rb_root.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152744.1996739-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add BPF_KPTR_PERCPU as a field type · 55db92f4
      Yonghong Song authored
      BPF_KPTR_PERCPU represents a percpu field type like below
      
        struct val_t {
          ... fields ...
        };
        struct t {
          ...
          struct val_t __percpu_kptr *percpu_data_ptr;
          ...
        };
      
      where
        #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))
      
      While BPF_KPTR_REF points to a trusted kernel object or a trusted
      local object, BPF_KPTR_PERCPU points to a trusted local
      percpu object.
      
      This patch added basic support for BPF_KPTR_PERCPU
      related to percpu_kptr field parsing, recording and free operations.
      BPF_KPTR_PERCPU also supports the same map types
      as BPF_KPTR_REF does.
      
      Note that unlike a local kptr, it is possible that
      a BPF_KPTR_PERCPU struct may not contain any
      special fields like other kptr, bpf_spin_lock, bpf_list_head, etc.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add support for non-fix-size percpu mem allocation · 41a5db8d
      Yonghong Song authored
      This is needed for later percpu mem allocation when the
      allocation is done by bpf program. For such cases, a global
      bpf_global_percpu_ma is added where a flexible allocation
      size is needed.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Linus Torvalds's avatar
      Merge tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 73be7fb1
      Linus Torvalds authored
      Pull networking updates from Jakub Kicinski:
       "Including fixes from netfilter and bpf.
      
        Current release - regressions:
      
         - eth: stmmac: fix failure to probe without MAC interface specified
      
        Current release - new code bugs:
      
         - docs: netlink: fix missing classic_netlink doc reference
      
        Previous releases - regressions:
      
         - deal with integer overflows in kmalloc_reserve()
      
         - use sk_forward_alloc_get() in sk_get_meminfo()
      
         - bpf_sk_storage: fix the missing uncharge in sk_omem_alloc
      
         - fib: avoid warn splat in flow dissector after packet mangling
      
         - skb_segment: call zero copy functions before using skbuff frags
      
         - eth: sfc: check for zero length in EF10 RX prefix
      
        Previous releases - always broken:
      
         - af_unix: fix msg_controllen test in scm_pidfd_recv() for
           MSG_CMSG_COMPAT
      
         - xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()
      
         - netfilter:
            - nft_exthdr: fix non-linear header modification
            - xt_u32, xt_sctp: validate user space input
            - nftables: exthdr: fix 4-byte stack OOB write
            - nfnetlink_osf: avoid OOB read
            - one more fix for the garbage collection work from last release
      
         - igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
      
         - bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t
      
         - handshake: fix null-deref in handshake_nl_done_doit()
      
         - ip: ignore dst hint for multipath routes to ensure packets are
           hashed across the nexthops
      
         - phy: micrel:
            - correct bit assignments for cable test errata
            - disable EEE according to the KSZ9477 errata
      
        Misc:
      
         - docs/bpf: document compile-once-run-everywhere (CO-RE) relocations
      
         - Revert "net: macsec: preserve ingress frame ordering", it appears
           to have been developed against an older kernel, problem doesn't
           exist upstream"
      
      * tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
        net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs()
        Revert "net: team: do not use dynamic lockdep key"
        net: hns3: remove GSO partial feature bit
        net: hns3: fix the port information display when sfp is absent
        net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue
        net: hns3: fix debugfs concurrency issue between kfree buffer and read
        net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read()
        net: hns3: Support query tx timeout threshold by debugfs
        net: hns3: fix tx timeout issue
        net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)
        netfilter: nf_tables: Unbreak audit log reset
        netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
        netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
        netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
        netfilter: nfnetlink_osf: avoid OOB read
        netfilter: nftables: exthdr: fix 4-byte stack OOB write
        selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
        bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
        bpf: bpf_sk_storage: Fix invalid wait context lockdep report
        s390/bpf: Pass through tail call counter in trampolines
        ...
      73be7fb1
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 2ab35ce2
      Linus Torvalds authored
      Pull more devicetree updates from Rob Herring:
       "A couple of conversions which didn't get picked up by the subsystems
        and one fix:
      
         - Convert st,stih407-irq-syscfg and Omnivision OV7251 bindings to DT
           schema
      
         - Merge Omnivision OV5695 into OV5693 binding
      
         - Fix of_overlay_fdt_apply prototype when !CONFIG_OF_OVERLAY"
      
      * tag 'devicetree-fixes-for-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        dt-bindings: irqchip: convert st,stih407-irq-syscfg to DT schema
        media: dt-bindings: Convert Omnivision OV7251 to DT schema
        media: dt-bindings: Merge OV5695 into OV5693 binding
        of: overlay: Fix of_overlay_fdt_apply prototype when !CONFIG_OF_OVERLAY
      2ab35ce2
    • Linus Torvalds's avatar
      Merge tag 'pwm/for-6.6-rc1' of... · 8d844b35
      Linus Torvalds authored
      Merge tag 'pwm/for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
      
      Pull pwm updates from Thierry Reding:
       "Various cleanups and fixes across the board"
      
      * tag 'pwm/for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (31 commits)
        pwm: lpc32xx: Remove handling of PWM channels
        pwm: atmel: Simplify using devm functions
        dt-bindings: pwm: brcm,kona-pwm: convert to YAML
        pwm: stmpe: Handle errors when disabling the signal
        pwm: stm32: Simplify using devm_pwmchip_add()
        pwm: stm32: Don't modify HW state in .remove() callback
        pwm: Fix order of freeing resources in pwmchip_remove()
        pwm: ntxec: Use device_set_of_node_from_dev()
        pwm: ntxec: Drop a write-only variable from driver data
        pwm: pxa: Don't reimplement of_device_get_match_data()
        pwm: lpc18xx-sct: Simplify using devm_clk_get_enabled()
        pwm: atmel-tcb: Don't track polarity in driver data
        pwm: atmel-tcb: Unroll atmel_tcb_pwm_set_polarity() into only caller
        pwm: atmel-tcb: Put per-channel data into driver data
        pwm: atmel-tcb: Fix resource freeing in error path and remove
        pwm: atmel-tcb: Harmonize resource allocation order
        pwm: Drop unused #include <linux/radix-tree.h>
        pwm: rz-mtu3: Fix build warning 'num_channel_ios' not described
        pwm: Remove outdated documentation for pwmchip_remove()
        pwm: atmel: Enable clk when pwm already enabled in bootloader
        ...
      8d844b35
  2. 07 Sep, 2023 13 commits
    • Linus Torvalds's avatar
      Merge tag 'rtc-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · ff6e6ded
      Linus Torvalds authored
      Pull RTC updates from Alexandre Belloni:
       "Subsystem:
      
         - Add a way for drivers to tell the core the supported alarm range is
           smaller than the date range. This is not used yet but will be
           useful for the alarmtimers in the next release.
      
         - fix Wvoid-pointer-to-enum-cast warnings
      
         - remove redundant of_match_ptr()
      
         - stop warning for invalid alarms when the alarm is disabled
      
        Drivers:
      
         - isl12022: allow setting the trip level for battery level detection
      
         - pcf2127: add support for PCF2131 and multiple timestamps
      
         - stm32: time precision improvement, many fixes
      
         - twl: NVRAM support"
      
      * tag 'rtc-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (73 commits)
        dt-bindings: rtc: ds3231: Remove text binding
        rtc: wm8350: remove unnecessary messages
        rtc: twl: remove unnecessary messages
        rtc: sun6i: remove unnecessary message
        rtc: stop warning for invalid alarms when the alarm is disabled
        rtc: twl: add NVRAM support
        rtc: pcf85363: Allow to wake up system without IRQ
        rtc: m48t86: add DT support for m48t86
        dt-bindings: rtc: Add ST M48T86
        rtc: pcf2127: remove useless check
        rtc: rzn1: Report maximum alarm limit to rtc core
        rtc: ds1305: Report maximum alarm limit to rtc core
        rtc: tps6586x: Report maximum alarm limit to rtc core
        rtc: cmos: Report supported alarm limit to rtc infrastructure
        rtc: cros-ec: Detect and report supported alarm window size
        rtc: Add support for limited alarm timer offsets
        rtc: isl1208: Fix incorrect logic in isl1208_set_xtoscb()
        MAINTAINERS: remove obsolete pattern in RTC SUBSYSTEM section
        rtc: tps65910: Remove redundant dev_warn() and do not check for 0 return after calling platform_get_irq()
        rtc: omap: Do not check for 0 return after calling platform_get_irq()
        ...
      ff6e6ded
    • Linus Torvalds's avatar
      Merge tag 'i3c/for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · e59a698b
      Linus Torvalds authored
      Pull i3c updates from Alexandre Belloni:
       "Core:
         - Fix SETDASA when static and dynamic address are equal
         - Fix cmd_v1 DAA exit criteria
      
        Drivers:
         - svc: allow probing without any device"
      
      * tag 'i3c/for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
        i3c: master: svc: fix probe failure when no i3c device exist
        i3c: master: Fix SETDASA process
        dt-bindings: i3c: Fix description for assigned-address
        i3c: master: svc: Describe member 'saved_regs'
        i3c: master: svc: Do not check for 0 return after calling platform_get_irq()
        i3c/master: cmd_v1: Fix the exit criteria for the daa procedure
        i3c: Explicitly include correct DT includes
      e59a698b
    • Linus Torvalds's avatar
      Merge tag 'regulator-fix-v6.6-merge-window' of... · d9b9ea58
      Linus Torvalds authored
      Merge tag 'regulator-fix-v6.6-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
      
      Pull regulator fixes from Mark Brown:
       "A couple of fixes that came in during the merge window, both driver
        specific - one for a bug that came up in testing, one for a bug due
        to a misreading of the datasheet"
      
      * tag 'regulator-fix-v6.6-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
        regulator: tps6594-regulator: Fix random kernel crash
        regulator: tps6287x: Fix n_voltages
      d9b9ea58
    • Linus Torvalds's avatar
      Merge tag 'spi-fix-v6.6-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 32904dec
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A couple of fixes for the sun6i driver. The patch to reduce DMA RX to
        single byte width all the time is *hopefully* excessively cautious but
        it's unclear which SoCs are affected so the fix just covers everything
        for safety"
      
      * tag 'spi-fix-v6.6-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: sun6i: fix race between DMA RX transfer completion and RX FIFO drain
        spi: sun6i: reduce DMA RX transfer width to single byte
      32904dec
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 0c021834
      Linus Torvalds authored
      Pull kvm updates from Paolo Bonzini:
       "ARM:
      
         - Clean up vCPU targets, always returning generic v8 as the preferred
           target
      
         - Trap forwarding infrastructure for nested virtualization (used for
           traps that are taken from an L2 guest and are needed by the L1
           hypervisor)
      
         - FEAT_TLBIRANGE support to only invalidate specific ranges of
           addresses when collapsing a table PTE to a block PTE. This avoids
           that the guest refills the TLBs again for addresses that aren't
           covered by the table PTE.
      
         - Fix vPMU issues related to handling of PMUver.
      
         - Don't unnecessarily align non-stack allocations in the EL2 VA space
      
         - Drop HCR_VIRT_EXCP_MASK, which was never used...
      
         - Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu
           parameter instead
      
         - Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort()
      
         - Remove prototypes without implementations
      
        RISC-V:
      
         - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest
      
         - Added ONE_REG interface for SATP mode
      
         - Added ONE_REG interface to enable/disable multiple ISA extensions
      
         - Improved error codes returned by ONE_REG interfaces
      
         - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V
      
         - Added get-reg-list selftest for KVM RISC-V
      
        s390:
      
         - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch)
      
           Allows a PV guest to use crypto cards. Card access is governed by
           the firmware and once a crypto queue is "bound" to a PV VM every
           other entity (PV or not) loses access until it is not bound
           anymore. Enablement is done via flags when creating the PV VM.
      
         - Guest debug fixes (Ilya)
      
        x86:
      
         - Clean up KVM's handling of Intel architectural events
      
         - Intel bugfixes
      
         - Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use
           debug registers and generate/handle #DBs
      
         - Clean up LBR virtualization code
      
         - Fix a bug where KVM fails to set the target pCPU during an IRTE
           update
      
         - Fix fatal bugs in SEV-ES intrahost migration
      
         - Fix a bug where the recent (architecturally correct) change to
           reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to
           skip it)
      
         - Retry APIC map recalculation if a vCPU is added/enabled
      
         - Overhaul emergency reboot code to bring SVM up to par with VMX, tie
           the "emergency disabling" behavior to KVM actually being loaded,
           and move all of the logic within KVM
      
         - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the
           TSC ratio MSR cannot diverge from the default when TSC scaling is
           disabled, and clean up related code
      
         - Add a framework to allow "caching" feature flags so that KVM can
           check if the guest can use a feature without needing to search
           guest CPUID
      
         - Rip out the ancient MMU_DEBUG crud and replace the useful bits with
           CONFIG_KVM_PROVE_MMU
      
         - Fix KVM's handling of !visible guest roots to avoid premature
           triple fault injection
      
         - Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the
           API surface that is needed by external users (currently only
           KVMGT), and fix a variety of issues in the process
      
        Generic:
      
         - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier
           events to pass action specific data without needing to constantly
           update the main handlers.
      
         - Drop unused function declarations
      
        Selftests:
      
         - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
      
         - Add support for printf() in guest code and convert all guest asserts
           to use printf-based reporting
      
         - Clean up the PMU event filter test and add new testcases
      
         - Include x86 selftests in the KVM x86 MAINTAINERS entry"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (279 commits)
        KVM: x86/mmu: Include mmu.h in spte.h
        KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
        KVM: x86/mmu: Disallow guest from using !visible slots for page tables
        KVM: x86/mmu: Harden TDP MMU iteration against root w/o shadow page
        KVM: x86/mmu: Harden new PGD against roots without shadow pages
        KVM: x86/mmu: Add helper to convert root hpa to shadow page
        drm/i915/gvt: Drop final dependencies on KVM internal details
        KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers
        KVM: x86/mmu: Drop @slot param from exported/external page-track APIs
        KVM: x86/mmu: Bug the VM if write-tracking is used but not enabled
        KVM: x86/mmu: Assert that correct locks are held for page write-tracking
        KVM: x86/mmu: Rename page-track APIs to reflect the new reality
        KVM: x86/mmu: Drop infrastructure for multiple page-track modes
        KVM: x86/mmu: Use page-track notifiers iff there are external users
        KVM: x86/mmu: Move KVM-only page-track declarations to internal header
        KVM: x86: Remove the unused page-track hook track_flush_slot()
        drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region()
        KVM: x86: Add a new page-track hook to handle memslot deletion
        drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot
        KVM: x86: Reject memslot MOVE operations if KVMGT is attached
        ...
      0c021834
    • Vladimir Oltean's avatar
      net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs() · 1b36955c
      Vladimir Oltean authored
      enetc_psi_create() returns an ERR_PTR() or a valid station interface
      pointer, but checking for the non-NULL quality of the return code blurs
      that difference away. So if enetc_psi_create() fails, we call
      enetc_psi_destroy() when we shouldn't. This will likely result in
      crashes, since enetc_psi_create() cleans up everything after itself when
      it returns an ERR_PTR().
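
The pitfall described above can be reproduced in userspace. The following is a minimal sketch, not the driver code: `ERR_PTR()`, `IS_ERR()` and `PTR_ERR()` are simplified stand-ins for the kernel helpers of the same name, while `struct psi`, `psi_create()`, `should_destroy_buggy()` and `should_destroy_fixed()` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error codes live in the topmost MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical object standing in for a station interface. */
struct psi { int id; };

/* Returns a valid pointer on success, or ERR_PTR(-ENOMEM) on failure. */
static struct psi *psi_create(int force_failure)
{
	struct psi *p;

	if (force_failure)
		return ERR_PTR(-ENOMEM);
	p = malloc(sizeof(*p));
	if (!p)
		return ERR_PTR(-ENOMEM);
	p->id = 1;
	return p;
}

/* Buggy check: an error pointer is non-NULL, so teardown would still run. */
static int should_destroy_buggy(const struct psi *p) { return p != NULL; }

/* Fixed check: teardown runs only for a real object. */
static int should_destroy_fixed(const struct psi *p) { return !IS_ERR(p); }
```

For a failed `psi_create()`, `should_destroy_buggy()` returns 1 while `should_destroy_fixed()` returns 0, which is the difference the commit message describes.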
      
      Fixes: f0168042 ("net: enetc: reimplement RFS/RSS memory clearing as PCI quirk")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/netdev/582183ef-e03b-402b-8e2d-6d9bb3c83bd9@moroto.mountain/
      Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230906141609.247579-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1b36955c
    • Jakub Kicinski's avatar
      Revert "net: team: do not use dynamic lockdep key" · 6afcf0fb
      Jakub Kicinski authored
      This reverts commit 39285e12.
      
      Looks like the change has unintended consequences in exposing
      objects before they are initialized. Let's drop this patch
      and try again in net-next.
      
      Reported-by: syzbot+44ae022028805f4600fc@syzkaller.appspotmail.com
      Fixes: 39285e12 ("net: team: do not use dynamic lockdep key")
      Link: https://lore.kernel.org/all/20230907103124.6adb7256@kernel.org/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6afcf0fb
    • Linus Torvalds's avatar
      Merge tag 's390-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 4a0fc73d
      Linus Torvalds authored
      Pull more s390 updates from Heiko Carstens:
      
       - A couple of virtual vs physical address confusion fixes
      
       - Rework locking in dcssblk driver to address a lockdep warning
      
       - Remove support for "noexec" kernel command line option since there is
         no use case where it would make sense
      
       - Simplify kernel mapping setup and get rid of quite a bit of code
      
       - Add architecture specific __set_memory_yy() functions which allow us
         to modify kernel mappings. Unlike the set_memory_xx() variants they
         take void pointer start and end parameters, which allows using them
         without the usual casts, and also to use them on areas larger than
         8TB.
      
         Note that the set_memory_xx() family comes with an int num_pages
         parameter which overflows with 8TB. This could be addressed by
         changing the num_pages parameter to unsigned long, but that would
         require changing all architectures, since the module code expects
         an int parameter (see module_set_memory()).
      
         This was indeed an issue since for debug_pagealloc() we call
         set_memory_4k() on the whole identity mapping. Therefore address this
         for now with the __set_memory_yy() variant, and address common code
         later
      
       - Use dev_set_name() and also fix memory leak in zcrypt driver error
         handling
      
       - Remove unused lsi_mask from airq_struct
      
       - Add warning for invalid kernel mapping requests
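
The 8TB limit mentioned for the set_memory_xx() family follows from simple arithmetic. The sketch below assumes 4 KiB pages (`PAGE_SHIFT` of 12) and a 32-bit `int`; `bytes_to_pages()` and `fits_in_int()` are hypothetical helpers written for this illustration, not kernel functions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. 2^12 bytes per page. */
#define PAGE_SHIFT 12

/* Number of whole pages covered by a byte range of the given size. */
static uint64_t bytes_to_pages(uint64_t bytes)
{
	return bytes >> PAGE_SHIFT;
}

/* True if the page count can be passed through an int parameter. */
static int fits_in_int(uint64_t num_pages)
{
	return num_pages <= (uint64_t)INT_MAX;
}
```

With these assumptions, 8 TiB is 2^43 bytes, so it covers 2^43 / 2^12 = 2^31 pages, exactly one more than the largest value a 32-bit `int` can hold. A range that is a single page shorter still fits.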
      
      * tag 's390-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/vmem: do not silently ignore mapping limit
        s390/zcrypt: utilize dev_set_name() ability to use a formatted string
        s390/zcrypt: don't leak memory if dev_set_name() fails
        s390/mm: fix MAX_DMA_ADDRESS physical vs virtual confusion
        s390/airq: remove lsi_mask from airq_struct
        s390/mm: use __set_memory() variants where useful
        s390/set_memory: add __set_memory() variant
        s390/set_memory: generate all set_memory() functions
        s390/mm: improve description of mapping permissions of prefix pages
        s390/amode31: change type of __samode31, __eamode31, etc
        s390/mm: simplify kernel mapping setup
        s390: remove "noexec" option
        s390/vmem: fix virtual vs physical address confusion
        s390/dcssblk: fix lockdep warning
        s390/monreader: fix virtual vs physical address confusion
      4a0fc73d
    • Linus Torvalds's avatar
      Merge tag 'mips_6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · ac2224a4
      Linus Torvalds authored
      Pull MIPS updates from Thomas Bogendoerfer:
       "Just cleanups and fixes"
      
      * tag 'mips_6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: TXx9: Do PCI error checks on own line
        arch/mips/configs/*_defconfig cleanup
        MIPS: VDSO: Conditionally export __vdso_gettimeofday()
        Mips: loongson3_defconfig: Enable ast drm driver by default
        mips: remove <asm/export.h>
        mips: replace #include <asm/export.h> with #include <linux/export.h>
        mips: remove unneeded #include <asm/export.h>
        MIPS: Loongson64: Fix more __iomem attributes
        MIPS: loongson32: Remove regs-rtc.h
        MIPS: loongson32: Remove regs-clk.h
        MIPS: More explicit DT include clean-ups
        MIPS: Fixup explicit DT include clean-up
        Revert MIPS: Loongson: Fix build error when make modules_install
        MIPS: Only fiddle with CHECKFLAGS if `need-compiler'
        MIPS: Fix CONFIG_CPU_DADDI_WORKAROUNDS `modules_install' regression
        MIPS: Explicitly include correct DT includes
      ac2224a4
    • Linus Torvalds's avatar
      Merge tag 'xtensa-20230905' of https://github.com/jcmvbkbc/linux-xtensa · dd1386dd
      Linus Torvalds authored
      Pull xtensa updates from Max Filippov:
      
       - enable MTD XIP support
      
       - fix base address of the xtensa perf module in newer hardware
      
      * tag 'xtensa-20230905' of https://github.com/jcmvbkbc/linux-xtensa:
        xtensa: add XIP-aware MTD support
        xtensa: PMU: fix base address for the newer hardware
      dd1386dd
    • Christian Brauner's avatar
      ntfs3: drop inode references in ntfs_put_super() · 78a06688
      Christian Brauner authored
      Recently we moved most cleanup from ntfs_put_super() into
      ntfs3_kill_sb() as part of a bigger cleanup.  This accidentally also moved
      dropping inode references stashed in ntfs3's sb->s_fs_info from
      @sb->put_super() to @sb->kill_sb().  But generic_shutdown_super()
      verifies that there are no busy inodes past sb->put_super().  Fix this
      and disentangle dropping inode references from freeing @sb->s_fs_info.
      
      Fixes: a4f64a30 ("ntfs3: free the sbi in ->kill_sb") # mainline only
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78a06688
    • Linus Torvalds's avatar
      vfs: mostly undo glibc turning 'fstat()' into 'fstatat(AT_EMPTY_PATH)' · 9013c51c
      Linus Torvalds authored
      Mateusz reports that glibc turns 'fstat()' calls into 'fstatat()', and
      that seems to have been going on for quite a long time due to glibc
      having tried to simplify its stat logic into just one point.
      
      This turns out to cause completely unnecessary overhead, where we then
      go off and allocate the kernel side pathname, and actually look up the
      empty path.  Sure, our path lookup is quite optimized, but it still
      causes a fair bit of allocation overhead and a couple of completely
      unnecessary rounds of lockref accesses etc.
      
      This is all hopefully getting fixed in user space, and there is a patch
      floating around for just having glibc use the native fstat() system
      call.  But even with the current situation we can at least improve on
      things by catching the situation and short-circuiting it.
      
      Note that this is still measurably slower than just a plain 'fstat()',
      since just checking that the filename is actually empty is somewhat
      expensive due to inevitable user space access overhead from the kernel
      (ie verifying pointers, and SMAP on x86).  But it's still quite a bit
      faster than actually looking up the path for real.
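
From user space, the two calls discussed here look as follows. This is a small sketch for Linux, assuming a libc that provides `fstatat()`; `same_file_via_both_calls()` is a hypothetical helper, and the fallback definition of `AT_EMPTY_PATH` uses the Linux value in case the headers were included without `_GNU_SOURCE`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fallback in case the libc headers did not expose the flag. */
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/*
 * Stat the same descriptor twice: once with plain fstat(), once with
 * fstatat() and an empty path plus AT_EMPTY_PATH. Returns 1 if both
 * calls succeed and report the same device and inode, 0 otherwise.
 */
static int same_file_via_both_calls(int fd)
{
	struct stat direct, via_path;

	if (fstat(fd, &direct) != 0)
		return 0;
	if (fstatat(fd, "", &via_path, AT_EMPTY_PATH) != 0)
		return 0;
	return direct.st_dev == via_path.st_dev &&
	       direct.st_ino == via_path.st_ino;
}
```

Both calls describe the same file; the difference the commit targets is only how much work the kernel does to get there.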
      
      To quote numbers from Mateusz:
       "Sapphire Rapids, will-it-scale, ops/s
      
        stock fstat	5088199
        patched fstat	7625244	(+49%)
        real fstat	8540383	(+67% / +12%)"
      
      where that 'stock fstat' is the glibc translation of fstat into
      fstatat() with an empty path, the 'patched fstat' is with this short
      circuiting of the path lookup, and the 'real fstat' is the actual native
      fstat() system call with none of this overhead.
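
The percentages in the quoted table can be rechecked from the raw ops/s figures. The helper below, `gain_percent()`, is a hypothetical function written for this check; it truncates toward zero, which appears to be how the quoted values were derived (plain rounding would give slightly higher numbers for the first two).

```c
#include <assert.h>

/*
 * Relative gain of `after` over `before` in whole percent,
 * truncated toward zero. `before` must be positive.
 */
static long long gain_percent(long long before, long long after)
{
	assert(before > 0);
	return (after - before) * 100 / before;
}
```

Feeding in the table values, stock to patched (5088199 to 7625244) gives 49, stock to real (5088199 to 8540383) gives 67, and patched to real (7625244 to 8540383) gives 12, matching the "+49%" and "+67% / +12%" annotations.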
      
      Link: https://lore.kernel.org/lkml/20230903204858.lv7i3kqvw6eamhgz@f/
      Reported-by: Mateusz Guzik <mjguzik@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9013c51c
    • Paolo Abeni's avatar
      Merge tag 'nf-23-09-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 7153a404
      Paolo Abeni authored
      Florian Westphal says:
      
      ====================
      netfilter updates for net
      
      This PR contains nf_tables updates for your *net* tree.
      This time almost all fixes are for old bugs:
      
      First patch fixes a 4-byte stack OOB write, from myself.
      This was broken ever since nftables was switched from 128 to 32bit
      register addressing in v4.1.
      
      2nd patch fixes an out-of-bounds read.
      This has been broken ever since xt_osf got added in 2.6.31, the bug
      was then just moved around during refactoring, from Wander Lairson Costa.
      
      3rd patch adds a missing enum description, from Phil Sutter.
      
      4th patch fixes a UaF in nftables that occurs when userspace adds
      elements with a timeout so small that expiration happens while the
      transaction is still in progress.  Fix from Pablo Neira Ayuso.
      
      Patch 5 fixes a memory out of bounds access, this was
      broken since v4.20. Patch from Kyle Zeng and Jozsef Kadlecsik.
      
      Patch 6 fixes another bogus memory access when building audit
      record. Bug added in the previous pull request, fix from Pablo.
      
      netfilter pull request 2023-09-06
      
      * tag 'nf-23-09-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: Unbreak audit log reset
        netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
        netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
        netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
        netfilter: nfnetlink_osf: avoid OOB read
        netfilter: nftables: exthdr: fix 4-byte stack OOB write
      ====================
      
      Link: https://lore.kernel.org/r/20230906162525.11079-1-fw@strlen.de
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7153a404