1. 10 Mar, 2020 5 commits
  2. 06 Mar, 2020 2 commits
    • Luke Nelson's avatar
      selftests: bpf: Add test for JMP32 JSET BPF_X with upper bits set · 93e5fbb1
      Luke Nelson authored
      The existing tests attempt to check that JMP32 JSET ignores the upper
      bits in the operand registers. However, the tests missed one such bug in
      the x32 JIT that is only uncovered when a previous instruction pollutes
      the upper 32 bits of the registers.
      
      This patch adds a new test case that catches the bug by first executing
      a 64-bit JSET to pollute the upper 32-bits of the temporary registers,
      followed by a 32-bit JSET which should ignore the upper 32 bits.
      Co-developed-by: default avatarXi Wang <xi.wang@gmail.com>
      Signed-off-by: default avatarXi Wang <xi.wang@gmail.com>
      Signed-off-by: default avatarLuke Nelson <luke.r.nels@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200305234416.31597-2-luke.r.nels@gmail.com
      93e5fbb1
    • Luke Nelson's avatar
      bpf, x32: Fix bug with JMP32 JSET BPF_X checking upper bits · 80f1f850
      Luke Nelson authored
      The current x32 BPF JIT is incorrect for JMP32 JSET BPF_X when the upper
      32 bits of operand registers are non-zero in certain situations.
      
      The problem is in the following code:
      
        case BPF_JMP | BPF_JSET | BPF_X:
        case BPF_JMP32 | BPF_JSET | BPF_X:
        ...
      
        /* and dreg_lo,sreg_lo */
        EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo));
        /* and dreg_hi,sreg_hi */
        EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi));
        /* or dreg_lo,dreg_hi */
        EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi));
      
      This code checks the upper bits of the operand registers regardless if
      the BPF instruction is BPF_JMP32 or BPF_JMP64. Registers dreg_hi and
      dreg_lo are not loaded from the stack for BPF_JMP32, however, they can
      still be polluted with values from previous instructions.
      
      The following BPF program demonstrates the bug. The jset64 instruction
      loads the temporary registers and performs the jump, since ((u64)r7 &
      (u64)r8) is non-zero. The jset32 should _not_ be taken, as the lower
      32 bits are all zero, however, the current JIT will take the branch due
      the pollution of temporary registers from the earlier jset64.
      
        mov64    r0, 0
        ld64     r7, 0x8000000000000000
        ld64     r8, 0x8000000000000000
        jset64   r7, r8, 1
        exit
        jset32   r7, r8, 1
        mov64    r0, 2
        exit
      
      The expected return value of this program is 2; under the buggy x32 JIT
      it returns 0. The fix is to skip using the upper 32 bits for jset32 and
      compare the upper 32 bits for jset64 only.
      
      All tests in test_bpf.ko and selftests/bpf/test_verifier continue to
      pass with this change.
      
      We found this bug using our automated verification tool, Serval.
      
      Fixes: 69f827eb ("x32: bpf: implement jitting of JMP32")
      Co-developed-by: default avatarXi Wang <xi.wang@gmail.com>
      Signed-off-by: default avatarXi Wang <xi.wang@gmail.com>
      Signed-off-by: default avatarLuke Nelson <luke.r.nels@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200305234416.31597-1-luke.r.nels@gmail.com
      80f1f850
  3. 05 Mar, 2020 6 commits
    • Martin KaFai Lau's avatar
      bpf: Do not allow map_freeze in struct_ops map · 849b4d94
      Martin KaFai Lau authored
      struct_ops map cannot support map_freeze.  Otherwise, a struct_ops
      cannot be unregistered from the subsystem.
      
      Fixes: 85d33df3 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200305013454.535397-1-kafai@fb.com
      849b4d94
    • Martin KaFai Lau's avatar
      bpf: Return better error value in delete_elem for struct_ops map · 8e5290e7
      Martin KaFai Lau authored
      The current always succeed behavior in bpf_struct_ops_map_delete_elem()
      is not ideal for userspace tool.  It can be improved to return proper
      error value.
      
      If it is in TOBEFREE, it means unregistration has been already done
      before but it is in progress and waiting for the subsystem to clear
      the refcnt to zero, so -EINPROGRESS.
      
      If it is INIT, it means the struct_ops has not been registered yet,
      so -ENOENT.
      
      Fixes: 85d33df3 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200305013447.535326-1-kafai@fb.com
      8e5290e7
    • Alexei Starovoitov's avatar
      Merge branch 'fix_bpf_send_signal' · a35a76fa
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Commit 8b401f9e ("bpf: implement bpf_send_signal() helper")
      introduced bpf_send_signal() helper and Commit 8482941f
      ("bpf: Add bpf_send_signal_thread() helper") added bpf_send_signal_thread()
      helper. Both helpers try to send a signel to current process or thread.
      
      When bpf_prog is called with scheduler rq_lock held, a deadlock
      could happen since bpf_send_signal() and bpf_send_signal_thread()
      will call group_send_sig_info() which may ultimately want to acquire
      rq_lock() again. This happens in 5.2 and 4.16 based kernels in our
      production environment with perf_sw_event.
      
      In a different scenario, the following is also possible in the last kernel:
        cpu 1:
           do_task_stat <- holding sighand->siglock
           ...
           task_sched_runtime <- trying to grab rq_lock
      
        cpu 2:
           __schedule <- holding rq_lock
           ...
           do_send_sig_info <- trying to grab sighand->siglock
      
      Commit eac9153f ("bpf/stackmap: Fix deadlock with
      rq_lock in bpf_get_stack()") has a similar issue with above
      rq_lock() deadlock. This patch set addressed the issue
      in a similar way. Patch #1 provided kernel solution and
      Patch #2 added a selftest.
      
      Changelogs:
        v2 -> v3:
          . simplify selftest send_signal_sched_switch().
            The previous version has mmap/munmap inherited
            from Song's reproducer. They are not necessary
            in this context.
        v1 -> v2:
          . previous fix using task_work in nmi() is incorrect.
            there is no nmi() involvement here. Using task_work
            in all cases might be a solution. But decided to
            use a similar fix as in Commit eac9153f.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a35a76fa
    • Yonghong Song's avatar
      selftests/bpf: Add send_signal_sched_switch test · c4ef2f32
      Yonghong Song authored
      Added one test, send_signal_sched_switch, to test bpf_send_signal()
      helper triggered by sched/sched_switch tracepoint. This test can be used
      to verify kernel deadlocks fixed by the previous commit. The test itself
      is heavily borrowed from Commit eac9153f ("bpf/stackmap: Fix deadlock
      with rq_lock in bpf_get_stack()").
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200304191105.2796601-1-yhs@fb.com
      c4ef2f32
    • Yonghong Song's avatar
      bpf: Fix deadlock with rq_lock in bpf_send_signal() · 1bc7896e
      Yonghong Song authored
      When experimenting with bpf_send_signal() helper in our production
      environment (5.2 based), we experienced a deadlock in NMI mode:
         #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
         #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
         #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
         #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
         #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
        #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
        #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
        #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
        #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
        #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
        #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
        #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
        #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
        #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
        #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
        #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
        #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
        #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
        #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068
      
      The above call stack is actually very similar to an issue
      reported by Commit eac9153f ("bpf/stackmap: Fix deadlock with
      rq_lock in bpf_get_stack()") by Song Liu. The only difference is
      bpf_send_signal() helper instead of bpf_get_stack() helper.
      
      The above deadlock is triggered with a perf_sw_event.
      Similar to Commit eac9153f, the below almost identical reproducer
      used tracepoint point sched/sched_switch so the issue can be easily caught.
        /* stress_test.c */
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <pthread.h>
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <fcntl.h>
      
        #define THREAD_COUNT 1000
        char *filename;
        void *worker(void *p)
        {
              void *ptr;
              int fd;
              char *pptr;
      
              fd = open(filename, O_RDONLY);
              if (fd < 0)
                      return NULL;
              while (1) {
                      struct timespec ts = {0, 1000 + rand() % 2000};
      
                      ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                      usleep(1);
                      if (ptr == MAP_FAILED) {
                              printf("failed to mmap\n");
                              break;
                      }
                      munmap(ptr, 4096 * 64);
                      usleep(1);
                      pptr = malloc(1);
                      usleep(1);
                      pptr[0] = 1;
                      usleep(1);
                      free(pptr);
                      usleep(1);
                      nanosleep(&ts, NULL);
              }
              close(fd);
              return NULL;
        }
      
        int main(int argc, char *argv[])
        {
              void *ptr;
              int i;
              pthread_t threads[THREAD_COUNT];
      
              if (argc < 2)
                      return 0;
      
              filename = argv[1];
      
              for (i = 0; i < THREAD_COUNT; i++) {
                      if (pthread_create(threads + i, NULL, worker, NULL)) {
                              fprintf(stderr, "Error creating thread\n");
                              return 0;
                      }
              }
      
              for (i = 0; i < THREAD_COUNT; i++)
                      pthread_join(threads[i], NULL);
              return 0;
        }
      and the following command:
        1. run `stress_test /bin/ls` in one windown
        2. hack bcc trace.py with the following change:
           --- a/tools/trace.py
           +++ b/tools/trace.py
           @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
                    __data.tgid = __tgid;
                    __data.pid = __pid;
                    bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
           +        bpf_send_signal(10);
            %s
            %s
                    %s.perf_submit(%s, &__data, sizeof(__data));
        3. in a different window run
           ./trace.py -p $(pidof stress_test) t:sched:sched_switch
      
      The deadlock can be reproduced in our production system.
      
      Similar to Song's fix, the fix is to delay sending signal if
      irqs is disabled to avoid deadlocks involving with rq_lock.
      With this change, my above stress-test in our production system
      won't cause deadlock any more.
      
      I also implemented a scale-down version of reproducer in the
      selftest (a subsequent commit). With latest bpf-next,
      it complains for the following potential deadlock.
        [   32.832450] -> #1 (&p->pi_lock){-.-.}:
        [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
        [   32.833696]        task_rq_lock+0x2c/0xa0
        [   32.834182]        task_sched_runtime+0x59/0xd0
        [   32.834721]        thread_group_cputime+0x250/0x270
        [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
        [   32.835959]        do_task_stat+0x8a7/0xb80
        [   32.836461]        proc_single_show+0x51/0xb0
        ...
        [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
        [   32.840275]        __lock_acquire+0x1358/0x1a20
        [   32.840826]        lock_acquire+0xc7/0x1d0
        [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
        [   32.841916]        __lock_task_sighand+0x79/0x160
        [   32.842465]        do_send_sig_info+0x35/0x90
        [   32.842977]        bpf_send_signal+0xa/0x10
        [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
        [   32.844301]        trace_call_bpf+0x115/0x270
        [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
        [   32.845411]        perf_trace_sched_switch+0x10f/0x180
        [   32.846014]        __schedule+0x45d/0x880
        [   32.846483]        schedule+0x5f/0xd0
        ...
      
        [   32.853148] Chain exists of:
        [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
        [   32.853148]
        [   32.854451]  Possible unsafe locking scenario:
        [   32.854451]
        [   32.855173]        CPU0                    CPU1
        [   32.855745]        ----                    ----
        [   32.856278]   lock(&rq->lock);
        [   32.856671]                                lock(&p->pi_lock);
        [   32.857332]                                lock(&rq->lock);
        [   32.857999]   lock(&(&sighand->siglock)->rlock);
      
        Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
        but it has been held by CPU1 and CPU1 tries to grab &rq->lock
        and cannot get it.
      
        This is not exactly the callstack in our production environment,
        but sympotom is similar and both locks are using spin_lock_irqsave()
        to acquire the lock, and both involves rq_lock. The fix to delay
        sending signal when irq is disabled also fixed this issue.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
      1bc7896e
    • Quentin Monnet's avatar
      mailmap: Update email address · 52e7c083
      Quentin Monnet authored
      My Netronome address is no longer active. I am no maintainer, but
      get_maintainer.pl sometimes returns my name for a small number of
      files (BPF-related). Add an entry to .mailmap for good measure.
      Signed-off-by: default avatarQuentin Monnet <quentin@isovalent.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200226171353.18982-1-quentin@isovalent.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      52e7c083
  4. 04 Mar, 2020 6 commits
    • Dajun Jin's avatar
      drivers/of/of_mdio.c:fix of_mdiobus_register() · 209c65b6
      Dajun Jin authored
      When registers a phy_device successful, should terminate the loop
      or the phy_device would be registered in other addr. If there are
      multiple PHYs without reg properties, it will go wrong.
      Signed-off-by: default avatarDajun Jin <adajunjin@gmail.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      209c65b6
    • Vishal Kulkarni's avatar
      cxgb4: fix checks for max queues to allocate · 116ca924
      Vishal Kulkarni authored
      Hardware can support more than 8 queues currently limited by
      netif_get_num_default_rss_queues(). So, rework and fix checks for max
      number of queues to allocate. The checks should be based on how many are
      actually supported by hardware, OR the number of online cpus; whichever
      is lower.
      
      Fixes: 5952dde7 ("cxgb4: set maximal number of default RSS queues")
      Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>"
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      116ca924
    • Hauke Mehrtens's avatar
      phylink: Improve error message when validate failed · 20d8bb0d
      Hauke Mehrtens authored
      This should improve the error message when the PHY validate in the MAC
      driver failed. I ran into this problem multiple times that I put wrong
      interface values into the device tree and was searching why it is
      failing with -22 (-EINVAL). This should make it easier to spot the
      problem.
      Signed-off-by: default avatarHauke Mehrtens <hauke@hauke-m.de>
      Acked-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20d8bb0d
    • Jonas Gorski's avatar
      net: phy: bcm63xx: fix OOPS due to missing driver name · 43de81b0
      Jonas Gorski authored
      719655a1 ("net: phy: Replace phy driver features u32 with link_mode
      bitmap") was a bit over-eager and also removed the second phy driver's
      name, resulting in a nasty OOPS on registration:
      
      [    1.319854] CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 804dd50c, ra == 804dd4f0
      [    1.330859] Oops[#1]:
      [    1.333138] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.22 #0
      [    1.339217] $ 0   : 00000000 00000001 87ca7f00 805c1874
      [    1.344590] $ 4   : 00000000 00000047 00585000 8701f800
      [    1.349965] $ 8   : 8701f800 804f4a5c 00000003 64726976
      [    1.355341] $12   : 00000001 00000000 00000000 00000114
      [    1.360718] $16   : 87ca7f80 00000000 00000000 80639fe4
      [    1.366093] $20   : 00000002 00000000 806441d0 80b90000
      [    1.371470] $24   : 00000000 00000000
      [    1.376847] $28   : 87c1e000 87c1fda0 80b90000 804dd4f0
      [    1.382224] Hi    : d1c8f8da
      [    1.385180] Lo    : 5518a480
      [    1.388182] epc   : 804dd50c kset_find_obj+0x3c/0x114
      [    1.393345] ra    : 804dd4f0 kset_find_obj+0x20/0x114
      [    1.398530] Status: 10008703 KERNEL EXL IE
      [    1.402833] Cause : 00800008 (ExcCode 02)
      [    1.406952] BadVA : 00000000
      [    1.409913] PrId  : 0002a075 (Broadcom BMIPS4350)
      [    1.414745] Modules linked in:
      [    1.417895] Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000)
      [    1.426214] Stack : 87cec000 80630000 80639370 80640658 80640000 80049af4 80639fe4 8063a0d8
      [    1.434816]         8063a0d8 802ef078 00000002 00000000 806441d0 80b90000 8063a0d8 802ef114
      [    1.443417]         87cea0de 87c1fde0 00000000 804de488 87cea000 8063a0d8 8063a0d8 80334e48
      [    1.452018]         80640000 8063984c 80639bf4 00000000 8065de48 00000001 8063a0d8 80334ed0
      [    1.460620]         806441d0 80b90000 80b90000 802ef164 8065dd70 80620000 80b90000 8065de58
      [    1.469222]         ...
      [    1.471734] Call Trace:
      [    1.474255] [<804dd50c>] kset_find_obj+0x3c/0x114
      [    1.479141] [<802ef078>] driver_find+0x1c/0x44
      [    1.483665] [<802ef114>] driver_register+0x74/0x148
      [    1.488719] [<80334e48>] phy_driver_register+0x9c/0xd0
      [    1.493968] [<80334ed0>] phy_drivers_register+0x54/0xe8
      [    1.499345] [<8001061c>] do_one_initcall+0x7c/0x1f4
      [    1.504374] [<80644ed8>] kernel_init_freeable+0x1d4/0x2b4
      [    1.509940] [<804f4e24>] kernel_init+0x10/0xf8
      [    1.514502] [<80018e68>] ret_from_kernel_thread+0x14/0x1c
      [    1.520040] Code: 1060000c  02202025  90650000 <90810000> 24630001  14250004  24840001  14a0fffb  90650000
      [    1.530061]
      [    1.531698] ---[ end trace d52f1717cd29bdc8 ]---
      
      Fix it by readding the name.
      
      Fixes: 719655a1 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
      Signed-off-by: default avatarJonas Gorski <jonas.gorski@gmail.com>
      Acked-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      43de81b0
    • Jacob Keller's avatar
      devlink: remove trigger command from devlink-region.rst · 70751834
      Jacob Keller authored
      The devlink trigger command does not exist. While rewriting the
      documentation for devlink into the reStructuredText format,
      documentation for the trigger command was accidentally merged in. This
      occurred because the author was also working on a potential extension to
      devlink regions which included this trigger command, and accidentally
      squashed the documentation incorrectly.
      
      Further review eventually settled on using the previously unused "new"
      command instead of creating a new trigger command.
      
      Fix this by removing mention of the trigger command from the
      documentation.
      
      Fixes: 0b0f945f ("devlink: add a file documenting devlink regions", 2020-01-10)
      Noticed-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      70751834
    • Jonathan Neuschäfer's avatar
  5. 03 Mar, 2020 21 commits