1. 30 Jul, 2021 4 commits
  2. 28 Jul, 2021 10 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fc16a532
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-07-29
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 9 non-merge commits during the last 14 day(s) which contain
      a total of 20 files changed, 446 insertions(+), 138 deletions(-).
      
      The main changes are:
      
      1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer.
      
      2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann,
         Piotr Krysiuk and Benedict Schlueter.
      
      3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc16a532
    • Daniel Borkmann's avatar
      bpf: Fix leakage due to insufficient speculative store bypass mitigation · 2039f26f
      Daniel Borkmann authored
      Spectre v4 gadgets make use of memory disambiguation, which is a set of
      techniques that execute memory access instructions, that is, loads and
      stores, out of program order; Intel's optimization manual, section 2.4.4.5:
      
        A load instruction micro-op may depend on a preceding store. Many
        microarchitectures block loads until all preceding store addresses are
        known. The memory disambiguator predicts which loads will not depend on
        any previous stores. When the disambiguator predicts that a load does
        not have such a dependency, the load takes its data from the L1 data
        cache. Eventually, the prediction is verified. If an actual conflict is
        detected, the load and all succeeding instructions are re-executed.
      
      af86ca4e ("bpf: Prevent memory disambiguation attack") tried to mitigate
      this attack by sanitizing the memory locations through preemptive "fast"
      (low latency) stores of zero prior to the actual "slow" (high latency) store
      of a pointer value such that upon dependency misprediction the CPU then
      speculatively executes the load of the pointer value and retrieves the zero
      value instead of the attacker controlled scalar value previously stored at
      that location, meaning, subsequent access in the speculative domain is then
      redirected to the "zero page".
      
      The sanitized preemptive store of zero prior to the actual "slow" store is
      done through a simple ST instruction based on r10 (frame pointer) with
      relative offset to the stack location that the verifier has been tracking
      on the original used register for STX, which does not have to be r10. Thus,
      there are no memory dependencies for this store, since it's only using r10
      and immediate constant of zero; hence af86ca4e /assumed/ a low latency
      operation.
      
      However, a recent attack demonstrated that this mitigation is not sufficient
      since the preemptive store of zero could also be turned into a "slow" store
      and is thus bypassed as well:
      
        [...]
        // r2 = oob address (e.g. scalar)
        // r7 = pointer to map value
        31: (7b) *(u64 *)(r10 -16) = r2
        // r9 will remain "fast" register, r10 will become "slow" register below
        32: (bf) r9 = r10
        // JIT maps BPF reg to x86 reg:
        //  r9  -> r15 (callee saved)
        //  r10 -> rbp
        // train store forward prediction to break dependency link between both r9
        // and r10 by evicting them from the predictor's LRU table.
        33: (61) r0 = *(u32 *)(r7 +24576)
        34: (63) *(u32 *)(r7 +29696) = r0
        35: (61) r0 = *(u32 *)(r7 +24580)
        36: (63) *(u32 *)(r7 +29700) = r0
        37: (61) r0 = *(u32 *)(r7 +24584)
        38: (63) *(u32 *)(r7 +29704) = r0
        39: (61) r0 = *(u32 *)(r7 +24588)
        40: (63) *(u32 *)(r7 +29708) = r0
        [...]
        543: (61) r0 = *(u32 *)(r7 +25596)
        544: (63) *(u32 *)(r7 +30716) = r0
        // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
        // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
        // in hardware registers. rbp becomes slow due to push/pop latency. below is
        // disasm of bpf_ringbuf_output() helper for better visual context:
        //
        // ffffffff8117ee20: 41 54                 push   r12
        // ffffffff8117ee22: 55                    push   rbp
        // ffffffff8117ee23: 53                    push   rbx
        // ffffffff8117ee24: 48 f7 c1 fc ff ff ff  test   rcx,0xfffffffffffffffc
        // ffffffff8117ee2b: 0f 85 af 00 00 00     jne    ffffffff8117eee0 <-- jump taken
        // [...]
        // ffffffff8117eee0: 49 c7 c4 ea ff ff ff  mov    r12,0xffffffffffffffea
        // ffffffff8117eee7: 5b                    pop    rbx
        // ffffffff8117eee8: 5d                    pop    rbp
        // ffffffff8117eee9: 4c 89 e0              mov    rax,r12
        // ffffffff8117eeec: 41 5c                 pop    r12
        // ffffffff8117eeee: c3                    ret
        545: (18) r1 = map[id:4]
        547: (bf) r2 = r7
        548: (b7) r3 = 0
        549: (b7) r4 = 4
        550: (85) call bpf_ringbuf_output#194288
        // instruction 551 inserted by verifier    \
        551: (7a) *(u64 *)(r10 -16) = 0            | /both/ are now slow stores here
        // storing map value pointer r7 at fp-16   | since value of r10 is "slow".
        552: (7b) *(u64 *)(r10 -16) = r7           /
        // following "fast" read to the same memory location, but due to dependency
        // misprediction it will speculatively execute before insn 551/552 completes.
        553: (79) r2 = *(u64 *)(r9 -16)
        // in speculative domain contains attacker controlled r2. in non-speculative
        // domain this contains r7, and thus accesses r7 +0 below.
        554: (71) r3 = *(u8 *)(r2 +0)
        // leak r3
      
      As can be seen, the current speculative store bypass mitigation which the
      verifier inserts at line 551 is insufficient since /both/, the write of
      the zero sanitation as well as the map value pointer are a high latency
      instruction due to prior memory access via push/pop of r10 (rbp) in contrast
      to the low latency read in line 553 as r9 (r15) which stays in hardware
      registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
      fp-16 can still be r2.
      
      Initial thoughts to address this issue was to track spilled pointer loads
      from stack and enforce their load via LDX through r10 as well so that /both/
      the preemptive store of zero /as well as/ the load use the /same/ register
      such that a dependency is created between the store and load. However, this
      option is not sufficient either since it can be bypassed as well under
      speculation. An updated attack with pointer spill/fills now _all_ based on
      r10 would look as follows:
      
        [...]
        // r2 = oob address (e.g. scalar)
        // r7 = pointer to map value
        [...]
        // longer store forward prediction training sequence than before.
        2062: (61) r0 = *(u32 *)(r7 +25588)
        2063: (63) *(u32 *)(r7 +30708) = r0
        2064: (61) r0 = *(u32 *)(r7 +25592)
        2065: (63) *(u32 *)(r7 +30712) = r0
        2066: (61) r0 = *(u32 *)(r7 +25596)
        2067: (63) *(u32 *)(r7 +30716) = r0
        // store the speculative load address (scalar) this time after the store
        // forward prediction training.
        2068: (7b) *(u64 *)(r10 -16) = r2
        // preoccupy the CPU store port by running sequence of dummy stores.
        2069: (63) *(u32 *)(r7 +29696) = r0
        2070: (63) *(u32 *)(r7 +29700) = r0
        2071: (63) *(u32 *)(r7 +29704) = r0
        2072: (63) *(u32 *)(r7 +29708) = r0
        2073: (63) *(u32 *)(r7 +29712) = r0
        2074: (63) *(u32 *)(r7 +29716) = r0
        2075: (63) *(u32 *)(r7 +29720) = r0
        2076: (63) *(u32 *)(r7 +29724) = r0
        2077: (63) *(u32 *)(r7 +29728) = r0
        2078: (63) *(u32 *)(r7 +29732) = r0
        2079: (63) *(u32 *)(r7 +29736) = r0
        2080: (63) *(u32 *)(r7 +29740) = r0
        2081: (63) *(u32 *)(r7 +29744) = r0
        2082: (63) *(u32 *)(r7 +29748) = r0
        2083: (63) *(u32 *)(r7 +29752) = r0
        2084: (63) *(u32 *)(r7 +29756) = r0
        2085: (63) *(u32 *)(r7 +29760) = r0
        2086: (63) *(u32 *)(r7 +29764) = r0
        2087: (63) *(u32 *)(r7 +29768) = r0
        2088: (63) *(u32 *)(r7 +29772) = r0
        2089: (63) *(u32 *)(r7 +29776) = r0
        2090: (63) *(u32 *)(r7 +29780) = r0
        2091: (63) *(u32 *)(r7 +29784) = r0
        2092: (63) *(u32 *)(r7 +29788) = r0
        2093: (63) *(u32 *)(r7 +29792) = r0
        2094: (63) *(u32 *)(r7 +29796) = r0
        2095: (63) *(u32 *)(r7 +29800) = r0
        2096: (63) *(u32 *)(r7 +29804) = r0
        2097: (63) *(u32 *)(r7 +29808) = r0
        2098: (63) *(u32 *)(r7 +29812) = r0
        // overwrite scalar with dummy pointer; same as before, also including the
        // sanitation store with 0 from the current mitigation by the verifier.
        2099: (7a) *(u64 *)(r10 -16) = 0         | /both/ are now slow stores here
        2100: (7b) *(u64 *)(r10 -16) = r7        | since store unit is still busy.
        // load from stack intended to bypass stores.
        2101: (79) r2 = *(u64 *)(r10 -16)
        2102: (71) r3 = *(u8 *)(r2 +0)
        // leak r3
        [...]
      
      Looking at the CPU microarchitecture, the scheduler might issue loads (such
      as seen in line 2101) before stores (line 2099,2100) because the load execution
      units become available while the store execution unit is still busy with the
      sequence of dummy stores (line 2069-2098). And so the load may use the prior
      stored scalar from r2 at address r10 -16 for speculation. The updated attack
      may work less reliable on CPU microarchitectures where loads and stores share
      execution resources.
      
      This concludes that the sanitizing with zero stores from af86ca4e ("bpf:
      Prevent memory disambiguation attack") is insufficient. Moreover, the detection
      of stack reuse from af86ca4e where previously data (STACK_MISC) has been
      written to a given stack slot where a pointer value is now to be stored does
      not have sufficient coverage as precondition for the mitigation either; for
      several reasons outlined as follows:
      
       1) Stack content from prior program runs could still be preserved and is
          therefore not "random", best example is to split a speculative store
          bypass attack between tail calls, program A would prepare and store the
          oob address at a given stack slot and then tail call into program B which
          does the "slow" store of a pointer to the stack with subsequent "fast"
          read. From program B PoV such stack slot type is STACK_INVALID, and
          therefore also must be subject to mitigation.
      
       2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
          condition, for example, the previous content of that memory location could
          also be a pointer to map or map value. Without the fix, a speculative
          store bypass is not mitigated in such precondition and can then lead to
          a type confusion in the speculative domain leaking kernel memory near
          these pointer types.
      
      While brainstorming on various alternative mitigation possibilities, we also
      stumbled upon a retrospective from Chrome developers [0]:
      
        [...] For variant 4, we implemented a mitigation to zero the unused memory
        of the heap prior to allocation, which cost about 1% when done concurrently
        and 4% for scavenging. Variant 4 defeats everything we could think of. We
        explored more mitigations for variant 4 but the threat proved to be more
        pervasive and dangerous than we anticipated. For example, stack slots used
        by the register allocator in the optimizing compiler could be subject to
        type confusion, leading to pointer crafting. Mitigating type confusion for
        stack slots alone would have required a complete redesign of the backend of
        the optimizing compiler, perhaps man years of work, without a guarantee of
        completeness. [...]
      
      From BPF side, the problem space is reduced, however, options are rather
      limited. One idea that has been explored was to xor-obfuscate pointer spills
      to the BPF stack:
      
        [...]
        // preoccupy the CPU store port by running sequence of dummy stores.
        [...]
        2106: (63) *(u32 *)(r7 +29796) = r0
        2107: (63) *(u32 *)(r7 +29800) = r0
        2108: (63) *(u32 *)(r7 +29804) = r0
        2109: (63) *(u32 *)(r7 +29808) = r0
        2110: (63) *(u32 *)(r7 +29812) = r0
        // overwrite scalar with dummy pointer; xored with random 'secret' value
        // of 943576462 before store ...
        2111: (b4) w11 = 943576462
        2112: (af) r11 ^= r7
        2113: (7b) *(u64 *)(r10 -16) = r11
        2114: (79) r11 = *(u64 *)(r10 -16)
        2115: (b4) w2 = 943576462
        2116: (af) r2 ^= r11
        // ... and restored with the same 'secret' value with the help of AX reg.
        2117: (71) r3 = *(u8 *)(r2 +0)
        [...]
      
      While the above would not prevent speculation, it would make data leakage
      infeasible by directing it to random locations. In order to be effective
      and prevent type confusion under speculation, such random secret would have
      to be regenerated for each store. The additional complexity involved for a
      tracking mechanism that prevents jumps such that restoring spilled pointers
      would not get corrupted is not worth the gain for unprivileged. Hence, the
      fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
      instruction which the x86 JIT translates into a lfence opcode. Inserting the
      latter in between the store and load instruction is one of the mitigations
      options [1]. The x86 instruction manual notes:
      
        [...] An LFENCE that follows an instruction that stores to memory might
        complete before the data being stored have become globally visible. [...]
      
      The latter meaning that the preceding store instruction finished execution
      and the store is at minimum guaranteed to be in the CPU's store queue, but
      it's not guaranteed to be in that CPU's L1 cache at that point (globally
      visible). The latter would only be guaranteed via sfence. So the load which
      is guaranteed to execute after the lfence for that local CPU would have to
      rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:
      
        [...] For every store operation that is added to the ROB, an entry is
        allocated in the store buffer. This entry requires both the virtual and
        physical address of the target. Only if there is no free entry in the store
        buffer, the frontend stalls until there is an empty slot available in the
        store buffer again. Otherwise, the CPU can immediately continue adding
        subsequent instructions to the ROB and execute them out of order. On Intel
        CPUs, the store buffer has up to 56 entries. [...]
      
      One small upside on the fix is that it lifts constraints from af86ca4e
      where the sanitize_stack_off relative to r10 must be the same when coming
      from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
      or BPF_ST instruction. This happens either when we store a pointer or data
      value to the BPF stack for the first time, or upon later pointer spills.
      The former needs to be enforced since otherwise stale stack data could be
      leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
      BPF_NOSPEC mapping is currently optimized away, but others could emit a
      speculation barrier as well if necessary. For real-world unprivileged
      programs e.g. generated by LLVM, pointer spill/fill is only generated upon
      register pressure and LLVM only tries to do that for pointers which are not
      used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
      sanitation for the STACK_INVALID case when the first write to a stack slot
      occurs e.g. upon map lookup. In future we might refine ways to mitigate
      the latter cost.
      
        [0] https://arxiv.org/pdf/1902.05178.pdf
        [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
        [2] https://arxiv.org/pdf/1905.05725.pdf
      
      Fixes: af86ca4e ("bpf: Prevent memory disambiguation attack")
      Fixes: f7cf25b2 ("bpf: track spill/fill of constants")
      Co-developed-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Co-developed-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Signed-off-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2039f26f
    • Daniel Borkmann's avatar
      bpf: Introduce BPF nospec instruction for mitigating Spectre v4 · f5e81d11
      Daniel Borkmann authored
      In case of JITs, each of the JIT backends compiles the BPF nospec instruction
      /either/ to a machine instruction which emits a speculation barrier /or/ to
      /no/ machine instruction in case the underlying architecture is not affected
      by Speculative Store Bypass or has different mitigations in place already.
      
      This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
      instruction for mitigation. In case of arm64, we rely on the firmware mitigation
      as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
      it works for all of the kernel code with no need to provide any additional
      instructions here (hence only comment in arm64 JIT). Other archs can follow
      as needed. The BPF nospec instruction is specifically targeting Spectre v4
      since i) we don't use a serialization barrier for the Spectre v1 case, and
      ii) mitigation instructions for v1 and v4 might be different on some archs.
      
      The BPF nospec is required for a future commit, where the BPF verifier does
      annotate intermediate BPF programs with speculation barriers.
      Co-developed-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Co-developed-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Signed-off-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f5e81d11
    • Wang Hai's avatar
      sis900: Fix missing pci_disable_device() in probe and remove · 89fb62fd
      Wang Hai authored
      Replace pci_enable_device() with pcim_enable_device(),
      pci_disable_device() and pci_release_regions() will be
      called in release automatically.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarWang Hai <wanghai38@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89fb62fd
    • zhang kai's avatar
      net: let flow have same hash in two directions · 1e60cebf
      zhang kai authored
      using same source and destination ip/port for flow hash calculation
      within the two directions.
      Signed-off-by: default avatarzhang kai <zhangkaiheb@126.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e60cebf
    • Krzysztof Kozlowski's avatar
      nfc: nfcsim: fix use after free during module unload · 5e7b30d2
      Krzysztof Kozlowski authored
      There is a use after free memory corruption during module exit:
       - nfcsim_exit()
        - nfcsim_device_free(dev0)
          - nfc_digital_unregister_device()
            This iterates over command queue and frees all commands,
          - dev->up = false
          - nfcsim_link_shutdown()
            - nfcsim_link_recv_wake()
              This wakes the sleeping thread nfcsim_link_recv_skb().
      
       - nfcsim_link_recv_skb()
         Wake from wait_event_interruptible_timeout(),
         call directly the deb->cb callback even though (dev->up == false),
         - digital_send_cmd_complete()
           Dereference of "struct digital_cmd" cmd which was freed earlier by
           nfc_digital_unregister_device().
      
      This causes memory corruption shortly after (with unrelated stack
      trace):
      
        nfc nfc0: NFC: nfcsim_recv_wq: Device is down
        llcp: nfc_llcp_recv: err -19
        nfc nfc1: NFC: nfcsim_recv_wq: Device is down
        BUG: unable to handle page fault for address: ffffffffffffffed
        Call Trace:
         fsnotify+0x54b/0x5c0
         __fsnotify_parent+0x1fe/0x300
         ? vfs_write+0x27c/0x390
         vfs_write+0x27c/0x390
         ksys_write+0x63/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KASAN report:
      
        BUG: KASAN: use-after-free in digital_send_cmd_complete+0x16/0x50
        Write of size 8 at addr ffff88800a05f720 by task kworker/0:2/71
        Workqueue: events nfcsim_recv_wq [nfcsim]
        Call Trace:
         dump_stack_lvl+0x45/0x59
         print_address_description.constprop.0+0x21/0x140
         ? digital_send_cmd_complete+0x16/0x50
         ? digital_send_cmd_complete+0x16/0x50
         kasan_report.cold+0x7f/0x11b
         ? digital_send_cmd_complete+0x16/0x50
         ? digital_dep_link_down+0x60/0x60
         digital_send_cmd_complete+0x16/0x50
         nfcsim_recv_wq+0x38f/0x3d5 [nfcsim]
         ? nfcsim_in_send_cmd+0x4a/0x4a [nfcsim]
         ? lock_is_held_type+0x98/0x110
         ? finish_wait+0x110/0x110
         ? rcu_read_lock_sched_held+0x9c/0xd0
         ? rcu_read_lock_bh_held+0xb0/0xb0
         ? lockdep_hardirqs_on_prepare+0x12e/0x1f0
      
      This flow of calling digital_send_cmd_complete() callback on driver exit
      is specific to nfcsim which implements reading and sending work queues.
      Since the NFC digital device was unregistered, the callback should not
      be called.
      
      Fixes: 204bddcb ("NFC: nfcsim: Make use of the Digital layer")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e7b30d2
    • Wang Hai's avatar
      tulip: windbond-840: Fix missing pci_disable_device() in probe and remove · 76a16be0
      Wang Hai authored
      Replace pci_enable_device() with pcim_enable_device(),
      pci_disable_device() and pci_release_regions() will be
      called in release automatically.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarWang Hai <wanghai38@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      76a16be0
    • Marcelo Ricardo Leitner's avatar
      sctp: fix return value check in __sctp_rcv_asconf_lookup · 557fb586
      Marcelo Ricardo Leitner authored
      As Ben Hutchings noticed, this check should have been inverted: the call
      returns true in case of success.
      Reported-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Fixes: 0c5dc070 ("sctp: validate from_addr_param return")
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      557fb586
    • Tang Bin's avatar
      nfc: s3fwrn5: fix undefined parameter values in dev_err() · 46573e3a
      Tang Bin authored
      In the function s3fwrn5_fw_download(), the 'ret' is not assigned,
      so the correct value should be given in dev_err function.
      
      Fixes: a0302ff5 ("nfc: s3fwrn5: remove unnecessary label")
      Signed-off-by: default avatarZhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: default avatarTang Bin <tangbin@cmss.chinamobile.com>
      Reviewed-by: default avatarNathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      46573e3a
    • David S. Miller's avatar
      Merge tag 'mlx5-fixes-2021-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 9d0279d0
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2021-07-27
      
      This series introduces some fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9d0279d0
  3. 27 Jul, 2021 20 commits
    • Chris Mi's avatar
      net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32 · 740452e0
      Chris Mi authored
      The offending refactor commit uses u16 chain wrongly. Actually, it
      should be u32.
      
      Fixes: c620b772 ("net/mlx5: Refactor tc flow attributes structure")
      CC: Ariel Levkovich <lariel@nvidia.com>
      Signed-off-by: default avatarChris Mi <cmi@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      740452e0
    • Dima Chumak's avatar
      net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev() · b1c2f631
      Dima Chumak authored
      The result of __dev_get_by_index() is not checked for NULL and then gets
      dereferenced immediately.
      
      Also, __dev_get_by_index() must be called while holding either RTNL lock
      or @dev_base_lock, which isn't satisfied by mlx5e_hairpin_get_mdev() or
      its callers. This makes the underlying hlist_for_each_entry() loop not
      safe, and can have adverse effects in itself.
      
      Fix by using dev_get_by_index() and handling nullptr return value when
      ifindex device is not found. Update mlx5e_hairpin_get_mdev() callers to
      check for possible PTR_ERR() result.
      
      Fixes: 77ab67b7 ("net/mlx5e: Basic setup of hairpin object")
      Addresses-Coverity: ("Dereference null return value")
      Signed-off-by: default avatarDima Chumak <dchumak@nvidia.com>
      Reviewed-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      b1c2f631
    • Aya Levin's avatar
      net/mlx5: Unload device upon firmware fatal error · 7f331bf0
      Aya Levin authored
      When fw_fatal reporter reports an error, the firmware in not responding.
      Unload the device to ensure that the driver closes all its resources,
      even if recovery is not due (user disabled auto-recovery or reporter is
      in grace period). On successful recovery the device is loaded back up.
      
      Fixes: b3bd076f ("net/mlx5: Report devlink health on FW fatal issues")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      7f331bf0
    • Aya Levin's avatar
      net/mlx5e: Fix page allocation failure for ptp-RQ over SF · 678b1ae1
      Aya Levin authored
      Set the correct pci-device pointer to the ptp-RQ. This allows access to
      dma_mask and avoids allocation request with wrong pci-device.
      
      Fixes: a099da8f ("net/mlx5e: Add RQ to PTP channel")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      678b1ae1
    • Aya Levin's avatar
      net/mlx5e: Fix page allocation failure for trap-RQ over SF · 497008e7
      Aya Levin authored
      Set the correct device pointer to the trap-RQ, to allow access to
      dma_mask and avoid allocation request with the wrong pci-dev.
      
      WARNING: CPU: 1 PID: 12005 at kernel/dma/mapping.c:151 dma_map_page_attrs+0x139/0x1c0
      ...
      all Trace:
      <IRQ>
      ? __page_pool_alloc_pages_slow+0x5a/0x210
      mlx5e_post_rx_wqes+0x258/0x400 [mlx5_core]
      mlx5e_trap_napi_poll+0x44/0xc0 [mlx5_core]
      __napi_poll+0x24/0x150
      net_rx_action+0x22b/0x280
      __do_softirq+0xc7/0x27e
      do_softirq+0x61/0x80
      </IRQ>
      __local_bh_enable_ip+0x4b/0x50
      mlx5e_handle_action_trap+0x2dd/0x4d0 [mlx5_core]
      blocking_notifier_call_chain+0x5a/0x80
      mlx5_devlink_trap_action_set+0x8b/0x100 [mlx5_core]
      
      Fixes: 5543e989 ("net/mlx5e: Add trap entity to ETH driver")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      497008e7
    • Aya Levin's avatar
      net/mlx5e: Consider PTP-RQ when setting RX VLAN stripping · a759f845
      Aya Levin authored
      Add PTP-RQ to the loop when setting rx-vlan-offload feature via ethtool.
      On PTP-RQ's creation, set rx-vlan-offload into its parameters.
      
      Fixes: a099da8f ("net/mlx5e: Add RQ to PTP channel")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      a759f845
    • Maxim Mikityanskiy's avatar
      net/mlx5e: Add NETIF_F_HW_TC to hw_features when HTB offload is available · 9841d58f
      Maxim Mikityanskiy authored
      If a feature flag is only present in features, but not in hw_features,
      the user can't reset it. Although hw_features may contain NETIF_F_HW_TC
      by the point where the driver checks whether HTB offload is supported,
      this flag is controlled by another condition that may not hold. Set it
      explicitly to make sure the user can disable it.
      
      Fixes: 214baf22 ("net/mlx5e: Support HTB offload")
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      9841d58f
    • Tariq Toukan's avatar
      net/mlx5e: RX, Avoid possible data corruption when relaxed ordering and LRO combined · e2351e51
      Tariq Toukan authored
      When HW aggregates packets for an LRO session, it writes the payload
      of two consecutive packets of a flow contiguously, so that they usually
      share a cacheline.
      
      The first byte of a packet's payload is written immediately after
      the last byte of the preceding packet.
      In this flow, there are two consecutive write requests to the shared
      cacheline:
      1. Regular write for the earlier packet.
      2. Read-modify-write for the following packet.
      
      In case of relaxed-ordering on, these two writes might be re-ordered.
      Using the end padding optimization (to avoid partial write for the last
      cacheline of a packet) becomes problematic if the two writes occur
      out-of-order, as the padding would overwrite payload that belongs to
      the following packet, causing data corruption.
      
      Avoid this by disabling the end padding optimization when both
      LRO and relaxed-ordering are enabled.
      
      Fixes: 17347d54 ("net/mlx5e: Add support for PCI relaxed ordering")
      Signed-off-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      e2351e51
    • Roi Dayan's avatar
      net/mlx5: E-Switch, handle devcom events only for ports on the same device · dd3fddb8
      Roi Dayan authored
      This is the same check as LAG mode checks if to enable lag.
      This will fix adding peer miss rules if lag is not supported
      and even an incorrect rules in socket direct mode.
      
      Also fix the incorrect comment on mlx5_get_next_phys_dev() as flow #1
      doesn't exists.
      
      Fixes: ac004b83 ("net/mlx5e: E-Switch, Add peer miss rules")
      Signed-off-by: default avatarRoi Dayan <roid@nvidia.com>
      Reviewed-by: default avatarMaor Dickman <maord@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      dd3fddb8
    • Maor Dickman's avatar
      net/mlx5: E-Switch, Set destination vport vhca id only when merged eswitch is supported · c6719725
      Maor Dickman authored
      Destination vport vhca id is valid flag is set only merged eswitch isn't supported.
      Change destination vport vhca id value to be set also only when merged eswitch
      is supported.
      
      Fixes: e4ad91f2 ("net/mlx5e: Split offloaded eswitch TC rules for port mirroring")
      Signed-off-by: default avatarMaor Dickman <maord@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      c6719725
    • Maor Dickman's avatar
      net/mlx5e: Disable Rx ntuple offload for uplink representor · 90b22b9b
      Maor Dickman authored
      Rx ntuple offload is not supported in switchdev mode.
      Tryng to enable it cause kernel panic.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000008
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 80000001065a5067 P4D 80000001065a5067 PUD 106594067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 7 PID: 1089 Comm: ethtool Not tainted 5.13.0-rc7_for_upstream_min_debug_2021_06_23_16_44 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:mlx5e_arfs_enable+0x70/0xd0 [mlx5_core]
       Code: 44 24 10 00 00 00 00 48 c7 44 24 18 00 00 00 00 49 63 c4 48 89 e2 44 89 e6 48 69 c0 20 08 00 00 48 89 ef 48 03 85 68 ac 00 00 <48> 8b 40 08 48 89 44 24 08 e8 d2 aa fd ff 48 83 05 82 96 18 00 01
       RSP: 0018:ffff8881047679e0 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000004000000000 RCX: 0000004000000000
       RDX: ffff8881047679e0 RSI: 0000000000000000 RDI: ffff888115100880
       RBP: ffff888115100880 R08: ffffffffa00f6cb0 R09: ffff888104767a18
       R10: ffff8881151000a0 R11: ffff888109479540 R12: 0000000000000000
       R13: ffff888104767bb8 R14: ffff888115100000 R15: ffff8881151000a0
       FS:  00007f41a64ab740(0000) GS:ffff8882f5dc0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 0000000104cbc005 CR4: 0000000000370ea0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        set_feature_arfs+0x1e/0x40 [mlx5_core]
        mlx5e_handle_feature+0x43/0xa0 [mlx5_core]
        mlx5e_set_features+0x139/0x1b0 [mlx5_core]
        __netdev_update_features+0x2b3/0xaf0
        ethnl_set_features+0x176/0x3a0
        ? __nla_parse+0x22/0x30
        genl_family_rcv_msg_doit+0xe2/0x140
        genl_rcv_msg+0xde/0x1d0
        ? features_reply_size+0xe0/0xe0
        ? genl_get_cmd+0xd0/0xd0
        netlink_rcv_skb+0x4e/0xf0
        genl_rcv+0x24/0x40
        netlink_unicast+0x1f6/0x2b0
        netlink_sendmsg+0x225/0x450
        sock_sendmsg+0x33/0x40
        __sys_sendto+0xd4/0x120
        ? __sys_recvmsg+0x4e/0x90
        ? exc_page_fault+0x219/0x740
        __x64_sys_sendto+0x25/0x30
        do_syscall_64+0x3f/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f41a65b0cba
       Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
       RSP: 002b:00007ffd8d688358 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 00000000010f42a0 RCX: 00007f41a65b0cba
       RDX: 0000000000000058 RSI: 00000000010f43b0 RDI: 0000000000000003
       RBP: 000000000047ae60 R08: 00007f41a667c000 R09: 000000000000000c
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000010f4340
       R13: 00000000010f4350 R14: 00007ffd8d688400 R15: 00000000010f42a0
       Modules linked in: mlx5_vdpa vhost_iotlb vdpa xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core ptp pps_core fuse
       CR2: 0000000000000008
       ---[ end trace c66523f2aba94b43 ]---
      
      Fixes: 7a9fb35e ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
      Signed-off-by: default avatarMaor Dickman <maord@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      90b22b9b
    • Maor Gottlieb's avatar
      net/mlx5: Fix flow table chaining · 8b54874e
      Maor Gottlieb authored
      Fix a bug when flow table is created in priority that already
      has other flow tables as shown in the below diagram.
      If the new flow table (FT-B) has the lowest level in the priority,
      we need to connect the flow tables from the previous priority (p0)
      to this new table. In addition when this flow table is destroyed
      (FT-B), we need to connect the flow tables from the previous
      priority (p0) to the next level flow table (FT-C) in the same
      priority of the destroyed table (if exists).
      
                             ---------
                             |root_ns|
                             ---------
                                  |
                  --------------------------------
                  |               |              |
             ----------      ----------      ---------
             |p(prio)-x|     |   p-y  |      |   p-n |
             ----------      ----------      ---------
                  |               |
           ----------------  ------------------
           |ns(e.g bypass)|  |ns(e.g. kernel) |
           ----------------  ------------------
                  |            |           |
      	-------	       ------       ----
              |  p0 |        | p1 |       |p2|
              -------        ------       ----
                 |             |    \
              --------       ------- ------
              | FT-A |       |FT-B | |FT-C|
              --------       ------- ------
      
      Fixes: f90edfd2 ("net/mlx5_core: Connect flow tables")
      Signed-off-by: default avatarMaor Gottlieb <maorg@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      8b54874e
    • Andrii Nakryiko's avatar
      Merge branch 'sockmap fixes picked up by stress tests' · f1fdee33
      Andrii Nakryiko authored
      John Fastabend says:
      
      ====================
      
      Running stress tests with recent patch to remove an extra lock in sockmap
      resulted in a couple new issues popping up. It seems only one of them
      is actually related to the patch:
      
      799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
      
      The other two issues had existed long before, but I guess the timing
      with the serialization we had before was too tight to get any of
      our tests or deployments to hit it.
      
      With attached series stress testing sockmap+TCP with workloads that
      create lots of short-lived connections no more splats like below were
      seen on upstream bpf branch.
      
      [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
      [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
      [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
      [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
      [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
      [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
      [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
      [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
      [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
      [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
      [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
      [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
      [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
      [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [224913.936007] Call Trace:
      [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
      [224913.936033]  __tcp_close+0x620/0x790
      [224913.936047]  tcp_close+0x20/0x80
      [224913.936056]  inet_release+0x8f/0xf0
      [224913.936070]  __sock_release+0x72/0x120
      
      v3: make sock_drop inline in skmsg.h
      v2: init skb to null and fix a space/tab issue. Added Jakub's acks.
      ====================
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      f1fdee33
    • John Fastabend's avatar
      bpf, sockmap: Fix memleak on ingress msg enqueue · 9635720b
      John Fastabend authored
      If backlog handler is running during a tear down operation we may enqueue
      data on the ingress msg queue while tear down is trying to free it.
      
       sk_psock_backlog()
         sk_psock_handle_skb()
           skb_psock_skb_ingress()
             sk_psock_skb_ingress_enqueue()
               sk_psock_queue_msg(psock,msg)
                                                 spin_lock(ingress_lock)
                                                  sk_psock_zap_ingress()
                                                   _sk_psock_purge_ingerss_msg()
                                                    _sk_psock_purge_ingress_msg()
                                                  -- free ingress_msg list --
                                                 spin_unlock(ingress_lock)
                 spin_lock(ingress_lock)
                 list_add_tail(msg,ingress_msg) <- entry on list with no one
                                                   left to free it.
                 spin_unlock(ingress_lock)
      
      To fix we only enqueue from backlog if the ENABLED bit is set. The tear
      down logic clears the bit with ingress_lock set so we wont enqueue the
      msg in the last step.
      
      Fixes: 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com
      9635720b
    • John Fastabend's avatar
      bpf, sockmap: On cleanup we additionally need to remove cached skb · 476d9801
      John Fastabend authored
      Its possible if a socket is closed and the receive thread is under memory
      pressure it may have cached a skb. We need to ensure these skbs are
      free'd along with the normal ingress_skb queue.
      
      Before 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
      down and backlog processing both had sock_lock for the common case of
      socket close or unhash. So it was not possible to have both running in
      parrallel so all we would need is the kfree in those kernels.
      
      But, latest kernels include the commit 799aa7f98d5e and this requires a
      bit more work. Without the ingress_lock guarding reading/writing the
      state->skb case its possible the tear down could run before the state
      update causing it to leak memory or worse when the backlog reads the state
      it could potentially run interleaved with the tear down and we might end up
      free'ing the state->skb from tear down side but already have the reference
      from backlog side. To resolve such races we wrap accesses in ingress_lock
      on both sides serializing tear down and backlog case. In both cases this
      only happens after an EAGAIN error case so having an extra lock in place
      is likely fine. The normal path will skip the locks.
      
      Note, we check state->skb before grabbing lock. This works because
      we can only enqueue with the mutex we hold already. Avoiding a race
      on adding state->skb after the check. And if tear down path is running
      that is also fine if the tear down path then removes state->skb we
      will simply set skb=NULL and the subsequent goto is skipped. This
      slight complication avoids locking in normal case.
      
      With this fix we no longer see this warning splat from tcp side on
      socket close when we hit the above case with redirect to ingress self.
      
      [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
      [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
      [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
      [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
      [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
      [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
      [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
      [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
      [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
      [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
      [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
      [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
      [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
      [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [224913.936007] Call Trace:
      [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
      [224913.936033]  __tcp_close+0x620/0x790
      [224913.936047]  tcp_close+0x20/0x80
      [224913.936056]  inet_release+0x8f/0xf0
      [224913.936070]  __sock_release+0x72/0x120
      [224913.936083]  sock_close+0x14/0x20
      
      Fixes: a136678c ("bpf: sk_msg, zap ingress queue on psock down")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com
      476d9801
    • John Fastabend's avatar
      bpf, sockmap: Zap ingress queues after stopping strparser · 343597d5
      John Fastabend authored
      We don't want strparser to run and pass skbs into skmsg handlers when
      the psock is null. We just sk_drop them in this case. When removing
      a live socket from map it means extra drops that we do not need to
      incur. Move the zap below strparser close to avoid this condition.
      
      This way we stop the stream parser first stopping it from processing
      packets and then delete the psock.
      
      Fixes: a136678c ("bpf: sk_msg, zap ingress queue on psock down")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com
      343597d5
    • Yufeng Mo's avatar
      net: hns3: change the method of obtaining default ptp cycle · 8373cd38
      Yufeng Mo authored
      The ptp cycle is related to the hardware, so it may cause compatibility
      issues if a fixed value is used in driver. Therefore, the method of
      obtaining this value is changed to read from the register rather than
      use a fixed value in driver.
      
      Fixes: 0bf5eb78 ("net: hns3: add support for PTP")
      Signed-off-by: default avatarYufeng Mo <moyufeng@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8373cd38
    • Tang Bin's avatar
      nfc: s3fwrn5: fix undefined parameter values in dev_err() · 801e541c
      Tang Bin authored
      In the function s3fwrn5_fw_download(), the 'ret' is not assigned,
      so the correct value should be given in dev_err function.
      
      Fixes: a0302ff5 ("nfc: s3fwrn5: remove unnecessary label")
      Signed-off-by: default avatarZhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: default avatarTang Bin <tangbin@cmss.chinamobile.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      801e541c
    • Pavel Skripkin's avatar
      net: llc: fix skb_over_panic · c7c9d210
      Pavel Skripkin authored
      Syzbot reported skb_over_panic() in llc_pdu_init_as_xid_cmd(). The
      problem was in wrong LCC header manipulations.
      
      Syzbot's reproducer tries to send XID packet. llc_ui_sendmsg() is
      doing following steps:
      
      	1. skb allocation with size = len + header size
      		len is passed from userpace and header size
      		is 3 since addr->sllc_xid is set.
      
      	2. skb_reserve() for header_len = 3
      	3. filling all other space with memcpy_from_msg()
      
      Ok, at this moment we have fully loaded skb, only headers needs to be
      filled.
      
      Then code comes to llc_sap_action_send_xid_c(). This function pushes 3
      bytes for LLC PDU header and initializes it. Then comes
      llc_pdu_init_as_xid_cmd(). It initalizes next 3 bytes *AFTER* LLC PDU
      header and call skb_push(skb, 3). This looks wrong for 2 reasons:
      
      	1. Bytes rigth after LLC header are user data, so this function
      	   was overwriting payload.
      
      	2. skb_push(skb, 3) call can cause skb_over_panic() since
      	   all free space was filled in llc_ui_sendmsg(). (This can
      	   happen is user passed 686 len: 686 + 14 (eth header) + 3 (LLC
      	   header) = 703. SKB_DATA_ALIGN(703) = 704)
      
      So, in this patch I added 2 new private constansts: LLC_PDU_TYPE_U_XID
      and LLC_PDU_LEN_U_XID. LLC_PDU_LEN_U_XID is used to correctly reserve
      header size to handle LLC + XID case. LLC_PDU_TYPE_U_XID is used by
      llc_pdu_header_init() function to push 6 bytes instead of 3. And finally
      I removed skb_push() call from llc_pdu_init_as_xid_cmd().
      
      This changes should not affect other parts of LLC, since after
      all steps we just transmit buffer.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-and-tested-by: syzbot+5e5a981ad7cc54c4b2b4@syzkaller.appspotmail.com
      Signed-off-by: default avatarPavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c7c9d210
    • Sunil Goutham's avatar
      octeontx2-af: Do NIX_RX_SW_SYNC twice · fcef709c
      Sunil Goutham authored
      NIX_RX_SW_SYNC ensures all existing transactions are finished and
      pkts are written to LLC/DRAM, queues should be teared down after
      successful SW_SYNC. Due to a HW errata, in some rare scenarios
      an existing transaction might end after SW_SYNC operation. To
      ensure operation is fully done, do the SW_SYNC twice.
      Signed-off-by: default avatarSunil Goutham <sgoutham@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fcef709c
  4. 26 Jul, 2021 4 commits
  5. 25 Jul, 2021 2 commits
    • David S. Miller's avatar
      Merge branch 'sctp-pmtu-probe' · 832df96d
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: improve the pmtu probe in Search Complete state
      
      Timo recently suggested to use the loss of (data) packets as
      indication to send pmtu probe for Search Complete state, which
      should also be implied by RFC8899. This patchset is to change
      the current one that is doing probe with current pmtu all the
      time.
      
      v1->v2:
        - see Patch 2/2.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      832df96d
    • Xin Long's avatar
      sctp: send pmtu probe only if packet loss in Search Complete state · eacf078c
      Xin Long authored
      This patch is to introduce last_rtx_chunks into sctp_transport to detect
      if there's any packet retransmission/loss happened by checking against
      asoc's rtx_data_chunks in sctp_transport_pl_send().
      
      If there is, namely, transport->last_rtx_chunks != asoc->rtx_data_chunks,
      the pmtu probe will be sent out. Otherwise, increment the pl.raise_count
      and return when it's in Search Complete state.
      
      With this patch, if in Search Complete state, which is a long period, it
      doesn't need to keep probing the current pmtu unless there's data packet
      loss. This will save quite some traffic.
      
      v1->v2:
        - add the missing Fixes tag.
      
      Fixes: 0dac127c ("sctp: do black hole detection in search complete state")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eacf078c