1. 13 Sep, 2023 2 commits
  2. 11 Sep, 2023 1 commit
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: disallow element removal on anonymous sets · 23a3bfd4
      Pablo Neira Ayuso authored
      Anonymous sets need to be populated once at creation and then they are
      bound to rule since 938154b9 ("netfilter: nf_tables: reject unbound
      anonymous set before commit phase"), otherwise transaction reports
      EINVAL.
      
      Userspace does not need to delete elements of anonymous sets that are
      not yet bound, reject this with EOPNOTSUPP.
      
      From flush command path, skip anonymous sets, they are expected to be
      bound already. Otherwise, EINVAL is hit at the end of this transaction
      for unbound sets.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      23a3bfd4
  3. 08 Sep, 2023 5 commits
  4. 07 Sep, 2023 13 commits
  5. 06 Sep, 2023 19 commits
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: Unbreak audit log reset · 9b5ba5c9
      Pablo Neira Ayuso authored
      Deliver audit log from __nf_tables_dump_rules(), table dereference at
      the end of the table list loop might point to the list head, leading to
      this crash.
      
      [ 4137.407349] BUG: unable to handle page fault for address: 00000000001f3c50
      [ 4137.407357] #PF: supervisor read access in kernel mode
      [ 4137.407359] #PF: error_code(0x0000) - not-present page
      [ 4137.407360] PGD 0 P4D 0
      [ 4137.407363] Oops: 0000 [#1] PREEMPT SMP PTI
      [ 4137.407365] CPU: 4 PID: 500177 Comm: nft Not tainted 6.5.0+ #277
      [ 4137.407369] RIP: 0010:string+0x49/0xd0
      [ 4137.407374] Code: ff 77 36 45 89 d1 31 f6 49 01 f9 66 45 85 d2 75 19 eb 1e 49 39 f8 76 02 88 07 48 83 c7 01 83 c6 01 48 83 c2 01 4c 39 cf 74 07 <0f> b6 02 84 c0 75 e2 4c 89 c2 e9 58 e5 ff ff 48 c7 c0 0e b2 ff 81
      [ 4137.407377] RSP: 0018:ffff8881179737f0 EFLAGS: 00010286
      [ 4137.407379] RAX: 00000000001f2c50 RBX: ffff888117973848 RCX: ffff0a00ffffff04
      [ 4137.407380] RDX: 00000000001f3c50 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4137.407381] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
      [ 4137.407383] R10: ffffffffffffffff R11: ffff88813584d200 R12: 0000000000000000
      [ 4137.407384] R13: ffffffffa15cf709 R14: 0000000000000000 R15: ffffffffa15cf709
      [ 4137.407385] FS:  00007fcfc18bb580(0000) GS:ffff88840e700000(0000) knlGS:0000000000000000
      [ 4137.407387] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4137.407388] CR2: 00000000001f3c50 CR3: 00000001055b2001 CR4: 00000000001706e0
      [ 4137.407390] Call Trace:
      [ 4137.407392]  <TASK>
      [ 4137.407393]  ? __die+0x1b/0x60
      [ 4137.407397]  ? page_fault_oops+0x6b/0xa0
      [ 4137.407399]  ? exc_page_fault+0x60/0x120
      [ 4137.407403]  ? asm_exc_page_fault+0x22/0x30
      [ 4137.407408]  ? string+0x49/0xd0
      [ 4137.407410]  vsnprintf+0x257/0x4f0
      [ 4137.407414]  kvasprintf+0x3e/0xb0
      [ 4137.407417]  kasprintf+0x3e/0x50
      [ 4137.407419]  nf_tables_dump_rules+0x1c0/0x360 [nf_tables]
      [ 4137.407439]  ? __alloc_skb+0xc3/0x170
      [ 4137.407442]  netlink_dump+0x170/0x330
      [ 4137.407447]  __netlink_dump_start+0x227/0x300
      [ 4137.407449]  nf_tables_getrule+0x205/0x390 [nf_tables]
      
      Deliver audit log only once at the end of the rule dump+reset for
      consistency with the set dump+reset.
      
      Ensure audit reset access to table under rcu read side lock. The table
      list iteration holds rcu read lock side, but recent audit code
      dereferences table object out of the rcu read lock side.
      
      Fixes: ea078ae9 ("netfilter: nf_tables: Audit log rule reset")
      Fixes: 7e9be112 ("netfilter: nf_tables: Audit log setelem reset")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: default avatarPhil Sutter <phil@nwl.cc>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      9b5ba5c9
    • Kyle Zeng's avatar
      netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c · 050d91c0
      Kyle Zeng authored
      The missing IP_SET_HASH_WITH_NET0 macro in ip_set_hash_netportnet can
      lead to the use of wrong `CIDR_POS(c)` for calculating array offsets,
      which can lead to integer underflow. As a result, it leads to slab
      out-of-bound access.
      This patch adds back the IP_SET_HASH_WITH_NET0 macro to
      ip_set_hash_netportnet to address the issue.
      
      Fixes: 886503f3 ("netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net")
      Suggested-by: default avatarJozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: default avatarKyle Zeng <zengyhkyle@gmail.com>
      Acked-by: default avatarJozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      050d91c0
    • Pablo Neira Ayuso's avatar
      netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction · 2ee52ae9
      Pablo Neira Ayuso authored
      New elements in this transaction might expired before such transaction
      ends. Skip sync GC for such elements otherwise commit path might walk
      over an already released object. Once transaction is finished, async GC
      will collect such expired element.
      
      Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      2ee52ae9
    • Phil Sutter's avatar
      netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID · fdc04cc2
      Phil Sutter authored
      Add a brief description to the enum's comment.
      
      Fixes: 837830a4 ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute")
      Signed-off-by: default avatarPhil Sutter <phil@nwl.cc>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      fdc04cc2
    • Wander Lairson Costa's avatar
      netfilter: nfnetlink_osf: avoid OOB read · f4f8a780
      Wander Lairson Costa authored
      The opt_num field is controlled by user mode and is not currently
      validated inside the kernel. An attacker can take advantage of this to
      trigger an OOB read and potentially leak information.
      
      BUG: KASAN: slab-out-of-bounds in nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88
      Read of size 2 at addr ffff88804bc64272 by task poc/6431
      
      CPU: 1 PID: 6431 Comm: poc Not tainted 6.0.0-rc4 #1
      Call Trace:
       nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88
       nf_osf_find+0x186/0x2f0 net/netfilter/nfnetlink_osf.c:281
       nft_osf_eval+0x37f/0x590 net/netfilter/nft_osf.c:47
       expr_call_ops_eval net/netfilter/nf_tables_core.c:214
       nft_do_chain+0x2b0/0x1490 net/netfilter/nf_tables_core.c:264
       nft_do_chain_ipv4+0x17c/0x1f0 net/netfilter/nft_chain_filter.c:23
       [..]
      
      Also add validation to genre, subtype and version fields.
      
      Fixes: 11eeef41 ("netfilter: passive OS fingerprint xtables match")
      Reported-by: default avatarLucas Leong <wmliang@infosec.exchange>
      Signed-off-by: default avatarWander Lairson Costa <wander@redhat.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      f4f8a780
    • Florian Westphal's avatar
      netfilter: nftables: exthdr: fix 4-byte stack OOB write · fd94d9da
      Florian Westphal authored
      If priv->len is a multiple of 4, then dst[len / 4] can write past
      the destination array which leads to stack corruption.
      
      This construct is necessary to clean the remainder of the register
      in case ->len is NOT a multiple of the register size, so make it
      conditional just like nft_payload.c does.
      
      The bug was added in 4.1 cycle and then copied/inherited when
      tcp/sctp and ip option support was added.
      
      Bug reported by Zero Day Initiative project (ZDI-CAN-21950,
      ZDI-CAN-21951, ZDI-CAN-21961).
      
      Fixes: 49499c3e ("netfilter: nf_tables: switch registers to 32 bit addressing")
      Fixes: 935b7f64 ("netfilter: nft_exthdr: add TCP option matching")
      Fixes: 133dc203 ("netfilter: nft_exthdr: Support SCTP chunks")
      Fixes: dbb5281a ("netfilter: nf_tables: add support for matching IPv4 options")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      fd94d9da
    • Martin KaFai Lau's avatar
      selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc · a96d1cfb
      Martin KaFai Lau authored
      This patch checks the sk_omem_alloc has been uncharged by bpf_sk_storage
      during the __sk_destruct.
      Signed-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-4-martin.lau@linux.dev
      a96d1cfb
    • Martin KaFai Lau's avatar
      bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc · 55d49f75
      Martin KaFai Lau authored
      The commit c83597fa ("bpf: Refactor some inode/task/sk storage functions
      for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into
      bpf_local_storage_unlink_nolock() which then later renamed to
      bpf_local_storage_destroy(). The commit accidentally passed the
      "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock()
      which then stopped the uncharge from happening to the sk->sk_omem_alloc.
      
      This missing uncharge only happens when the sk is going away (during
      __sk_destruct).
      
      This patch fixes it by always passing "uncharge_mem = true". It is a
      noop to the task/inode/cgroup storage because they do not have the
      map_local_storage_(un)charge enabled in the map_ops. A followup patch
      will be done in bpf-next to remove the uncharge_mem argument.
      
      A selftest is added in the next patch.
      
      Fixes: c83597fa ("bpf: Refactor some inode/task/sk storage functions for reuse")
      Signed-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
      55d49f75
    • Martin KaFai Lau's avatar
      bpf: bpf_sk_storage: Fix invalid wait context lockdep report · a96a44ab
      Martin KaFai Lau authored
      './test_progs -t test_local_storage' reported a splat:
      
      [   27.137569] =============================
      [   27.138122] [ BUG: Invalid wait context ]
      [   27.138650] 6.5.0-03980-gd11ae1b1 #247 Tainted: G           O
      [   27.139542] -----------------------------
      [   27.140106] test_progs/1729 is trying to lock:
      [   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
      [   27.141834] other info that might help us debug this:
      [   27.142437] context-{5:5}
      [   27.142856] 2 locks held by test_progs/1729:
      [   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
      [   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
      [   27.145855] stack backtrace:
      [   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b1 #247
      [   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   27.149127] Call Trace:
      [   27.149490]  <TASK>
      [   27.149867]  dump_stack_lvl+0x130/0x1d0
      [   27.152609]  dump_stack+0x14/0x20
      [   27.153131]  __lock_acquire+0x1657/0x2220
      [   27.153677]  lock_acquire+0x1b8/0x510
      [   27.157908]  local_lock_acquire+0x29/0x130
      [   27.159048]  obj_cgroup_charge+0xf4/0x3c0
      [   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
      [   27.161931]  __kmem_cache_alloc_node+0x51/0x210
      [   27.163557]  __kmalloc+0xaa/0x210
      [   27.164593]  bpf_map_kzalloc+0xbc/0x170
      [   27.165147]  bpf_selem_alloc+0x130/0x510
      [   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
      [   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
      [   27.169199]  bpf_map_update_value+0x415/0x4f0
      [   27.169871]  map_update_elem+0x413/0x550
      [   27.170330]  __sys_bpf+0x5e9/0x640
      [   27.174065]  __x64_sys_bpf+0x80/0x90
      [   27.174568]  do_syscall_64+0x48/0xa0
      [   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [   27.175932] RIP: 0033:0x7effb40e41ad
      [   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
      [   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
      [   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
      [   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
      [   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
      [   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
      [   27.184958]  </TASK>
      
      It complains about acquiring a local_lock while holding a raw_spin_lock.
      It means it should not allocate memory while holding a raw_spin_lock
      since it is not safe for RT.
      
      raw_spin_lock is needed because bpf_local_storage supports tracing
      context. In particular for task local storage, it is easy to
      get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
      However, task (and cgroup) local storage has already been moved to
      bpf mem allocator which can be used after raw_spin_lock.
      
      The splat is for the sk storage. For sk (and inode) storage,
      it has not been moved to bpf mem allocator. Using raw_spin_lock or not,
      kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
      However, the local storage helper requires a verifier accepted
      sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running
      a bpf prog in a kzalloc unsafe context and also able to hold a verifier
      accepted sk pointer) could happen.
      
      This patch avoids kzalloc after raw_spin_lock to silent the splat.
      There is an existing kzalloc before the raw_spin_lock. At that point,
      a kzalloc is very likely required because a lookup has just been done
      before. Thus, this patch always does the kzalloc before acquiring
      the raw_spin_lock and remove the later kzalloc usage after the
      raw_spin_lock. After this change, it will have a charge and then
      uncharge during the syscall bpf_map_update_elem() code path.
      This patch opts for simplicity and not continue the old
      optimization to save one charge and uncharge.
      
      This issue is dated back to the very first commit of bpf_sk_storage
      which had been refactored multiple times to create task, inode, and
      cgroup storage. This patch uses a Fixes tag with a more recent
      commit that should be easier to do backport.
      
      Fixes: b00fa38a ("bpf: Enable non-atomic allocations in local storage")
      Signed-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
      a96a44ab
    • Ilya Leoshkevich's avatar
      s390/bpf: Pass through tail call counter in trampolines · a192103a
      Ilya Leoshkevich authored
      s390x eBPF programs use the following extension to the s390x calling
      convention: tail call counter is passed on stack at offset
      STK_OFF_TCCNT, which callees otherwise use as scratch space.
      
      Currently trampoline does not respect this and clobbers tail call
      counter. This breaks enforcing tail call limits in eBPF programs, which
      have trampolines attached to them.
      
      Fix by forwarding a copy of the tail call counter to the original eBPF
      program in the trampoline (for fexit), and by restoring it at the end
      of the trampoline (for fentry).
      
      Fixes: 528eb2cb ("s390/bpf: Implement arch_prepare_bpf_trampoline()")
      Reported-by: default avatarLeon Hwang <hffilwlqm@gmail.com>
      Signed-off-by: default avatarIlya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230906004448.111674-1-iii@linux.ibm.com
      a192103a
    • Sebastian Andrzej Siewior's avatar
      bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check. · 6764e767
      Sebastian Andrzej Siewior authored
      __bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
      performing the recursion check which means in case of a recursion
      __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
      value.
      
      __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
      after the recursion check which means in case of a recursion
      __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
      look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
      isn't initialized upfront.
      
      Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
      set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
      Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
      a few cycles later.
      
      Fixes: e384c7b7 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
      6764e767
    • Sebastian Andrzej Siewior's avatar
      bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf(). · 7645629f
      Sebastian Andrzej Siewior authored
      If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
      0 without undoing rcu_read_lock_trace(), migrate_disable() or
      decrementing the recursion counter. This is fine in the JIT case because
      the JIT code will jump in the 0 case to the end and invoke the matching
      exit trampoline (__bpf_prog_exit_sleepable_recur()).
      
      This is not the case in kern_sys_bpf() which returns directly to the
      caller with an error code.
      
      Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case.
      
      Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
      7645629f
    • Jakub Kicinski's avatar
      net: phylink: fix sphinx complaint about invalid literal · 1a961e74
      Jakub Kicinski authored
      sphinx complains about the use of "%PHYLINK_PCS_NEG_*":
      
      Documentation/networking/kapi:144: ./include/linux/phylink.h:601: WARNING: Inline literal start-string without end-string.
      Documentation/networking/kapi:144: ./include/linux/phylink.h:633: WARNING: Inline literal start-string without end-string.
      
      These are not valid symbols so drop the '%' prefix.
      
      Alternatively we could use %PHYLINK_PCS_NEG_\* (escape the *)
      or use normal literal ``PHYLINK_PCS_NEG_*`` but there is already
      a handful of un-adorned DEFINE_* in this file.
      
      Fixes: f99d471a ("net: phylink: add PCS negotiation mode")
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarBagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1a961e74
    • David S. Miller's avatar
      Merge branch 'sja1105-fixes' · f8fdd54e
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      tc-cbs offload fixes for SJA1105 DSA
      
      Yanan Yang has pointed out to me that certain tc-cbs offloaded
      configurations do not appear to do any shaping on the LS1021A-TSN board
      (SJA1105T).
      
      This is due to an apparent documentation error that also made its way
      into the driver, which patch 1/3 now fixes.
      
      While investigating and then testing, I've found 2 more bugs, which are
      patches 2/3 and 3/3.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8fdd54e
    • Vladimir Oltean's avatar
      net: dsa: sja1105: complete tc-cbs offload support on SJA1110 · 180a7419
      Vladimir Oltean authored
      The blamed commit left this delta behind:
      
        struct sja1105_cbs_entry {
       -	u64 port;
       -	u64 prio;
       +	u64 port; /* Not used for SJA1110 */
       +	u64 prio; /* Not used for SJA1110 */
        	u64 credit_hi;
        	u64 credit_lo;
        	u64 send_slope;
        	u64 idle_slope;
        };
      
      but did not actually implement tc-cbs offload fully for the new switch.
      The offload is accepted, but it doesn't work.
      
      The difference compared to earlier switch generations is that now, the
      table of CBS shapers is sparse, because there are many more shapers, so
      the mapping between a {port, prio} and a table index is static, rather
      than requiring us to store the port and prio into the sja1105_cbs_entry.
      
      So, the problem is that the code programs the CBS shaper parameters at a
      dynamic table index which is incorrect.
      
      All that needs to be done for SJA1110 CBS shapers to work is to bypass
      the logic which allocates shapers in a dense manner, as for SJA1105, and
      use the fixed mapping instead.
      
      Fixes: 3e77e59b ("net: dsa: sja1105: add support for the SJA1110 switch family")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      180a7419
    • Vladimir Oltean's avatar
      net: dsa: sja1105: fix -ENOSPC when replacing the same tc-cbs too many times · 894cafc5
      Vladimir Oltean authored
      After running command [2] too many times in a row:
      
      [1] $ tc qdisc add dev sw2p0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      [2] $ tc qdisc replace dev sw2p0 parent 1:1 cbs offload 1 \
      	idleslope 120000 sendslope -880000 locredit -1320 hicredit 180
      
      (aka more than priv->info->num_cbs_shapers times)
      
      we start seeing the following error message:
      
      Error: Specified device failed to setup cbs hardware offload.
      
      This comes from the fact that ndo_setup_tc(TC_SETUP_QDISC_CBS) presents
      the same API for the qdisc create and replace cases, and the sja1105
      driver fails to distinguish between the 2. Thus, it always thinks that
      it must allocate the same shaper for a {port, queue} pair, when it may
      instead have to replace an existing one.
      
      Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      894cafc5
    • Vladimir Oltean's avatar
      net: dsa: sja1105: fix bandwidth discrepancy between tc-cbs software and offload · 954ad9bf
      Vladimir Oltean authored
      More careful measurement of the tc-cbs bandwidth shows that the stream
      bandwidth (effectively idleslope) increases, there is a larger and
      larger discrepancy between the rate limit obtained by the software
      Qdisc, and the rate limit obtained by its offloaded counterpart.
      
      The discrepancy becomes so large, that e.g. at an idleslope of 40000
      (40Mbps), the offloaded cbs does not actually rate limit anything, and
      traffic will pass at line rate through a 100 Mbps port.
      
      The reason for the discrepancy is that the hardware documentation I've
      been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states
      about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which
      the credit counter is increased".
      
      Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf
      (for SJA1110), the wording is different: "This field specifies the
      value, in bytes per second times link speed, by which the credit counter
      is increased".
      
      So there's an extra scaling for link speed that the driver is currently
      not accounting for, and apparently (empirically), that link speed is
      expressed in Kbps.
      
      I've pondered whether to pollute the sja1105_mac_link_up()
      implementation with CBS shaper reprogramming, but I don't think it is
      worth it. IMO, the UAPI exposed by tc-cbs requires user space to
      recalculate the sendslope anyway, since the formula for that depends on
      port_transmit_rate (see man tc-cbs), which is not an invariant from tc's
      perspective.
      
      So we use the offload->sendslope and offload->idleslope to deduce the
      original port_transmit_rate from the CBS formula, and use that value to
      scale the offload->sendslope and offload->idleslope to values that the
      hardware understands.
      
      Some numerical data points:
      
       40Mbps stream, max interfering frame size 1500, port speed 100M
       ---------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 40000 sendslope -60000 locredit -900 hicredit 600
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    600                600
       credit_lo    900                900
       send_slope   7500000            75
       idle_slope   5000000            50
      
       40Mbps stream, max interfering frame size 1500, port speed 1G
       -------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 40000 sendslope -960000 locredit -1440 hicredit 60
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    60                 60
       credit_lo    1440               1440
       send_slope   120000000          120
       idle_slope   5000000            5
      
       5.12Mbps stream, max interfering frame size 1522, port speed 100M
       -----------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 5120 sendslope -94880 locredit -1444 hicredit 77
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    77                 77
       credit_lo    1444               1444
       send_slope   11860000           118
       idle_slope   640000             6
      
      Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps.
      
      Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
      Reported-by: default avatarYanan Yang <yanan.yang@nxp.com>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      954ad9bf
    • David S. Miller's avatar
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · ca7cfd73
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Change MIN_TXD and MIN_RXD to allow set rx/tx value between 64 and 80
      
      Olga Zaborska says:
      
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igb, igbvf and igc devices can use as low as 64
      descriptors.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ca7cfd73
    • Bodong Wang's avatar
      mlx5/core: E-Switch, Create ACL FT for eswitch manager in switchdev mode · 34413460
      Bodong Wang authored
      ACL flow table is required in switchdev mode when metadata is enabled,
      driver creates such table when loading each vport. However, not every
      vport is loaded in switchdev mode. Such as ECPF if it's the eswitch manager.
      In this case, ACL flow table is still needed.
      
      To make it modularized, create ACL flow table for eswitch manager as
      default and skip such operations when loading manager vport.
      
      Also, there is no need to load the eswitch manager vport in switchdev mode.
      This means there is no need to load it on regular connect-x HCAs where
      the PF is the eswitch manager. This will avoid creating duplicate ACL
      flow table for host PF vport.
      
      Fixes: 29bcb6e4 ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules")
      Fixes: eb8e9fae ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager")
      Fixes: 5019833d ("net/mlx5: E-switch, Introduce helper function to enable/disable vports")
      Signed-off-by: default avatarBodong Wang <bodong@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34413460