An error occurred fetching the project authors.
  1. 19 Apr, 2024 1 commit
  2. 10 Apr, 2024 1 commit
    • Kuan-Wei Chiu's avatar
      net: sched: cake: Optimize the number of function calls and branches in heap construction · d034d02d
      Kuan-Wei Chiu authored
      When constructing a heap, heapify operations are required on all
      non-leaf nodes. Thus, determining the index of the first non-leaf node
      is crucial. In a heap, the left child's index of node i is 2 * i + 1
      and the right child's index is 2 * i + 2. Node CAKE_MAX_TINS *
      CAKE_QUEUES / 2 has its left and right children at indexes
      CAKE_MAX_TINS * CAKE_QUEUES + 1 and CAKE_MAX_TINS * CAKE_QUEUES + 2,
      respectively, which are beyond the heap's range, indicating it as a
      leaf node. Conversely, node CAKE_MAX_TINS * CAKE_QUEUES / 2 - 1 has a
      left child at index CAKE_MAX_TINS * CAKE_QUEUES - 1, confirming its
      non-leaf status. The loop should start from it since it's not a leaf
      node.
      
      By starting the loop from CAKE_MAX_TINS * CAKE_QUEUES / 2 - 1, we
      minimize function calls and branch condition evaluations. This
      adjustment theoretically reduces two function calls (one for
      cake_heapify() and another for cake_heap_get_backlog()) and five branch
      evaluations (one for iterating all non-leaf nodes, one within
      cake_heapify()'s while loop, and three more within the while loop
      with if conditions).
      Signed-off-by: default avatarKuan-Wei Chiu <visitorckw@gmail.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@toke.dk>
      Link: https://lore.kernel.org/r/20240408174716.751069-1-visitorckw@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d034d02d
  3. 02 Feb, 2024 1 commit
    • Michal Koutný's avatar
      net/sched: Add module aliases for cls_,sch_,act_ modules · 241a94ab
      Michal Koutný authored
      No functional change intended, aliases will be used in followup commits.
      Note for backporters: you may need to add aliases also for modules that
      are already removed in mainline kernel but still in your version.
      
      Patches were generated with the help of Coccinelle scripts like:
      
      cat >scripts/coccinelle/misc/tcf_alias.cocci <<EOD
      virtual patch
      virtual report
      
      @ haskernel @
      @@
      
      @ tcf_has_kind depends on report && haskernel @
      identifier ops;
      constant K;
      @@
      
        static struct tcf_proto_ops ops = {
          .kind = K,
          ...
        };
      +char module_alias = K;
      EOD
      
      /usr/bin/spatch -D report --cocci-file scripts/coccinelle/misc/tcf_alias.cocci \
              --dir . \
              -I ./arch/x86/include -I ./arch/x86/include/generated -I ./include \
              -I ./arch/x86/include/uapi -I ./arch/x86/include/generated/uapi \
              -I ./include/uapi -I ./include/generated/uapi \
              --include ./include/linux/compiler-version.h --include ./include/linux/kconfig.h \
              --jobs 8 --chunksize 1 2>/dev/null | \
              sed 's/char module_alias = "\([^"]*\)";/MODULE_ALIAS_NET_CLS("\1");/'
      
      And analogously for:
      
        static struct tc_action_ops ops = {
          .kind = K,
      
        static struct Qdisc_ops ops = {
          .id = K,
      
      (Someone familiar would be able to fit those into one .cocci file
      without sed post processing.)
      Signed-off-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20240201130943.19536-3-mkoutny@suse.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      241a94ab
  4. 10 Jun, 2023 1 commit
  5. 23 Mar, 2023 1 commit
  6. 02 Feb, 2023 1 commit
  7. 19 Oct, 2022 1 commit
    • Zhengchao Shao's avatar
      net: sched: cake: fix null pointer access issue when cake_init() fails · 51f9a892
      Zhengchao Shao authored
      When the default qdisc is cake, if the qdisc of dev_queue fails to be
      inited during mqprio_init(), cake_reset() is invoked to clear
      resources. In this case, the tins is NULL, and it will cause gpf issue.
      
      The process is as follows:
      qdisc_create_dflt()
      	cake_init()
      		q->tins = kvcalloc(...)        --->failed, q->tins is NULL
      	...
      	qdisc_put()
      		...
      		cake_reset()
      			...
      			cake_dequeue_one()
      				b = &q->tins[...]   --->q->tins is NULL
      
      The following is the Call Trace information:
      general protection fault, probably for non-canonical address
      0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
      RIP: 0010:cake_dequeue_one+0xc9/0x3c0
      Call Trace:
      <TASK>
      cake_reset+0xb1/0x140
      qdisc_reset+0xed/0x6f0
      qdisc_destroy+0x82/0x4c0
      qdisc_put+0x9e/0xb0
      qdisc_create_dflt+0x2c3/0x4a0
      mqprio_init+0xa71/0x1760
      qdisc_create+0x3eb/0x1000
      tc_modify_qdisc+0x408/0x1720
      rtnetlink_rcv_msg+0x38e/0xac0
      netlink_rcv_skb+0x12d/0x3a0
      netlink_unicast+0x4a2/0x740
      netlink_sendmsg+0x826/0xcc0
      sock_sendmsg+0xc5/0x100
      ____sys_sendmsg+0x583/0x690
      ___sys_sendmsg+0xe8/0x160
      __sys_sendmsg+0xbf/0x160
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f89e5122d04
      </TASK>
      
      Fixes: 046f6fd5 ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
      Signed-off-by: default avatarZhengchao Shao <shaozhengchao@huawei.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      51f9a892
  8. 11 Oct, 2022 2 commits
    • Jason A. Donenfeld's avatar
      treewide: use get_random_u32() when possible · a251c17a
      Jason A. Donenfeld authored
      The prandom_u32() function has been a deprecated inline wrapper around
      get_random_u32() for several releases now, and compiles down to the
      exact same code. Replace the deprecated wrapper with a direct call to
      the real function. The same also applies to get_random_int(), which is
      just a wrapper around get_random_u32(). This was done as a basic find
      and replace.
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarYury Norov <yury.norov@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
      Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
      Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Acked-by: Helge Deller <deller@gmx.de> # for parisc
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      a251c17a
    • Jason A. Donenfeld's avatar
      treewide: use get_random_{u8,u16}() when possible, part 1 · 7e3cf084
      Jason A. Donenfeld authored
      Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
      simply use the get_random_{u8,u16}() functions, which are faster than
      wasting the additional bytes from a 32-bit value. This was done
      mechanically with this coccinelle script:
      
      @@
      expression E;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u16;
      typedef __be16;
      typedef __le16;
      typedef u8;
      @@
      (
      - (get_random_u32() & 0xffff)
      + get_random_u16()
      |
      - (get_random_u32() & 0xff)
      + get_random_u8()
      |
      - (get_random_u32() % 65536)
      + get_random_u16()
      |
      - (get_random_u32() % 256)
      + get_random_u8()
      |
      - (get_random_u32() >> 16)
      + get_random_u16()
      |
      - (get_random_u32() >> 24)
      + get_random_u8()
      |
      - (u16)get_random_u32()
      + get_random_u16()
      |
      - (u8)get_random_u32()
      + get_random_u8()
      |
      - (__be16)get_random_u32()
      + (__be16)get_random_u16()
      |
      - (__le16)get_random_u32()
      + (__le16)get_random_u16()
      |
      - prandom_u32_max(65536)
      + get_random_u16()
      |
      - prandom_u32_max(256)
      + get_random_u8()
      |
      - E->inet_id = get_random_u32()
      + E->inet_id = get_random_u16()
      )
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u16;
      identifier v;
      @@
      - u16 v = get_random_u32();
      + u16 v = get_random_u16();
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u8;
      identifier v;
      @@
      - u8 v = get_random_u32();
      + u8 v = get_random_u8();
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u16;
      u16 v;
      @@
      -  v = get_random_u32();
      +  v = get_random_u16();
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u8;
      u8 v;
      @@
      -  v = get_random_u32();
      +  v = get_random_u8();
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Examine limits
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value < 256:
              coccinelle.RESULT = cocci.make_ident("get_random_u8")
      elif value < 65536:
              coccinelle.RESULT = cocci.make_ident("get_random_u16")
      else:
              print("Skipping large mask of %s" % (literal))
              cocci.include_match(False)
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      identifier add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       (RESULT() & LITERAL)
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarYury Norov <yury.norov@gmail.com>
      Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      7e3cf084
  9. 23 Sep, 2022 1 commit
  10. 01 Sep, 2022 2 commits
  11. 31 Aug, 2022 1 commit
  12. 07 Jan, 2022 1 commit
  13. 10 Dec, 2021 1 commit
    • Eric Dumazet's avatar
      sch_cake: do not call cake_destroy() from cake_init() · ab443c53
      Eric Dumazet authored
      qdiscs are not supposed to call their own destroy() method
      from init(), because core stack already does that.
      
      syzbot was able to trigger use after free:
      
      DEBUG_LOCKS_WARN_ON(lock->magic != lock)
      WARNING: CPU: 0 PID: 21902 at kernel/locking/mutex.c:586 __mutex_lock_common kernel/locking/mutex.c:586 [inline]
      WARNING: CPU: 0 PID: 21902 at kernel/locking/mutex.c:586 __mutex_lock+0x9ec/0x12f0 kernel/locking/mutex.c:740
      Modules linked in:
      CPU: 0 PID: 21902 Comm: syz-executor189 Not tainted 5.16.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__mutex_lock_common kernel/locking/mutex.c:586 [inline]
      RIP: 0010:__mutex_lock+0x9ec/0x12f0 kernel/locking/mutex.c:740
      Code: 08 84 d2 0f 85 19 08 00 00 8b 05 97 38 4b 04 85 c0 0f 85 27 f7 ff ff 48 c7 c6 20 00 ac 89 48 c7 c7 a0 fe ab 89 e8 bf 76 ba ff <0f> 0b e9 0d f7 ff ff 48 8b 44 24 40 48 8d b8 c8 08 00 00 48 89 f8
      RSP: 0018:ffffc9000627f290 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff88802315d700 RSI: ffffffff815f1db8 RDI: fffff52000c4fe44
      RBP: ffff88818f28e000 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff815ebb5e R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: ffffc9000627f458 R15: 0000000093c30000
      FS:  0000555556abc400(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fda689c3303 CR3: 000000001cfbb000 CR4: 0000000000350ef0
      Call Trace:
       <TASK>
       tcf_chain0_head_change_cb_del+0x2e/0x3d0 net/sched/cls_api.c:810
       tcf_block_put_ext net/sched/cls_api.c:1381 [inline]
       tcf_block_put_ext net/sched/cls_api.c:1376 [inline]
       tcf_block_put+0xbc/0x130 net/sched/cls_api.c:1394
       cake_destroy+0x3f/0x80 net/sched/sch_cake.c:2695
       qdisc_create.constprop.0+0x9da/0x10f0 net/sched/sch_api.c:1293
       tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660
       rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5571
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2496
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x904/0xdf0 net/netlink/af_netlink.c:1921
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2409
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2463
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2492
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f1bb06badb9
      Code: Unable to access opcode bytes at RIP 0x7f1bb06bad8f.
      RSP: 002b:00007fff3012a658 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f1bb06badb9
      RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000003
      R10: 0000000000000003 R11: 0000000000000246 R12: 00007fff3012a688
      R13: 00007fff3012a6a0 R14: 00007fff3012a6e0 R15: 00000000000013c2
       </TASK>
      
      Fixes: 046f6fd5 ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@toke.dk>
      Link: https://lore.kernel.org/r/20211210142046.698336-1-eric.dumazet@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ab443c53
  14. 18 Aug, 2021 1 commit
  15. 29 Jul, 2021 1 commit
  16. 14 Jun, 2021 1 commit
  17. 10 Jun, 2021 1 commit
    • Maxim Mikityanskiy's avatar
      sch_cake: Fix out of bounds when parsing TCP options and header · ba91c49d
      Maxim Mikityanskiy authored
      The TCP option parser in cake qdisc (cake_get_tcpopt and
      cake_tcph_may_drop) could read one byte out of bounds. When the length
      is 1, the execution flow gets into the loop, reads one byte of the
      opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads
      one more byte, which exceeds the length of 1.
      
      This fix is inspired by commit 9609dad2 ("ipv4: tcp_input: fix stack
      out of bounds when parsing TCP options.").
      
      v2 changes:
      
      Added doff validation in cake_get_tcphdr to avoid parsing garbage as TCP
      header. Although it wasn't strictly an out-of-bounds access (memory was
      allocated), garbage values could be read where CAKE expected the TCP
      header if doff was smaller than 5.
      
      Cc: Young Xiao <92siuyang@gmail.com>
      Fixes: 8b713881 ("sch_cake: Add optional ACK filter")
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@nvidia.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba91c49d
  18. 23 Aug, 2020 1 commit
  19. 16 Jul, 2020 2 commits
  20. 03 Jul, 2020 1 commit
    • Toke Høiland-Jørgensen's avatar
      sched: consistently handle layer3 header accesses in the presence of VLANs · d7bf2ebe
      Toke Høiland-Jørgensen authored
      There are a couple of places in net/sched/ that check skb->protocol and act
      on the value there. However, in the presence of VLAN tags, the value stored
      in skb->protocol can be inconsistent based on whether VLAN acceleration is
      enabled. The commit quoted in the Fixes tag below fixed the users of
      skb->protocol to use a helper that will always see the VLAN ethertype.
      
      However, most of the callers don't actually handle the VLAN ethertype, but
      expect to find the IP header type in the protocol field. This means that
      things like changing the ECN field, or parsing diffserv values, stops
      working if there's a VLAN tag, or if there are multiple nested VLAN
      tags (QinQ).
      
      To fix this, change the helper to take an argument that indicates whether
      the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
      make sure to skip all of them, so behaviour is consistent even in QinQ
      mode.
      
      To make the helper usable from the ECN code, move it to if_vlan.h instead
      of pkt_sched.h.
      
      v3:
      - Remove empty lines
      - Move vlan variable definitions inside loop in skb_protocol()
      - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
        bpf_skb_ecn_set_ce()
      
      v2:
      - Use eth_type_vlan() helper in skb_protocol()
      - Also fix code that reads skb->protocol directly
      - Change a couple of 'if/else if' statements to switch constructs to avoid
        calling the helper twice
      Reported-by: default avatarIlya Ponetayev <i.ponetaev@ndmsystems.com>
      Fixes: d8b9605d ("net: sched: fix skb->protocol use in case of accelerated vlan path")
      Signed-off-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7bf2ebe
  21. 30 Jun, 2020 1 commit
    • Petr Machata's avatar
      net: sched: Pass root lock to Qdisc_ops.enqueue · aebe4426
      Petr Machata authored
      A following patch introduces qevents, points in qdisc algorithm where
      packet can be processed by user-defined filters. Should this processing
      lead to a situation where a new packet is to be enqueued on the same port,
      holding the root lock would lead to deadlocks. To solve the issue, qevent
      handler needs to unlock and relock the root lock when necessary.
      
      To that end, add the root lock argument to the qdisc op enqueue, and
      propagate throughout.
      Signed-off-by: default avatarPetr Machata <petrm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aebe4426
  22. 25 Jun, 2020 4 commits
  23. 31 May, 2020 1 commit
    • Toke Høiland-Jørgensen's avatar
      sch_cake: Take advantage of skb->hash where appropriate · b0c19ed6
      Toke Høiland-Jørgensen authored
      While the other fq-based qdiscs take advantage of skb->hash and doesn't
      recompute it if it is already set, sch_cake does not.
      
      This was a deliberate choice because sch_cake hashes various parts of the
      packet header to support its advanced flow isolation modes. However,
      foregoing the use of skb->hash entirely loses a few important benefits:
      
      - When skb->hash is set by hardware, a few CPU cycles can be saved by not
        hashing again in software.
      
      - Tunnel encapsulations will generally preserve the value of skb->hash from
        before the encapsulation, which allows flow-based qdiscs to distinguish
        between flows even though the outer packet header no longer has flow
        information.
      
      It turns out that we can preserve these desirable properties in many cases,
      while still supporting the advanced flow isolation properties of sch_cake.
      This patch does so by reusing the skb->hash value as the flow_hash part of
      the hashing procedure in cake_hash() only in the following conditions:
      
      - If the skb->hash is marked as covering the flow headers (skb->l4_hash is
        set)
      
      AND
      
      - NAT header rewriting is either disabled, or did not change any values
        used for hashing. The latter is important to match local-origin packets
        such as those of a tunnel endpoint.
      
      The immediate motivation for fixing this was the recent patch to WireGuard
      to preserve the skb->hash on encapsulation. As such, this is also what I
      tested against; with this patch, added latency under load for competing
      flows drops from ~8 ms to sub-1ms on an RRUL test over a WireGuard tunnel
      going through a virtual link shaped to 1Gbps using sch_cake. This matches
      the results we saw with a similar setup using sch_fq_codel when testing the
      WireGuard patch.
      
      Fixes: 046f6fd5 ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
      Signed-off-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b0c19ed6
  24. 14 Jan, 2020 1 commit
  25. 03 Jan, 2020 1 commit
    • Wen Yang's avatar
      sch_cake: avoid possible divide by zero in cake_enqueue() · 68aab823
      Wen Yang authored
      The variables 'window_interval' is u64 and do_div()
      truncates it to 32 bits, which means it can test
      non-zero and be truncated to zero for division.
      The unit of window_interval is nanoseconds,
      so its lower 32-bit is relatively easy to exceed.
      Fix this issue by using div64_u64() instead.
      
      Fixes: 7298de9c ("sch_cake: Add ingress mode")
      Signed-off-by: default avatarWen Yang <wenyang@linux.alibaba.com>
      Cc: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: cake@lists.bufferbloat.net
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Acked-by: default avatarToke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      68aab823
  26. 18 Dec, 2019 1 commit
  27. 02 Dec, 2019 1 commit
  28. 27 Apr, 2019 2 commits
    • Johannes Berg's avatar
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg authored
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8cb08174
    • Michal Kubecek's avatar
      netlink: make nla_nest_start() add NLA_F_NESTED flag · ae0be8de
      Michal Kubecek authored
      Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
      netlink based interfaces (including recently added ones) are still not
      setting it in kernel generated messages. Without the flag, message parsers
      not aware of attribute semantics (e.g. wireshark dissector or libmnl's
      mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
      the structure of their contents.
      
      Unfortunately we cannot just add the flag everywhere as there may be
      userspace applications which check nlattr::nla_type directly rather than
      through a helper masking out the flags. Therefore the patch renames
      nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
      as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
      are rewritten to use nla_nest_start().
      
      Except for changes in include/net/netlink.h, the patch was generated using
      this semantic patch:
      
      @@ expression E1, E2; @@
      -nla_nest_start(E1, E2)
      +nla_nest_start_noflag(E1, E2)
      
      @@ expression E1, E2; @@
      -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
      +nla_nest_start(E1, E2)
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ae0be8de
  29. 04 Apr, 2019 2 commits
  30. 15 Mar, 2019 1 commit
    • Toke Høiland-Jørgensen's avatar
      sch_cake: Interpret fwmark parameter as a bitmask · eab2fc82
      Toke Høiland-Jørgensen authored
      We initially interpreted the fwmark parameter as a flag that simply turned
      on the feature, using the whole skb->mark field as the index into the CAKE
      tin_order array. However, it is quite common for different applications to
      use different parts of the mask field for their own purposes, each using a
      different mask.
      
      Support this use of subsets of the mark by interpreting the TCA_CAKE_FWMARK
      parameter as a bitmask to apply to the fwmark field when reading it. The
      result will be right-shifted by the number of unset lower bits of the mask
      before looking up the tin.
      
      In the original commit message we also failed to credit Felix Resch with
      originally suggesting the fwmark feature back in 2017; so the Suggested-By
      in this commit covers the whole fwmark feature.
      
      Fixes: 0b5c7efd ("sch_cake: Permit use of connmarks as tin classifiers")
      Suggested-by: default avatarFelix Resch <fuller@beif.de>
      Signed-off-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eab2fc82
  31. 04 Mar, 2019 2 commits
    • Toke Høiland-Jørgensen's avatar
      sch_cake: Simplify logic in cake_select_tin() · 4976e3c6
      Toke Høiland-Jørgensen authored
      With more modes added the logic in cake_select_tin() was getting a bit
      hairy, and it turns out we can actually simplify it quite a bit. This also
      allows us to get rid of one of the two diffserv parsing functions, which
      has the added benefit that already-zeroed DSCP fields won't get re-written.
      Suggested-by: default avatarKevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4976e3c6
    • Kevin Darbyshire-Bryant's avatar
      sch_cake: Permit use of connmarks as tin classifiers · 0b5c7efd
      Kevin Darbyshire-Bryant authored
      Add flag 'FWMARK' to enable use of firewall connmarks as tin selector.
      The connmark (skbuff->mark) needs to be in the range 1->tin_cnt ie.
      for diffserv3 the mark needs to be 1->3.
      
      Background
      
      Typically CAKE uses DSCP as the basis for tin selection.  DSCP values
      are relatively easily changed as part of the egress path, usually with
      iptables & the mangle table, ingress is more challenging.  CAKE is often
      used on the WAN interface of a residential gateway where passthrough of
      DSCP from the ISP is either missing or set to unhelpful values thus use
      of ingress DSCP values for tin selection isn't helpful in that
      environment.
      
      An approach to solving the ingress tin selection problem is to use
      CAKE's understanding of tc filters.  Naive tc filters could match on
      source/destination port numbers and force tin selection that way, but
      multiple filters don't scale particularly well as each filter must be
      traversed whether it matches or not. e.g. a simple example to map 3
      firewall marks to tins:
      
      MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x01 fw action skbedit priority ${MAJOR}1
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x02 fw action skbedit priority ${MAJOR}2
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x03 fw action skbedit priority ${MAJOR}3
      
      Another option is to use eBPF cls_act with tc filters e.g.
      
      MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
      tc filter add dev $DEV parent $MAJOR bpf da obj my-bpf-fwmark-to-class.o
      
      This has the disadvantages of a) needing someone to write & maintain
      the bpf program, b) a bpf toolchain to compile it and c) needing to
      hardcode the major number in the bpf program so it matches the cake
      instance (or forcing the cake instance to a particular major number)
      since the major number cannot be passed to the bpf program via tc
      command line.
      
      As already hinted at by the previous examples, it would be helpful
      to associate tins with something that survives the Internet path and
      ideally allows tin selection on both egress and ingress.  Netfilter's
      conntrack permits setting an identifying mark on a connection which
      can also be restored to an ingress packet with tc action connmark e.g.
      
      tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \
      	match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev ifb1
      
      Since tc's connmark action has restored any connmark into skb->mark,
      any of the previous solutions are based upon it and in one form or
      another copy that mark to the skb->priority field where again CAKE
      picks this up.
      
      This change cuts out at least one of the (less intuitive &
      non-scalable) middlemen and permit direct access to skb->mark.
      Signed-off-by: default avatarKevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0b5c7efd