1. 27 Jan, 2020 18 commits
    • Merge branch 'Support-fraglist-GRO-GSO' · 4d434705
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      Support fraglist GRO/GSO
      
      This patchset adds support to do GRO/GSO by chaining packets
      of the same flow at the SKB frag_list pointer. This avoids
      the overhead of merging payloads into one big packet and,
      on the other end, if GSO is needed, it avoids the overhead
      of splitting the big packet back to the native form.
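
      The chaining idea can be sketched in user-space C. This is only a
      minimal model, not the kernel code: struct pkt and its fields are
      hypothetical stand-ins for struct sk_buff, and the helper names are
      made up for illustration. It shows that chaining links packets
      behind the first one without copying payload, and that unchaining
      hands back the original packets unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sk_buff. */
struct pkt {
	struct pkt *next;      /* next packet in a list or chain */
	struct pkt *frag_list; /* head only: first chained packet */
	struct pkt *last;      /* head only: tail of the chain */
	size_t len;            /* head: total length; others: own length */
	size_t data_len;       /* head only: bytes held by chained packets */
	unsigned int segs;     /* head only: packets represented by the chain */
};

/* Initialise a stand-alone packet of the given length. */
static void pkt_init(struct pkt *p, size_t len)
{
	p->next = NULL;
	p->frag_list = NULL;
	p->last = NULL;
	p->len = len;
	p->data_len = 0;
	p->segs = 1;
}

/* GRO side: chain p behind head. No payload is copied or merged. */
static void fraglist_chain(struct pkt *head, struct pkt *p)
{
	assert(p->next == NULL && p->frag_list == NULL);

	if (head->frag_list == NULL)
		head->frag_list = p;
	else
		head->last->next = p;
	head->last = p;

	head->len += p->len;
	head->data_len += p->len;
	head->segs++;
}

/* GSO side: turn the chain back into a plain list linked via ->next. */
static struct pkt *fraglist_unchain(struct pkt *head)
{
	struct pkt *first = head->frag_list;

	head->len -= head->data_len;
	head->data_len = 0;
	head->frag_list = NULL;
	head->last = NULL;
	head->segs = 1;
	head->next = first;

	return head;
}
```

      Both directions only touch pointers and length counters, which is
      the saving the cover letter describes compared with merging and
      re-splitting payloads.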
      
      Patch 1 adds netdev feature flags to enable fraglist GRO;
      this implements one of the configuration options discussed
      at netconf 2019.
      
      Patch 2 adds a netdev software feature set that defaults to off
      and assigns the new fraglist GRO feature flag to it.
      
      Patch 3 adds the core infrastructure to do fraglist GRO/GSO.
      
      Patch 4 enables UDP to use fraglist GRO/GSO if configured.
      
      I only have meaningful performance measurements for forwarding.
      I did some tests for the local receive path with netperf and iperf,
      but in this case the sender that generates the packets is the
      bottleneck. So the benchmarks are not that meaningful for the
      receive path.
      
      Paolo Abeni did some benchmarks of the local receive path for the
      RFC v2 version of this patchset; results can be found here:
      
      https://www.spinics.net/lists/netdev/msg551158.html
      
      I used my IPsec forwarding test setup for the performance measurements:
      
                 ------------         ------------
              -->| router 1 |-------->| router 2 |--
              |  ------------         ------------  |
              |                                     |
              |       --------------------          |
              --------|Spirent Testcenter|<----------
                      --------------------
      
      net-next (September 7th 2019):
      
      Single stream UDP frame size 1460 Bytes: 1.161.000 fps (13.5 Gbps).
      
      ----------------------------------------------------------------------
      
      net-next (September 7th 2019) + standard UDP GRO/GSO (not implemented
      in this patchset):
      
      Single stream UDP frame size 1460 Bytes: 1.801.000 fps (21 Gbps).
      
      ----------------------------------------------------------------------
      
      net-next (September 7th 2019) + fraglist UDP GRO/GSO:
      
      Single stream UDP frame size 1460 Bytes: 2.860.000 fps (33.4 Gbps).
      
      =======================================================================
      
      net-next (January 23rd 2020):
      
      Single stream UDP frame size 1460 Bytes: 919.000 fps (10.73 Gbps).
      
      ----------------------------------------------------------------------
      
      net-next (January 23rd 2020) + fraglist UDP GRO/GSO:
      
      Single stream UDP frame size 1460 Bytes: 2.430.000 fps (28.38 Gbps).
      
      -----------------------------------------------------------------------
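
      The Gbps figures above follow from the frame rate and the
      1460 byte frame size (the fps values use a dot as thousands
      separator, so 919.000 means 919,000). A small sketch of that
      arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in bits per second for a frame rate and frame size. */
static uint64_t fps_to_bps(uint64_t frames_per_sec, uint64_t frame_bytes)
{
	return frames_per_sec * frame_bytes * 8;
}

/* New frame rate as a percentage of the baseline, rounded down. */
static uint64_t rate_percent(uint64_t base_fps, uint64_t new_fps)
{
	assert(base_fps != 0);
	return new_fps * 100 / base_fps;
}
```

      For the January 23rd numbers, fps_to_bps(919000, 1460) gives
      10,733,920,000 bit/s (10.73 Gbps) and fps_to_bps(2430000, 1460)
      gives 28,382,400,000 bit/s (28.38 Gbps), so the fraglist run
      forwards at roughly 2.6 times the baseline frame rate.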
      
      Changes from RFC v1:
      
      - Add IPv6 support.
      - Split patchset to enable UDP GRO by default before adding
        fraglist GRO support.
      - Mark fraglist GRO packets as CHECKSUM_NONE.
      - Take a refcount on the first segment skb when doing fraglist
        segmentation. With this we can use the same error handling
        path as with standard segmentation.
      
      Changes from RFC v2:
      
      - Add a netdev feature flag to configure listified GRO.
      - Fix UDP GRO enabling for IPv6.
      - Fix a rcu_read_lock() imbalance.
      - Fix error path in skb_segment_list().
      
      Changes from RFC v3:
      
      - Rename NETIF_F_GRO_LIST to NETIF_F_GRO_FRAGLIST and add
        NETIF_F_GSO_FRAGLIST.
      - Move introduction of SKB_GSO_FRAGLIST to patch 2.
      - Use udpv6_encap_needed_key instead of udp_encap_needed_key in IPv6.
      - Move some misplaced code from patch 5 to patch 1, where it belongs.
      
      Changes from RFC v4:
      
      - Drop the 'UDP: enable GRO by default' patch for now. Standard UDP GRO
        is not changed with this patchset.
      - Rebase to net-next current.
      
      Changes from v1 (December 18th):
      
      - Do a full __copy_skb_header instead of trying to find the subset
        of header fields that is really needed. This can be done later.
      - Mark all fraglist GRO packets with CHECKSUM_UNNECESSARY.
      - Rebase to net-next current.
      
      Changes from v2 (January 24th):
      
      - Do the CHECKSUM_UNNECESSARY setting from IPv4 for IPv6 too.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Support UDP fraglist GRO/GSO. · 9fd1ff5d
      Steffen Klassert authored
      This patch extends UDP GRO to support fraglist GRO/GSO
      by using the previously introduced infrastructure.
      If the feature is enabled, all UDP packets are going to
      fraglist GRO (local input and forward).
      
      After validating the csum, we mark ip_summed as
      CHECKSUM_UNNECESSARY for fraglist GRO packets to
      make sure that the csum is not touched.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Support GRO/GSO fraglist chaining. · 3a1296a3
      Steffen Klassert authored
      This patch adds the core functions to chain/unchain
      GSO skbs at the frag_list pointer. This also adds
      a new GSO type SKB_GSO_FRAGLIST and a is_flist
      flag to napi_gro_cb which indicates that this
      flow will be GROed by fraglist chaining.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add a netdev software feature set that defaults to off. · 1a3c998f
      Steffen Klassert authored
      The previous patch added the NETIF_F_GRO_FRAGLIST feature.
      This is a software feature that should default to off.
      Current software features default to on, so add a new
      feature set that defaults to off.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add fraglist GRO/GSO feature flags · 3b335832
      Steffen Klassert authored
      This adds new fraglist GRO/GSO feature flags. They will be used
      to configure fraglist GRO/GSO, which will be implemented in
      follow-up patches.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mvneta driver disallow XDP program on hardware buffer management · 79572c98
      Sven Auhagen authored
      Recently XDP Support was added to the mvneta driver
      for software buffer management only.
      It is still possible to attach an XDP program if
      hardware buffer management is used.
      It is not doing anything at that point.
      
      The patch disallows attaching XDP programs to mvneta
      if hardware buffer management is used.
      
      I am sorry about that. It is my first submission and I am having
      some troubles with the format of my emails.
      
      v4 -> v5:
      - Remove extra tabs
      
      v3 -> v4:
      - Please ignore v3; I accidentally submitted
        my other patch with git-send-email. v4 is correct.
      
      v2 -> v3:
      - My mailserver corrupted the patch;
        resubmitted with git-send-email
      
      v1 -> v2:
      - Fix the patch's indentation
      Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix use-after-free in rxrpc_receive_data() · 122d74fa
      David Howells authored
      The subpacket scanning loop in rxrpc_receive_data() references the
      subpacket count in the private data part of the sk_buff in the loop
      termination condition.  However, when the final subpacket is pasted into
      the ring buffer, the function no longer has a ref on the sk_buff and
      should not be looking at sp->* any more.  This point is actually marked in
      the code when skb is cleared (but sp is not - which is an error).
      
      Fix this by caching sp->nr_subpackets in a local variable and using that
      instead.
      
      Also clear 'sp' to catch accesses after that point.
      
      This can show up as an oops in rxrpc_get_skb() if sp->nr_subpackets gets
      trashed by the sk_buff getting freed and reused in the meantime.
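
      The pattern behind the fix can be sketched in plain C. This is not
      the rxrpc code: struct rx_buf, ring_paste() and scan_subpackets()
      are hypothetical, and the "free and reuse" of the buffer is only
      simulated by overwriting the count. It shows the loop bound being
      read once into a local before ownership is handed over, and the
      pointer being cleared afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the private data of an sk_buff. */
struct rx_buf {
	unsigned int nr_subpackets;
};

/*
 * Hypothetical hand-off to the ring buffer. When the last subpacket is
 * pasted the caller's reference is gone, so the buffer may be freed and
 * reused; trashing the count here simulates that.
 */
static void ring_paste(struct rx_buf *b, int last)
{
	if (last)
		b->nr_subpackets = 0xdeadbeef;
}

/* Scan all subpackets; returns how many were pasted. */
static unsigned int scan_subpackets(struct rx_buf *b)
{
	struct rx_buf *sp = b;
	/* Cache the count while we still hold a reference. */
	unsigned int nr_subpackets = sp->nr_subpackets;
	unsigned int j;

	for (j = 0; j < nr_subpackets; j++) {
		int last = (j == nr_subpackets - 1);

		ring_paste(sp, last);
		if (last)
			sp = NULL; /* catch any later access */
	}

	return j;
}
```

      Had the loop condition read sp->nr_subpackets directly, the final
      comparison would have used the trashed value.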
      
      Fixes: e2de6c40 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: ematch: reject invalid TCF_EM_SIMPLE · 55cd9f67
      Eric Dumazet authored
      It is possible for malicious userspace to set TCF_EM_SIMPLE bit
      even for matches that should not have this bit set.
      
      This can fool two places using tcf_em_is_simple():

      1) tcf_em_tree_destroy() -> memory leak of em->data
         if ops->destroy() is NULL

      2) tcf_em_tree_dump() wrongly reports/leaks 4 low-order bytes
         of a kernel pointer.
      
      BUG: memory leak
      unreferenced object 0xffff888121850a40 (size 32):
        comm "syz-executor927", pid 7193, jiffies 4294941655 (age 19.840s)
        hex dump (first 32 bytes):
          00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000f67036ea>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<00000000f67036ea>] slab_post_alloc_hook mm/slab.h:586 [inline]
          [<00000000f67036ea>] slab_alloc mm/slab.c:3320 [inline]
          [<00000000f67036ea>] __do_kmalloc mm/slab.c:3654 [inline]
          [<00000000f67036ea>] __kmalloc_track_caller+0x165/0x300 mm/slab.c:3671
          [<00000000fab0cc8e>] kmemdup+0x27/0x60 mm/util.c:127
          [<00000000d9992e0a>] kmemdup include/linux/string.h:453 [inline]
          [<00000000d9992e0a>] em_nbyte_change+0x5b/0x90 net/sched/em_nbyte.c:32
          [<000000007e04f711>] tcf_em_validate net/sched/ematch.c:241 [inline]
          [<000000007e04f711>] tcf_em_tree_validate net/sched/ematch.c:359 [inline]
          [<000000007e04f711>] tcf_em_tree_validate+0x332/0x46f net/sched/ematch.c:300
          [<000000007a769204>] basic_set_parms net/sched/cls_basic.c:157 [inline]
          [<000000007a769204>] basic_change+0x1d7/0x5f0 net/sched/cls_basic.c:219
          [<00000000e57a5997>] tc_new_tfilter+0x566/0xf70 net/sched/cls_api.c:2104
          [<0000000074b68559>] rtnetlink_rcv_msg+0x3b2/0x4b0 net/core/rtnetlink.c:5415
          [<00000000b7fe53fb>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
          [<00000000e83a40d0>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
          [<00000000d62ba933>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
          [<00000000d62ba933>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
          [<0000000088070f72>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
          [<00000000f70b15ea>] sock_sendmsg_nosec net/socket.c:639 [inline]
          [<00000000f70b15ea>] sock_sendmsg+0x54/0x70 net/socket.c:659
          [<00000000ef95a9be>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
          [<00000000b650f1ab>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
          [<0000000055bfa74a>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
          [<000000002abac183>] __do_sys_sendmsg net/socket.c:2426 [inline]
          [<000000002abac183>] __se_sys_sendmsg net/socket.c:2424 [inline]
          [<000000002abac183>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+03c4738ed29d5d366ddf@syzkaller.appspotmail.com
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: include struct nhmsg size in nh nlmsg size · f9e95555
      Stephen Worley authored
      Include the size of struct nhmsg when calculating
      how much of a payload to allocate in a new netlink nexthop
      notification message.
      
      Without this, we will fail to fill the skbuff at certain nexthop
      group sizes.
      
      You can reproduce the failure with the following iproute2 commands:
      
      ip link add dummy1 type dummy
      ip link add dummy2 type dummy
      ip link add dummy3 type dummy
      ip link add dummy4 type dummy
      ip link add dummy5 type dummy
      ip link add dummy6 type dummy
      ip link add dummy7 type dummy
      ip link add dummy8 type dummy
      ip link add dummy9 type dummy
      ip link add dummy10 type dummy
      ip link add dummy11 type dummy
      ip link add dummy12 type dummy
      ip link add dummy13 type dummy
      ip link add dummy14 type dummy
      ip link add dummy15 type dummy
      ip link add dummy16 type dummy
      ip link add dummy17 type dummy
      ip link add dummy18 type dummy
      ip link add dummy19 type dummy
      
      ip ro add 1.1.1.1/32 dev dummy1
      ip ro add 1.1.1.2/32 dev dummy2
      ip ro add 1.1.1.3/32 dev dummy3
      ip ro add 1.1.1.4/32 dev dummy4
      ip ro add 1.1.1.5/32 dev dummy5
      ip ro add 1.1.1.6/32 dev dummy6
      ip ro add 1.1.1.7/32 dev dummy7
      ip ro add 1.1.1.8/32 dev dummy8
      ip ro add 1.1.1.9/32 dev dummy9
      ip ro add 1.1.1.10/32 dev dummy10
      ip ro add 1.1.1.11/32 dev dummy11
      ip ro add 1.1.1.12/32 dev dummy12
      ip ro add 1.1.1.13/32 dev dummy13
      ip ro add 1.1.1.14/32 dev dummy14
      ip ro add 1.1.1.15/32 dev dummy15
      ip ro add 1.1.1.16/32 dev dummy16
      ip ro add 1.1.1.17/32 dev dummy17
      ip ro add 1.1.1.18/32 dev dummy18
      ip ro add 1.1.1.19/32 dev dummy19
      
      ip next add id 1 via 1.1.1.1 dev dummy1
      ip next add id 2 via 1.1.1.2 dev dummy2
      ip next add id 3 via 1.1.1.3 dev dummy3
      ip next add id 4 via 1.1.1.4 dev dummy4
      ip next add id 5 via 1.1.1.5 dev dummy5
      ip next add id 6 via 1.1.1.6 dev dummy6
      ip next add id 7 via 1.1.1.7 dev dummy7
      ip next add id 8 via 1.1.1.8 dev dummy8
      ip next add id 9 via 1.1.1.9 dev dummy9
      ip next add id 10 via 1.1.1.10 dev dummy10
      ip next add id 11 via 1.1.1.11 dev dummy11
      ip next add id 12 via 1.1.1.12 dev dummy12
      ip next add id 13 via 1.1.1.13 dev dummy13
      ip next add id 14 via 1.1.1.14 dev dummy14
      ip next add id 15 via 1.1.1.15 dev dummy15
      ip next add id 16 via 1.1.1.16 dev dummy16
      ip next add id 17 via 1.1.1.17 dev dummy17
      ip next add id 18 via 1.1.1.18 dev dummy18
      ip next add id 19 via 1.1.1.19 dev dummy19
      
      ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
      ip next del id 1111
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: walk through all child classes in tc_bind_tclass() · 760d228e
      Cong Wang authored
      In a complex TC class hierarchy like this:
      
      tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit         \
        avpkt 1000 cell 8
      tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit  \
        rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20      \
        avpkt 1000 bounded
      
      tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
        sport 80 0xffff flowid 1:3
      tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
        sport 25 0xffff flowid 1:4
      
      tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit  \
        rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20      \
        avpkt 1000
      tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit  \
        rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20      \
        avpkt 1000
      
      where filters are installed on qdisc 1:0, we can't merely
      search from class 1:1 when creating class 1:3 and class 1:4. We have
      to walk through all the child classes of the direct parent qdisc.
      Otherwise we would miss filters that need reverse binding.
      
      Fixes: 07d79fc7 ("net_sched: add reverse binding for tc class")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix ops->bind_class() implementations · 2e24cd75
      Cong Wang authored
      The current implementations of ops->bind_class() merely
      search for the classid and update the class in struct tcf_result,
      without invoking either cl_ops->bind_tcf() or
      cl_ops->unbind_tcf(). This breaks their design, as qdiscs
      like cbq also use them to count filters. This is why syzbot triggered
      the warning in cbq_destroy_class().
      
      In order to fix this, we have to call cl_ops->bind_tcf() and
      cl_ops->unbind_tcf() like the filter binding path. This patch does
      so by refactoring out two helper functions __tcf_bind_filter()
      and __tcf_unbind_filter(), which are lockless and accept a Qdisc
      pointer, then teaching each implementation to call them correctly.
      
      Note, we merely pass the Qdisc pointer as an opaque pointer to
      each filter; they only need to pass it down to the helper
      functions without understanding it at all.
      
      Fixes: 07d79fc7 ("net_sched: add reverse binding for tc class")
      Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 16b25d1a
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      This batch contains Netfilter updates for net-next:
      
      1) Add nft_setelem_parse_key() helper function.
      
      2) Add NFTA_SET_ELEM_KEY_END to specify a range with one single element.
      
      3) Add NFTA_SET_DESC_CONCAT to describe the set element concatenation,
         from Stefano Brivio.
      
      4) Add bitmap_cut() to copy n-bits from source to destination,
         from Stefano Brivio.
      
      5) Add set to match on arbitrary concatenations, from Stefano Brivio.
      
      6) Add selftest for this new set type. An extract of Stefano's
         description follows:
      
      "Existing nftables set implementations allow matching entries with
      interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, entries
      specifying field concatenation (hash, rhash), e.g. 192.0.2.1:22,
      but not both.
      
      In other words, none of the set types allows matching on range
      expressions for more than one packet field at a time, such as ipset
      does with types bitmap:ip,mac, and, to a more limited extent
      (netmasks, not arbitrary ranges), with types hash:net,net,
      hash:net,port, hash:ip,port,net, and hash:net,port,net.
      
      As a pure hash-based approach is unsuitable for matching on ranges,
      and "proxying" the existing red-black tree type looks impractical as
      elements would need to be shared and managed across all employed
      trees, this new set implementation intends to fill the functionality
      gap by employing a relatively novel approach.
      
      The fundamental idea, illustrated in deeper detail in patch 5/9, is to
      use lookup tables classifying a small number of grouped bits from each
      field, and map the lookup results in a way that yields a verdict for
      the full set of specified fields.
      
      The grouping bit aspect is loosely inspired by the Grouper algorithm,
      by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 5/9 for the full
      reference).
      
      A reference, stand-alone implementation of the algorithm itself is
      available at:
              https://pipapo.lameexcu.se
      
      Some notes about possible future optimisations are also mentioned
      there. This algorithm reduces the matching problem to, essentially,
      a repetitive sequence of simple bitwise operations, and is
      particularly suitable to be optimised by leveraging SIMD instruction
      sets."
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: netfilter: Introduce tests for sets with range concatenation · 611973c1
      Stefano Brivio authored
      This test covers functionality and stability of the newly added
      nftables set implementation supporting concatenation of ranged
      fields.
      
      For some selected set expression types, test:
      - correctness, by checking that packets match or don't
      - concurrency, by attempting races between insertion, deletion, lookup
      - timeout feature, checking that packets don't match expired entries
      
      and (roughly) estimate matching rates, comparing to baselines for
      simple drop on netdev ingress hook and for hash and rbtrees sets.
      
      In order to send packets, this needs one of sendip, netcat or bash.
      To flood with traffic, iperf3, iperf and netperf are supported. For
      performance measurements, this relies on the sample pktgen script
      pktgen_bench_xmit_mode_netif_receive.sh.
      
      If none of the tools suitable for a given test are available, specific
      tests will be skipped.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • nf_tables: Add set type for arbitrary concatenation of ranges · 3c4287f6
      Stefano Brivio authored
      This new set type allows for intervals in concatenated fields,
      which are expressed in the usual way, that is, simple byte
      concatenation with padding to 32 bits for single fields, and
      given as ranges by specifying start and end elements containing,
      each, the full concatenation of start and end values for the
      single fields.
      
      Ranges are expanded to composing netmasks, for each field: these
      are inserted as rules in per-field lookup tables. Bits to be
      classified are divided in 4-bit groups, and for each group, the
      lookup table contains 2^4 buckets, representing all the possible
      values of a bit group. This approach was inspired by the Grouper
      algorithm:
      	http://www.cse.usf.edu/~ligatti/projects/grouper/
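
      As an illustration of expanding a range into netmasks, here is a
      generic sketch for a single 32-bit field. It is the textbook
      decomposition into aligned power-of-two blocks, not the expansion
      code from nft_set_pipapo.c; struct prefix and range_to_prefixes()
      are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* One netmask rule: base address plus prefix length in bits. */
struct prefix {
	uint32_t base;
	unsigned int len;
};

/*
 * Split the inclusive range [start, end] into the smallest list of
 * aligned prefixes. Returns the number of prefixes written to out, or
 * -1 if more than max entries would be needed.
 */
static int range_to_prefixes(uint32_t start, uint32_t end,
			     struct prefix *out, int max)
{
	uint64_t cur = start; /* 64-bit so that end == UINT32_MAX terminates */
	int n = 0;

	while (cur <= end) {
		unsigned int len = 32;

		/* Widen the block while it stays aligned and inside the range. */
		while (len > 0) {
			uint64_t size = 1ULL << (32 - (len - 1));

			if ((cur & (size - 1)) != 0 || cur + size - 1 > end)
				break;
			len--;
		}

		if (n == max)
			return -1;
		out[n].base = (uint32_t)cur;
		out[n].len = len;
		n++;

		cur += 1ULL << (32 - len);
	}

	return n;
}
```

      For 192.0.2.0 to 192.0.2.42 this yields four rules: 192.0.2.0/27,
      192.0.2.32/29, 192.0.2.40/31 and 192.0.2.42/32.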
      
      Matching is performed by a sequence of AND operations between
      bucket values, with buckets selected according to the value of
      packet bits, for each group. The result of this sequence tells
      us which rules matched for a given field.
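
      The bucket lookup can be sketched for a single 8-bit field. This
      is a toy model with hypothetical names and limits (two 4-bit
      groups, at most 32 rules kept in a uint32_t bitmap), not the
      kernel data layout. Each bucket holds the set of rules compatible
      with that group value, and matching ANDs one bucket per group.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUPS  2  /* an 8-bit field split into two 4-bit groups */
#define BUCKETS 16 /* 2^4 possible values per group */

/* Per-field lookup table: one rule bitmap per group and bucket. */
struct field_lt {
	uint32_t bucket[GROUPS][BUCKETS];
};

static void lt_init(struct field_lt *lt)
{
	memset(lt, 0, sizeof(*lt));
}

/* Insert netmask rule value/prefix_len as rule number 'rule'. */
static void lt_add_rule(struct field_lt *lt, unsigned int rule,
			uint8_t value, unsigned int prefix_len)
{
	uint8_t mask;
	unsigned int g, v;

	assert(rule < 32 && prefix_len <= 8);
	mask = prefix_len ? (uint8_t)(0xff << (8 - prefix_len)) : 0;

	for (g = 0; g < GROUPS; g++) {
		unsigned int shift = 4 * (GROUPS - 1 - g);
		unsigned int gval = (value >> shift) & 0xf;
		unsigned int gmask = (mask >> shift) & 0xf;

		/* Set the rule bit in every bucket the rule can match. */
		for (v = 0; v < BUCKETS; v++)
			if ((v & gmask) == (gval & gmask))
				lt->bucket[g][v] |= 1u << rule;
	}
}

/* Returns a bitmap of the rules matching key. */
static uint32_t lt_lookup(const struct field_lt *lt, uint8_t key)
{
	uint32_t res = ~0u;
	unsigned int g;

	for (g = 0; g < GROUPS; g++) {
		unsigned int shift = 4 * (GROUPS - 1 - g);

		res &= lt->bucket[g][(key >> shift) & 0xf];
	}

	return res;
}
```

      With rules 0x20/4, 0x28/6 and 0x2a/8 inserted as rules 0, 1 and 2,
      a lookup of 0x29 returns the bitmap 0x3 (rules 0 and 1) after just
      two AND operations.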
      
      In order to concatenate several ranged fields, per-field rules
      are mapped using mapping arrays, one per field, that specify
      which rules should be considered while matching the next field.
      The mapping array for the last field contains a reference to
      the element originally inserted.
      
      The notes in nft_set_pipapo.c cover the algorithm in deeper
      detail.
      
      A pure hash-based approach is of no use here, as ranges need
      to be classified. An implementation based on "proxying" the
      existing red-black tree set type, creating a tree for each
      field, was considered, but deemed impractical due to the fact
      that elements would need to be shared between trees, at least
      as long as we want to keep UAPI changes to a minimum.
      
      A stand-alone implementation of this algorithm is available at:
      	https://pipapo.lameexcu.se
      together with notes about possible future optimisations
      (in pipapo.c).
      
      This algorithm was designed with data locality in mind, and can
      be highly optimised for SIMD instruction sets, as the bulk of
      the matching work is done with repetitive, simple bitwise
      operations.
      
      At this point, without further optimisations, nft_concat_range.sh
      reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
      L2$):
      
      TEST: performance
        net,port                                                      [ OK ]
          baseline (drop from netdev hook):              10190076pps
          baseline hash (non-ranged entries):             6179564pps
          baseline rbtree (match on first field only):    2950341pps
          set with  1000 full, ranged entries:            2304165pps
        port,net                                                      [ OK ]
          baseline (drop from netdev hook):              10143615pps
          baseline hash (non-ranged entries):             6135776pps
          baseline rbtree (match on first field only):    4311934pps
          set with   100 full, ranged entries:            4131471pps
        net6,port                                                     [ OK ]
          baseline (drop from netdev hook):               9730404pps
          baseline hash (non-ranged entries):             4809557pps
          baseline rbtree (match on first field only):    1501699pps
          set with  1000 full, ranged entries:            1092557pps
        port,proto                                                    [ OK ]
          baseline (drop from netdev hook):              10812426pps
          baseline hash (non-ranged entries):             6929353pps
          baseline rbtree (match on first field only):    3027105pps
          set with 30000 full, ranged entries:             284147pps
        net6,port,mac                                                 [ OK ]
          baseline (drop from netdev hook):               9660114pps
          baseline hash (non-ranged entries):             3778877pps
          baseline rbtree (match on first field only):    3179379pps
          set with    10 full, ranged entries:            2082880pps
        net6,port,mac,proto                                           [ OK ]
          baseline (drop from netdev hook):               9718324pps
          baseline hash (non-ranged entries):             3799021pps
          baseline rbtree (match on first field only):    1506689pps
          set with  1000 full, ranged entries:             783810pps
        net,mac                                                       [ OK ]
          baseline (drop from netdev hook):              10190029pps
          baseline hash (non-ranged entries):             5172218pps
          baseline rbtree (match on first field only):    2946863pps
          set with  1000 full, ranged entries:            1279122pps
      
      v4:
       - fix build for 32-bit architectures: 64-bit division needs
         div_u64() (kbuild test robot <lkp@intel.com>)
      v3:
       - rework interface for field length specification,
         NFT_SET_SUBKEY disappears and information is stored in
         description
       - remove scratch area to store closing element of ranges,
         as elements now come with an actual attribute to specify
         the upper range limit (Pablo Neira Ayuso)
       - also remove pointer to 'start' element from mapping table,
         closing key is now accessible via extension data
       - use bytes right away instead of bits for field lengths,
         this way we can also double the inner loop of the lookup
         function to take care of upper and lower bits in a single
         iteration (minor performance improvement)
       - make it clearer that set operations are actually atomic
         API-wise, but we can't e.g. implement flush() as one-shot
         action
       - fix type for 'dup' in nft_pipapo_insert(), check for
         duplicates only in the next generation, and in general take
         care of differentiating generation mask cases depending on
         the operation (Pablo Neira Ayuso)
       - report C implementation matching rate in commit message, so
         that AVX2 implementation can be compared (Pablo Neira Ayuso)
      v2:
       - protect access to scratch maps in nft_pipapo_lookup() with
         local_bh_disable/enable() (Florian Westphal)
       - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
         already implied (Florian Westphal)
       - explain why partial allocation failures don't need handling
         in pipapo_realloc_scratch(), rename 'm' to clone and update
         related kerneldoc to make it clear we're not operating on
         the live copy (Florian Westphal)
       - add explicit check for priv->start_elem in
         nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
         with a NULL start element, and also zero it out in every
         operation that might make it invalid, so that insertion
         doesn't proceed with an invalid element (Florian Westphal)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • bitmap: Introduce bitmap_cut(): cut bits and shift remaining · 20927671
      Stefano Brivio authored
      The new bitmap function bitmap_cut() copies bits from source to
      destination by removing the region specified by parameters first
      and cut, and remapping the bits above the cut region by right
      shifting them.
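
      The described semantics can be sketched bit by bit in user-space
      C. This is an illustrative model, not the kernel implementation:
      bitmap_cut_sketch() and its helpers are hypothetical names, and
      clearing the vacated top bits is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static int get_bit(const unsigned long *map, unsigned int i)
{
	return (map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

static void put_bit(unsigned long *map, unsigned int i, int v)
{
	unsigned long mask = 1UL << (i % BITS_PER_WORD);

	if (v)
		map[i / BITS_PER_WORD] |= mask;
	else
		map[i / BITS_PER_WORD] &= ~mask;
}

/*
 * Copy nbits from src to dst, dropping the 'cut' bits starting at bit
 * 'first' and shifting the bits above that region down by 'cut'.
 */
static void bitmap_cut_sketch(unsigned long *dst, const unsigned long *src,
			      unsigned int first, unsigned int cut,
			      unsigned int nbits)
{
	unsigned int i;

	assert(first + cut <= nbits);

	/* Bits below the cut region are copied unchanged. */
	for (i = 0; i < first; i++)
		put_bit(dst, i, get_bit(src, i));

	/* Bits above the cut region move down by 'cut' positions. */
	for (i = first; i + cut < nbits; i++)
		put_bit(dst, i, get_bit(src, i + cut));

	/* Assumption of this sketch: the vacated top bits are cleared. */
	for (; i < nbits; i++)
		put_bit(dst, i, 0);
}
```

      For example, cutting 3 bits starting at bit 2 out of the 8-bit
      value 0xb5 (1011 0101) leaves 0x15 (0001 0101).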
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: Support for sets with multiple ranged fields · f3a2181e
      Stefano Brivio authored
      Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
      to specify the length of each field in a set concatenation.
      
      This allows set implementations to support concatenation of multiple
      ranged items, as they can divide the input key into matching data for
      every single field. Such set implementations would be selected as
      they specify support for NFT_SET_INTERVAL and allow desc->field_count
      to be greater than one. Explicitly disallow this for nft_set_rbtree.
      
      In order to specify the interval for a set entry, userspace would
      include the field lengths in NFTA_SET_DESC_CONCAT attributes, and
      pass the range endpoints as two separate keys, represented by the
      attributes NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.
      
      While at it, export the number of 32-bit registers available for
      packet matching, as nftables will need this to know the maximum
      number of field lengths that can be specified.
      
      For example, "packets with an IPv4 address between 192.0.2.0 and
      192.0.2.42, with destination port between 22 and 25", can be
      expressed as two concatenated elements:
      
        NFTA_SET_ELEM_KEY:            192.0.2.0 . 22
        NFTA_SET_ELEM_KEY_END:        192.0.2.42 . 25
      
      and NFTA_SET_DESC_CONCAT attribute would contain:
      
        NFTA_LIST_ELEM
          NFTA_SET_FIELD_LEN:		4
        NFTA_LIST_ELEM
          NFTA_SET_FIELD_LEN:		2
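
Why the per-field lengths matter can be shown with a small userspace sketch. It is not code from this patch: concat_range_match() is a hypothetical helper, keys are assumed to be tightly packed in network byte order, and the kernel's actual key layout (for instance any per-field padding to 32-bit registers) is not modelled.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical illustration: return 1 if every field of 'key' lies within
 * the closed interval given by the same field of 'start' and 'end'.
 * field_len[] holds one length in bytes per field, as the
 * NFTA_SET_FIELD_LEN attributes would.
 */
int concat_range_match(const unsigned char *key,
		       const unsigned char *start,
		       const unsigned char *end,
		       const unsigned char *field_len,
		       size_t field_count)
{
	size_t off = 0;
	size_t i;

	for (i = 0; i < field_count; i++) {
		size_t len = field_len[i];

		/* memcmp() on big-endian bytes orders fields numerically. */
		if (memcmp(key + off, start + off, len) < 0 ||
		    memcmp(key + off, end + off, len) > 0)
			return 0;

		off += len;
	}

	return 1;
}
```

With the interval from the example, 192.0.2.10 port 23 matches, while 192.0.2.0 port 26 does not, even though a single comparison over the whole 6-byte key would place it between the two endpoints. Splitting the key by field length is what makes each field an independent range.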
      
      v4: No changes
      v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
      v2: No changes
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f3a2181e
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute · 7b225d0b
      Pablo Neira Ayuso authored
      Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
      interval between kernel and userspace.
      
      This patch also adds the NFT_SET_EXT_KEY_END extension to store the
      closing element value in this interval.
      
      v4: No changes
      v3: New patch
      
      [sbrivio: refactor error paths and labels; add corresponding
        nft_set_ext_type for new key; rebase]
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      7b225d0b
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: add nft_setelem_parse_key() · 20a1452c
      Pablo Neira Ayuso authored
      Add helper function to parse the set element key netlink attribute.
      
      v4: No changes
      v3: New patch
      
      [sbrivio: refactor error paths and labels; use NFT_DATA_VALUE_MAXLEN
        instead of sizeof(*key) in helper, value can be longer than that;
        rebase]
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      20a1452c
  2. 26 Jan, 2020 15 commits
  3. 25 Jan, 2020 7 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm · 2821e26f
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
      
       - fix ftrace relocation type filtering
      
       - relax arch timer version check
      
      * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
        ARM: 8955/1: virt: Relax arch timer version check during early boot
        ARM: 8950/1: ftrace/recordmcount: filter relocation types
      2821e26f
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 84809aaf
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Off by one in mt76 airtime calculation, from Dan Carpenter.
      
       2) Fix TLV fragment allocation loop condition in iwlwifi, from Luca
          Coelho.
      
       3) Don't confirm neigh entries when doing ipsec pmtu updates, from Xu
          Wang.
      
       4) More checks to make sure we only send TSO packets to lan78xx chips
          that they can actually handle. From James Hughes.
      
       5) Fix ip_tunnel namespace move, from William Dauchy.
      
       6) Fix unintended packet reordering due to cooperation between
          listification done by GRO and non-GRO paths. From Maxim
          Mikityanskiy.
      
       7) Add Jakub Kicinski formally as networking co-maintainer.
      
       8) Info leak in airo ioctls, from Michael Ellerman.
      
       9) IFLA_MTU attribute needs validation during rtnl_create_link(), from
          Eric Dumazet.
      
      10) Use after free during reload in mlxsw, from Ido Schimmel.
      
      11) Dangling pointers are possible in tp->highest_sack, fix from Eric
          Dumazet.
      
      12) Missing *pos++ in various networking seq_next handlers, from Vasily
          Averin.
      
      13) CHELSIO_GET_MEM operation needs CAP_NET_ADMIN check, from Michael
          Ellerman.
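
Item 12 concerns iterator callbacks that must advance the position index on every call. A minimal userspace sketch of the idea follows; table_sketch and table_seq_next() are hypothetical names, and the real seq_file callbacks have a different signature.

```c
#include <stddef.h>

/* Hypothetical table being iterated. */
struct table_sketch {
	const int *items;
	size_t count;
};

/*
 * Hypothetical "next" handler: the position is advanced on every call,
 * including the call that runs off the end and returns NULL, so the
 * caller never sees a stale index.
 */
const int *table_seq_next(const struct table_sketch *t, long *pos)
{
	++*pos;

	if (*pos < 0 || (size_t)*pos >= t->count)
		return NULL;

	return &t->items[*pos];
}
```

The point of the sketch is only the unconditional increment; the bug class fixed here is a handler that returns early without updating the position.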
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (109 commits)
        firestream: fix memory leaks
        net: cxgb3_main: Add CAP_NET_ADMIN check to CHELSIO_GET_MEM
        net: bcmgenet: Use netif_tx_napi_add() for TX NAPI
        tipc: change maintainer email address
        net: stmmac: platform: fix probe for ACPI devices
        net/mlx5e: kTLS, Do not send decrypted-marked SKBs via non-accel path
        net/mlx5e: kTLS, Remove redundant posts in TX resync flow
        net/mlx5e: kTLS, Fix corner-case checks in TX resync flow
        net/mlx5e: Clear VF config when switching modes
        net/mlx5: DR, use non preemptible call to get the current cpu number
        net/mlx5: E-Switch, Prevent ingress rate configuration of uplink rep
        net/mlx5: DR, Enable counter on non-fwd-dest objects
        net/mlx5: Update the list of the PCI supported devices
        net/mlx5: Fix lowest FDB pool size
        net: Fix skb->csum update in inet_proto_csum_replace16().
        netfilter: nf_tables: autoload modules from the abort path
        netfilter: nf_tables: add __nft_chain_type_get()
        netfilter: nf_tables_offload: fix check the chain offload flag
        netfilter: conntrack: sctp: use distinct states for new SCTP connections
        ipv6_route_seq_next should increase position index
        ...
      84809aaf
    • Linus Torvalds's avatar
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · f041eada
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A couple of fixes have come in that would be good to include in this
        release:
      
         - A fix for amount of memory on Beaglebone Black. Surfaced now since
           GRUB2 doesn't update memory size in the booted kernel.
      
         - A fix to make SPI interfaces work on am43x-epos-evm.
      
         - Small Kconfig fix for OPTEE (adds a depend on MMU) to avoid build
           failures"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        ARM: dts: am43x-epos-evm: set data pin directions for spi0 and spi1
        tee: optee: Fix compilation issue with nommu
        ARM: dts: am335x-boneblack-common: fix memory size
      f041eada
    • Wenwen Wang's avatar
      firestream: fix memory leaks · fa865ba1
      Wenwen Wang authored
      In fs_open(), 'vcc' is allocated through kmalloc() and assigned to
      'atm_vcc->dev_data'. If an error occurs later in the function, e.g.,
      there is no free channel left, an error code (EBUSY or ENOMEM) is
      returned, but 'vcc' is not deallocated, leading to memory leaks. Note
      that in normal cases, where fs_open() returns 0, 'vcc' is deallocated
      in fs_close(). If fs_open() fails, however, there is no guarantee that
      fs_close() will be invoked.
      
      To fix this issue, deallocate 'vcc' before the error code is returned.
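
The shape of the fix can be sketched in userspace C. This is not the driver code: conn_sketch, chan_state, open_sketch() and close_sketch() are hypothetical stand-ins, malloc()/free() replace kmalloc()/kfree(), and clearing the dangling pointer on failure is a choice of this sketch.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the connection object holding dev_data. */
struct conn_sketch {
	void *dev_data;
};

/* Hypothetical per-connection state, playing the role of 'vcc'. */
struct chan_state {
	int channel;
};

/* Hypothetical open routine with the leak on the error path closed. */
int open_sketch(struct conn_sketch *conn, int free_channels)
{
	struct chan_state *vcc = malloc(sizeof(*vcc));

	if (!vcc)
		return -ENOMEM;

	conn->dev_data = vcc;

	if (free_channels <= 0) {
		/* The close routine may never run, so free here. */
		conn->dev_data = NULL;
		free(vcc);
		return -EBUSY;
	}

	vcc->channel = free_channels - 1;
	return 0;
}

/* Hypothetical close routine: releases state from a successful open. */
void close_sketch(struct conn_sketch *conn)
{
	free(conn->dev_data);
	conn->dev_data = NULL;
}
```

The allocation has exactly one owner at any time: the error path inside the open routine, or the close routine after a successful open.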
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa865ba1
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 6badad1c
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Missing netlink attribute sanity check for NFTA_OSF_DREG,
         from Florian Westphal.
      
      2) Use bitmap infrastructure in ipset to fix KASAN slab-out-of-bounds
         reads, from Jozsef Kadlecsik.
      
      3) Missing initial CLOSED state in new sctp connection through
         ctnetlink events, from Jiri Wiesner.
      
      4) Missing check for NFT_CHAIN_HW_OFFLOAD in nf_tables offload
         indirect block infrastructure, from wenxu.
      
      5) Add __nft_chain_type_get() to sanity check family and chain type.
      
      6) Autoload modules from the nf_tables abort path to fix races
         reported by syzbot.
      
      7) Remove unnecessary skb->csum update on inet_proto_csum_replace16(),
         from Praveen Chaudhary.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6badad1c
    • Linus Torvalds's avatar
      Merge tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · a075f23d
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "Here's a last minute fix for a regression introduced in this
        development cycle.
      
        There's a small chance of a silent corruption when device replace and
        NOCOW data writes happen at the same time in one block group. Metadata
        or COW data writes are unaffected.
      
        The extra fixup patch is there to silence an unnecessary warning"
      
      * tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: dev-replace: remove warning for unknown return codes when finished
        btrfs: scrub: Require mandatory block group RO for dev-replace
      a075f23d
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 93d1a05e
      Linus Torvalds authored
      Pull pin control fix from Linus Walleij:
       "A single fix for the Intel Sunrisepoint pin controller that makes the
        interrupts work properly on it"
      
      * tag 'pinctrl-v5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: sunrisepoint: Add missing Interrupt Status register offset
      93d1a05e