1. 16 Jul, 2018 26 commits
    • netfilter: nft_reject_bridge: remove unnecessary ttl set · 6542df2f
      Taehee Yoo authored
      In nft_reject_br_send_v4_tcp_reset(), the ttl is already set by
      nf_reject_iphdr_put(), so the explicit ttl assignment removed here is
      unnecessary.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge branch 'TLS-offload-rx-netdev-and-mlx5' · aea06eb2
      David S. Miller authored
      Boris Pismenny says:
      
      ====================
      TLS offload rx, netdev & mlx5
      
      The following series provides TLS RX inline crypto offload.
      
      v4->v5:
          - Remove the Kconfig to mutually exclude both IPsec and TLS
      
      v3->v4:
          - Remove the iov revert for zero copy send flow
      
      v2->v3:
          - Fix typo
          - Adjust cover letter
          - Fix bug in zero copy flows
          - Use network byte order for the record number in resync
          - Adjust the sequence provided in resync
      
      v1->v2:
          - Fix bisectability problems due to variable name changes
          - Fix potential uninitialized return value
      
      This series completes the generic infrastructure to offload TLS crypto to
      a network device. It enables the kernel TLS socket to skip decryption and
      authentication operations for SKBs marked as decrypted on the receive
      side of the data path, leaving those computationally expensive operations
      to the NIC.
      
      This infrastructure doesn't require a TCP offload engine. Instead, the
      NIC decrypts a packet's payload if the packet contains the expected TCP
      sequence number. The TLS record authentication tag remains unmodified
      regardless of decryption. If the packet is decrypted successfully and it
      contains an authentication tag, then the authentication check has passed.
      Otherwise, if the authentication fails, then the packet is provided
      unmodified and the KTLS layer is responsible for handling it.
      Out-Of-Order TCP packets are provided unmodified. As a result,
      in the slow path some of the SKBs are decrypted while others remain as
      ciphertext.
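
As a rough illustration of the in-sequence rule described above, the following Python sketch models a NIC that decrypts a packet only when its TCP sequence number is the one it expects. The class and field names, the XOR stand-in for the real cipher, and the choice to leave the expected sequence unchanged on a mismatch are all assumptions made for this example; none of them is taken from the mlx5 driver or the kernel TLS code.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int                 # TCP sequence number of the first payload byte
    payload: bytes           # payload as seen on the wire (ciphertext)
    decrypted: bool = False  # models the "decrypted" mark on the skb


class NicRxModel:
    """Toy model of a NIC that only decrypts in-sequence packets."""

    def __init__(self, expected_seq: int, key: int) -> None:
        self.expected_seq = expected_seq
        self.key = key  # hypothetical one-byte key, stands in for TLS keys

    def receive(self, pkt: Packet) -> Packet:
        if pkt.seq != self.expected_seq:
            # Out-of-order packet: hand it up unmodified, still ciphertext.
            return pkt
        # In-order packet: "decrypt" it (XOR is only a placeholder cipher)
        # and advance the sequence number expected next.
        plain = bytes(b ^ self.key for b in pkt.payload)
        self.expected_seq += len(pkt.payload)
        return Packet(pkt.seq, plain, decrypted=True)
```

Feeding this model a packet with a gap in the sequence space yields an unmarked ciphertext packet, which is the situation that produces mixed plaintext and ciphertext records in the slow path.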
      
      The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs.
      In the worst case a received TLS record consists of both plaintext
      and ciphertext packets. These partially decrypted records must be
      re-encrypted, only to be decrypted again.
      
      The notable differences between SW KTLS and NIC offloaded TLS
      implementations are as follows:
      1. Partial decryption - Software must handle the case of a TLS record
      that was only partially decrypted by HW. This can happen due to packet
      reordering.
      2. Resynchronization - tls_read_size calls the device driver to
      resynchronize HW whenever it loses track of the TLS record framing in
      the TCP stream.
      
      The infrastructure should be extendable to support various NIC offload
      implementations. However, it is currently written with the
      implementation below in mind:
      The NIC identifies packets that should be offloaded according to
      the 5-tuple and the TCP sequence number. If these match and the
      packet is decrypted and authenticated successfully, then a syndrome
      is provided to software. Otherwise, the packet is unmodified.
      Decrypted and non-decrypted packets aren't coalesced by the network stack,
      and the KTLS layer decrypts and authenticates partially decrypted records.
      The NIC provides an indication whenever a resync is required. The resync
      operation is triggered by the KTLS layer while parsing TLS record headers.
      
      Finally, we measure the performance obtained by running single stream
      iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected
      back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound)
      and KTLS-Offload running both in Tx and Rx. The results show that the
      performance of offload is comparable to TCP.
      
                                | Bandwidth (Gbps) | CPU Tx (%) | CPU Rx (%)
      TCP                       | 28.8             | 5          | 12
      KTLS-Offload-Tx-Rx        | 28.6             | 7          | 14
      
      Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: IPsec, fix byte count in CQE · b3ccf978
      Boris Pismenny authored
      This patch fixes the byte count indication in CQE for processed IPsec
      packets that contain a metadata header.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Accel, add common metadata functions · 10e71acc
      Boris Pismenny authored
      This patch adds common functions to handle Mellanox metadata headers.
      These functions are used by IPsec and TLS to process FPGA metadata.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: TLS, build TLS netdev from capabilities · 790af90c
      Boris Pismenny authored
      This patch enables TLS Rx based on available HW capabilities.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: TLS, add software statistics · afd3baaa
      Boris Pismenny authored
      This patch adds software statistics for TLS to count important
      events.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: TLS, add Innova TLS rx data path · 00aebab2
      Boris Pismenny authored
      Implement the TLS rx offload data path according to the
      requirements of the TLS generic NIC offload infrastructure.
      
      A special metadata ethertype is used to pass information to
      the hardware.
      
      When the hardware loses synchronization, a special resync request
      metadata message is used to request a resync.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: TLS, add innova rx support · ca942c78
      Boris Pismenny authored
      Add the mlx5 implementation of the TLS Rx routines to add/del TLS
      contexts, also add the tls_dev_resync_rx routine
      to work with the TLS inline Rx crypto offload infrastructure.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Accel, add TLS rx offload routines · ab412e1d
      Boris Pismenny authored
      In Innova TLS, TLS contexts are added or deleted
      via a command message over the SBU connection.
      The HW then sends a response message over the same connection.
      
      Complete the implementation for Innova TLS (FPGA-based) hardware by
      adding support for rx inline crypto offload.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: TLS, refactor variable names · 0aadb2fc
      Boris Pismenny authored
      For symmetry, we rename mlx5e_tls_offload_context to
      mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Fix zerocopy_from_iter iov handling · 47187998
      Boris Pismenny authored
      zerocopy_from_iter iterates over the message, but it doesn't revert the
      updates made by the iov iteration. This patch fixes it. Now, the iov can
      be used after calling zerocopy_from_iter.
      
      Fixes: 3c4d7559 ("tls: kernel TLS support")
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Add rx inline crypto offload · 4799ac81
      Boris Pismenny authored
      This patch completes the generic infrastructure to offload TLS crypto to a
      network device. It enables the kernel to skip decryption and
      authentication of some skbs marked as decrypted by the NIC. In the fast
      path, all packets received are decrypted by the NIC and the performance
      is comparable to plain TCP.
      
      This infrastructure doesn't require a TCP offload engine. Instead, the
      NIC only decrypts packets that contain the expected TCP sequence number.
      Out-Of-Order TCP packets are provided unmodified. As a result, in the
      worst case a received TLS record consists of both plaintext and ciphertext
      packets. These partially decrypted records must be re-encrypted,
      only to be decrypted again.
      
      The notable differences between SW KTLS Rx and this offload are as
      follows:
      1. Partial decryption - Software must handle the case of a TLS record
      that was only partially decrypted by HW. This can happen due to packet
      reordering.
      2. Resynchronization - tls_read_size calls the device driver to
      resynchronize HW after HW lost track of TLS record framing in
      the TCP stream.
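
The partial-decryption case in point 1 can be sketched as a small classification step over the packets that make up one TLS record. This is only an illustrative Python model: the function name, the enum, and the per-packet boolean flags are hypothetical, and the real code operates on skbs rather than lists of flags.

```python
from enum import Enum
from typing import List


class RecordState(Enum):
    DECRYPTED = "all packets decrypted by HW, skip software decryption"
    ENCRYPTED = "no packet decrypted by HW, decrypt in software"
    PARTIAL = "mixed, re-encrypt the decrypted parts, then decrypt in software"


def classify_record(decrypted_flags: List[bool]) -> RecordState:
    """Classify one TLS record from the 'decrypted' mark of each packet."""
    if not decrypted_flags:
        raise ValueError("a record needs at least one packet")
    if all(decrypted_flags):
        return RecordState.DECRYPTED
    if not any(decrypted_flags):
        return RecordState.ENCRYPTED
    # Packet reordering left the record half plaintext, half ciphertext.
    return RecordState.PARTIAL
```

For example, `classify_record([True, False, True])` falls into the partial case, which is the slow path the commit message describes.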
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Fill software context without allocation · b190a587
      Boris Pismenny authored
      This patch allows tls_set_sw_offload to fill the context in case it was
      already allocated previously.
      
      We will use it in TLS_DEVICE to fill the RX software context.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Split tls_sw_release_resources_rx · 39f56e1a
      Boris Pismenny authored
      This patch splits tls_sw_release_resources_rx into two functions: one
      that releases all inner software TLS structures and another that also
      frees the containing structure.
      
      In TLS_DEVICE we will need to release the software structures without
      freeing the containing structure, which contains other information.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Split decrypt_skb to two functions · dafb67f3
      Boris Pismenny authored
      Previously, decrypt_skb also updated the TLS context.
      Now, decrypt_skb only decrypts the payload using the current context,
      while decrypt_skb_update also updates the state.
      
      Later, in the tls_device Rx flow, we will use decrypt_skb directly.
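
The shape of this split can be sketched in Python as a pure decrypt helper plus a wrapper that also advances the context. The names below, the XOR placeholder cipher, and the assumption that the updated state is a record sequence counter are all illustrative; the commit message only says that decrypt_skb_update "also updates the state".

```python
from dataclasses import dataclass


@dataclass
class RxContext:
    key: int = 0x3C   # hypothetical one-byte key
    rec_seq: int = 0  # assumed state: TLS record sequence number


def decrypt_record(ctx: RxContext, ciphertext: bytes) -> bytes:
    """Decrypt with the current context; the context is left untouched."""
    pad = ctx.key ^ (ctx.rec_seq & 0xFF)  # placeholder for a real cipher
    return bytes(b ^ pad for b in ciphertext)


def decrypt_record_update(ctx: RxContext, ciphertext: bytes) -> bytes:
    """Decrypt, then advance the context to the next record."""
    plaintext = decrypt_record(ctx, ciphertext)
    ctx.rec_seq += 1
    return plaintext
```

Keeping the first helper free of side effects is what lets a caller decrypt with the current context without committing to the next record.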
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Refactor tls_offload variable names · d80a1b9d
      Boris Pismenny authored
      For symmetry, we rename tls_offload_context to
      tls_offload_context_tx before we add tls_offload_context_rx.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Don't coalesce decrypted and encrypted SKBs · 41ed9c04
      Boris Pismenny authored
      Prevent coalescing of decrypted and encrypted SKBs in GRO
      and TCP layer.
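
A minimal Python sketch of the rule, assuming a simplified segment type with a sequence number, payload and a decrypted flag (all hypothetical names, not the kernel's skb fields): two neighbours are merged only when they are contiguous and carry the same decrypted mark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    seq: int         # sequence number of the first byte
    data: bytes
    decrypted: bool  # models the skb "decrypted" bit


def can_coalesce(a: Segment, b: Segment) -> bool:
    """Merge only contiguous segments with the same decrypted state."""
    contiguous = a.seq + len(a.data) == b.seq
    return contiguous and a.decrypted == b.decrypted


def coalesce(segments: List[Segment]) -> List[Segment]:
    merged: List[Segment] = []
    for seg in segments:
        if merged and can_coalesce(merged[-1], seg):
            last = merged[-1]
            merged[-1] = Segment(last.seq, last.data + seg.data, last.decrypted)
        else:
            merged.append(seg)
    return merged
```

With this rule, a run of decrypted segments followed by a run of ciphertext segments collapses into two segments rather than one, so plaintext and ciphertext never end up in the same buffer.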
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add TLS rx resync NDO · 16e4edc2
      Boris Pismenny authored
      Add a new netdev TLS op for resynchronizing the HW TLS context.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add TLS RX offload feature · 14136564
      Ilya Lesokhin authored
      This patch adds a netdev feature to configure TLS RX inline crypto offload.
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add decrypted field to skb · 784abe24
      Boris Pismenny authored
      The decrypted bit is propagated to cloned/copied skbs.
      This will be used later by the inline crypto receive side offload
      of tls.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mvpp2-add-debugfs-interface' · cc98419a
      David S. Miller authored
      Maxime Chevallier says:
      
      ====================
      net: mvpp2: add debugfs interface
      
      The PPv2 Header Parser and Classifier are not straightforward to debug.
      Having easy access to the configuration of some of the many lookup tables
      is helpful during development and debugging.
      
      This series adds a basic debugfs interface that allows reading data from
      the Header Parser and some of the Classifier tables.
      
      For now, the interface is read-only, and contains only some basic info.
      
      This was actually used during RSS development, and might be useful to
      troubleshoot some issues we might find.
      
      The first patch of the series converts the mvpp2 files to SPDX, which
      eases adding the new debugfs dedicated file.
      
      The second patch adds the interface, and exposes basic Header Parser data.
      
      The 3rd patch adds a hit counter for the Header Parser TCAM.
      
      The 4th patch exposes classifier info.
      
      The 5th patch adds some hit counters for some of the classifier engines.
      
      Changes since V1:
      - Rebased on the latest net-next
      - Made cls_flow_get non static so that it can be used in mvpp2_debugfs
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: debugfs: add classifier hit counters · f9d30d5b
      Maxime Chevallier authored
      The classification operations that are used for RSS make use of several
      lookup tables. Having hit counters for these tables is really helpful
      to determine what flows were matched by ingress traffic, and see the
      path of packets among all the classifier tables.
      
      This commit adds hit counters for the 3 tables used at the moment:
      
       - The decoding table (also called lookup_id table), that links flows
         identified by the Header Parser to the flow table.
      
         There's one entry per flow, located at:
         .../mvpp2/<controller>/flows/XX/dec_hits
      
         Note that there are 21 flows in the decoding table, whereas there are
         52 flows in the Header Parser. That's because several kinds of
         traffic will match a given flow. Reading the hit counter from
         one sub-flow will clear all hit counters that have the same flow_id.
      
         This also applies to the flow_hits.
      
       - The flow table, that contains all the different lookups to be
         performed by the classifier for each packet of a given flow. The match
         is done on the first entry of the flow sequence.
      
         There's one entry per flow, located at:
         .../mvpp2/<controller>/flows/XX/flow_hits
      
       - The C2 engine entries, that are used to assign the default rx queue,
         and enable or disable RSS for a given port.
      
         There is one C2 entry per port, so the c2 hit counter is located at:
         .../mvpp2/<controller>/ethX/c2_hits
      
      All hit counter values are 16-bit clear-on-read values.
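
The clear-on-read behaviour, and the way sub-flows that share a flow_id also share a counter, can be modelled with a short Python sketch. The class and helper names are hypothetical, and saturating at 0xFFFF instead of wrapping is an assumption; the commit message only states that the counters are 16 bits wide and clear on read.

```python
from typing import Dict


class ClearOnReadCounter:
    """Toy model of a 16-bit hit counter that resets when it is read."""

    MAX_VALUE = 0xFFFF

    def __init__(self) -> None:
        self._value = 0

    def hit(self) -> None:
        # Assumption: the counter saturates rather than wrapping around.
        if self._value < self.MAX_VALUE:
            self._value += 1

    def read(self) -> int:
        value, self._value = self._value, 0
        return value


def read_subflow_hits(counters: Dict[int, ClearOnReadCounter],
                      subflow_to_flow: Dict[int, int],
                      subflow: int) -> int:
    """Read the counter of the flow_id that a sub-flow maps to."""
    return counters[subflow_to_flow[subflow]].read()
```

Because two sub-flows with the same flow_id resolve to the same counter, reading through one of them leaves the other reading zero until new traffic arrives.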
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: debugfs: add entries for classifier flows · dba1d918
      Maxime Chevallier authored
      The classifier configuration for RSS is quite complex, with several
      lookup tables being used. This commit adds useful info in debugfs to
      see how the different tables are configured :
      
      Added 2 new entries in the per-port directory :
      
        - .../eth0/default_rxq : The default rx queue on that port
        - .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry
      
      Added the 'flows' directory :
      
        It contains one entry per sub-flow. A 'sub-flow' is a unique path from
        Header Parser to the flow table. Multiple sub-flows can point to the
        same 'flow' (each flow has an id from 8 to 29, which is its index in the
        Lookup Id table) :
      
        - .../flows/00/...
                   /01/...
                   ...
                   /51/id : The flow id. There are 21 unique flows. There's one
                             flow per combination of the following parameters :
                             - L4 protocol (TCP, UDP, none)
                             - L3 protocol (IPv4, IPv6)
                             - L3 parameters (Fragmented or not)
                             - L2 parameters (Vlan tag presence or not)
                    .../type : The flow type. This is an even higher level flow,
                               that we manipulate with ethtool. It can be :
                               "udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other".
                    .../eth0/...
                    .../eth1/engine : The hash generation engine used for this
                                      flow on the given port
                        .../hash_opts : The hash generation options indicating on
                                        what data we base the hash (vlan tag, src
                                        IP, src port, etc.)
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: debugfs: add hit counter stats for Header Parser entries · 1203341c
      Maxime Chevallier authored
      One useful aid when debugging the Header Parser TCAM filter in PPv2
      is being able to see whether the entries matched anything when a packet
      comes in. This can be done by using the built-in hit counter for TCAM
      entries.
      
      This commit implements reading the counter, and exposing its value on
      debugfs for each filter entry.
      
      The counter is a 16-bit clear-on-read value, located at:
       .../mvpp2/<controller>/parser/XXX/hits
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: add a debugfs interface for the Header Parser · 21da57a2
      Maxime Chevallier authored
      The Marvell PPv2 Packet Header Parser has a TCAM-based filter that is not
      trivial to configure and debug. Being able to dump TCAM entries from
      userspace can be really helpful for developing new features and
      debugging existing ones.
      
      This commit adds a basic debugfs interface for the PPv2 driver, focusing
      on TCAM related features.
      
      <mnt>/mvpp2/ --- f2000000.ethernet
                    \- f4000000.ethernet --- parser --- 000 ...
                                          |          \- 001
                                          |          \- ...
                                          |          \- 255 --- ai
                                          |                  \- header_data
                                          |                  \- lookup_id
                                          |                  \- sram
                                          |                  \- valid
                                          \- eth1 ...
                                          \- eth2 --- mac_filter
                                                   \- parser_entries
                                                   \- vid_filter
      
      There's one directory per PPv2 instance, named after pdev->name to make
      sure names are unique. In each of these directories, there's :
      
       - one directory per interface on the controller, each containing :
      
         - "mac_filter", which lists all filtered addresses for this port
           (based on TCAM, not on the kernel's uc / mc lists)
      
         - "parser_entries", which lists the indices of all valid TCAM
            entries that have this port in their port map
      
         - "vid_filter", which lists the vids allowed on this port, based on
           TCAM
      
       - one "parser" directory (the parser is common to all ports), containing :
      
         - one directory per TCAM entry (256 of them, from 0 to 255), each
           containing :
      
           - "ai" : Contains the 1 byte Additional Info field from TCAM
      
           - "header_data" : Contains the 8 bytes Header Data extracted from
             the packet
      
           - "lookup_id" : Contains the 4 bits LU_ID
      
           - "sram" : Contains the raw SRAM data, which is the result of the TCAM
             lookup. This is read-only at the moment.
      
           - "valid" : Indicates if the entry is valid or not.
      
      All entries are read-only, and everything is output in hex form.
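
To show how the tree above might be consumed from userspace, here is a hedged Python sketch that walks the "parser" directory of one controller and collects the lookup_id of every valid TCAM entry. It assumes that each file holds a single hexadecimal number, which goes beyond what the commit message states ("everything is output in hex form"); the function names are invented for the example, and the controller directory is passed in by the caller rather than guessed.

```python
import os
from typing import Dict


def read_hex_file(path: str) -> int:
    """Read a file assumed to contain one hexadecimal number."""
    with open(path) as f:
        return int(f.read().strip(), 16)


def valid_parser_entries(controller_dir: str) -> Dict[str, int]:
    """Map each valid TCAM entry name (e.g. "000") to its lookup_id."""
    parser_dir = os.path.join(controller_dir, "parser")
    result: Dict[str, int] = {}
    for name in sorted(os.listdir(parser_dir)):
        entry_dir = os.path.join(parser_dir, name)
        if read_hex_file(os.path.join(entry_dir, "valid")):
            result[name] = read_hex_file(os.path.join(entry_dir, "lookup_id"))
    return result
```

A caller would pass the per-controller directory, for instance the `f4000000.ethernet` directory under `<mnt>/mvpp2/`, where `<mnt>` is wherever debugfs is mounted on the system.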
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: switch to SPDX identifiers · f1e37e31
      Antoine Tenart authored
      Use the appropriate SPDX license identifiers and drop the license text.
      This patch is only cosmetic.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Jul, 2018 1 commit
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2aa4a337
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-07-15
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Various arm32 JIT improvements in order to optimize code emission
         and make the JIT code itself more robust, from Russell.
      
      2) Support simultaneous driver and offloaded XDP in order to allow for advanced
         use-cases where some work is offloaded to the NIC and some to the host. Also
         add ability for bpftool to load programs and maps beyond just the cgroup case,
         from Jakub.
      
      3) Add BPF JIT support in nfp for multiplication as well as division. For the
         latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
      
      4) Add BTF pretty print functionality to bpftool in plain and JSON output
         format, from Okash.
      
      5) Add build and installation of the BPF helper man page to bpftool, from Quentin.
      
      6) Add a TCP BPF callback for listening sockets which is triggered right after
         the socket transitions to TCP_LISTEN state, from Andrey.
      
      7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
         tree and prints all attached programs, from Roman.
      
      8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
         packets, from Jesper.
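
Item 3 mentions emulating division with a reciprocal algorithm. As background, the Python sketch below shows the general multiply-and-shift technique for 32-bit unsigned division by an invariant divisor (the Granlund-Montgomery scheme, which is also the shape of the kernel's reciprocal_div helper). Whether the nfp JIT emits exactly these steps is an assumption; the sketch is meant only to make the idea concrete.

```python
from typing import NamedTuple

MASK32 = 0xFFFFFFFF


class Reciprocal(NamedTuple):
    m: int    # 32-bit magic multiplier
    sh1: int  # first shift amount
    sh2: int  # second shift amount


def reciprocal_value(d: int) -> Reciprocal:
    """Precompute the multiplier and shifts for a divisor 1 <= d < 2**32."""
    if not 1 <= d <= MASK32:
        raise ValueError("divisor must fit in 32 bits and be non-zero")
    l = (d - 1).bit_length()  # ceil(log2(d))
    m = ((1 << 32) * ((1 << l) - d)) // d + 1
    return Reciprocal(m & MASK32, min(l, 1), max(l - 1, 0))


def reciprocal_divide(a: int, r: Reciprocal) -> int:
    """Compute a // d for 0 <= a < 2**32 using one multiply and two shifts."""
    t = (a * r.m) >> 32  # high 32 bits of the 64-bit product
    return (t + ((a - t) >> r.sh1)) >> r.sh2
```

The precomputation is done once per divisor, after which each division costs a multiplication, a subtraction, an addition and two shifts, which is why the approach suits hardware without a divide instruction.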
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 14 Jul, 2018 13 commits
    • Merge branch 'bpf-tcp-listen-cb' · 13f7432b
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      This patchset adds TCP-BPF callback for listening sockets.
      
      Patch 0001 provides more details and is the main patch in the set.
      
      Patch 0006 adds selftest for the new callback.
      
      Other patches are bug fixes and improvements in TCP-BPF selftest
      to make it easier to extend in 0006.
      ====================
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB · 78d8e26d
      Andrey Ignatov authored
      Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on
      socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB
      callback can be called on future state transition, and when such a
      transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and
      verify it in user space later.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Better verification in test_tcpbpf · 2044e4ef
      Andrey Ignatov authored
      Reduce amount of copy/paste for debug info when result is verified in
      the test and keep that info together with values being checked so that
      they won't get out of sync.
      
      It also improves debug experience: instead of checking manually what
      doesn't match in debug output for all fields, only unexpected field is
      printed.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers · c65267e5
      Andrey Ignatov authored
      Switch to cgroup_helpers to simplify the code and fix cgroup cleanup:
      previously the cgroup was not cleaned up after the test.
      
      It also removes the SYSTEM macro, which only printed an error but didn't
      terminate the test.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Fix const'ness in cgroup_helpers · 04c13411
      Andrey Ignatov authored
      Lack of const in the cgroup helpers' signatures forces callers to write
      ugly client code. Fix it.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Sync bpf.h to tools/ · 060a7fcc
      Andrey Ignatov authored
      Sync BPF_SOCK_OPS_TCP_LISTEN_CB related UAPI changes to tools/.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB · f333ee0c
      Andrey Ignatov authored
      Add a new TCP-BPF callback that is called on listen(2) right after the
      socket transitions to the TCP_LISTEN state.
      
      It fills the gap for listening sockets in TCP-BPF. For example, a BPF
      program can set BPF_SOCK_OPS_STATE_CB_FLAG when a socket becomes listening
      and track the later transition from TCP_LISTEN to TCP_CLOSE with the
      BPF_SOCK_OPS_STATE_CB callback.
      
      Previously there was no way to do this with TCP-BPF, and other options were
      much harder to work with. E.g. socket state tracking can be done with
      tracepoints (either raw or regular), but they can't be attached to a cgroup
      and their lifetime has to be managed separately.
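
The usage pattern described above can be modelled in plain Python to make the control flow explicit. This is a simulation, not a BPF program: the enums, the `Sock` and `ListenTracker` classes, and the two `kernel_*` helpers are hypothetical stand-ins for the sock_ops callbacks, the BPF_SOCK_OPS_STATE_CB_FLAG bit, and a BPF map.

```python
from enum import Enum, auto
from typing import List, Optional


class Op(Enum):
    TCP_LISTEN_CB = auto()  # models BPF_SOCK_OPS_TCP_LISTEN_CB
    STATE_CB = auto()       # models BPF_SOCK_OPS_STATE_CB


class TcpState(Enum):
    LISTEN = auto()
    CLOSE = auto()


class Sock:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state_cb_enabled = False  # models BPF_SOCK_OPS_STATE_CB_FLAG


class ListenTracker:
    """Stand-in for a sock_ops program plus the map it records into."""

    def __init__(self) -> None:
        self.closed_listeners: List[str] = []

    def sock_ops(self, sk: Sock, op: Op,
                 old: Optional[TcpState] = None,
                 new: Optional[TcpState] = None) -> None:
        if op is Op.TCP_LISTEN_CB:
            # Ask to be called again on later state transitions.
            sk.state_cb_enabled = True
        elif op is Op.STATE_CB:
            if old is TcpState.LISTEN and new is TcpState.CLOSE:
                self.closed_listeners.append(sk.name)


def kernel_listen(prog: ListenTracker, sk: Sock) -> None:
    prog.sock_ops(sk, Op.TCP_LISTEN_CB)


def kernel_set_state(prog: ListenTracker, sk: Sock,
                     old: TcpState, new: TcpState) -> None:
    # The state callback only fires for sockets that opted in.
    if sk.state_cb_enabled:
        prog.sock_ops(sk, Op.STATE_CB, old, new)
```

In this model, only a socket that went through the listen callback is recorded when it later closes, which mirrors the opt-in flow the commit message outlines.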
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'mlxsw-VRRP' · f5c64e56
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Add VRRP support
      
      When a router that is acting as the default gateway of a host stops
      functioning, the host will encounter packet loss until the router starts
      functioning again.
      
      To increase the reliability of the default gateway without performing
      reconfiguration on the host, a host can use a Virtual Router Redundancy
      Protocol (VRRP) Router. This virtual router is composed of several
      routers where only one is actually forwarding packets from the host (the
      master router) while the other routers act as backup routers. The
      election of the master router is determined by the VRRP protocol [1].
      
      Packets addressed to the virtual router are always sent to the virtual
      router MAC address (IPv4: 00-00-5E-00-01-XX, IPv6: 00-00-5E-00-02-XX).
      Such packets can only be accepted by the master router and must be
      discarded by the backup routers.
      
      In Linux, VRRP is usually implemented by configuring a macvlan with the
      virtual router MAC on top of the router interface that is connected to
      the host / LAN. The macvlan on the master router is assigned the virtual
      IP (VIP) that the host uses as its gateway.
      
      In order to support VRRP in mlxsw, we first need to enable macvlan upper
      devices on top of mlxsw netdevs and their uppers. This is done by the
      first patch, which also takes care of sanitizing macvlan configurations
      that are not currently supported by the driver.
      
      The second patch directs packets whose destination MAC address matches
      that of a macvlan to the router so that they will undergo an L3 lookup. This is
      consistent with the kernel's behavior where the macvlan's Rx handler
      will re-inject such packets to the Rx path so that they will be picked
      up by the IPvX protocol handlers and undergo an L3 lookup. Note that the
      driver prevents the macvlans from being enslaved to other devices, to
      ensure the packets will be picked up by the protocol handler and not by
      another Rx handler.
      
      The third patch adds packet traps for VRRP control packets for both IPv4
      and IPv6. Finally, the last patch optimizes the reception of VRRP MACs
      by potentially skipping one L2 lookup for them.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5c64e56
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Optimize processing of VRRP MACs · c3a49540
      Ido Schimmel authored
      Hosts using a VRRP router send their packets with a destination MAC of
      the VRRP router which is of the following form [1]:
      
      IPv4 - 00-00-5E-00-01-{VRID}
      IPv6 - 00-00-5E-00-02-{VRID}
      
      Where VRID is the ID of the virtual router. Such packets are directed to
      the router block in the ASIC by an FDB entry that was added in the
      previous patch.
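
      The address layout above can be sketched in plain C. This is a
      user-space illustration only, not code from the mlxsw driver; the
      names vrrp_family, vrrp_mac_build() and vrrp_mac_parse() are
      hypothetical and exist just for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Fifth octet of the virtual router MAC selects the address family. */
enum vrrp_family { VRRP_IPV4 = 0x01, VRRP_IPV6 = 0x02 };

/* Fixed first four octets shared by both families: 00-00-5E-00. */
static const uint8_t vrrp_mac_prefix[4] = { 0x00, 0x00, 0x5e, 0x00 };

/* Fill mac[] with 00-00-5E-00-{01|02}-{VRID}. */
static void vrrp_mac_build(uint8_t mac[ETH_ALEN], enum vrrp_family fam,
			   uint8_t vrid)
{
	memcpy(mac, vrrp_mac_prefix, sizeof(vrrp_mac_prefix));
	mac[4] = (uint8_t)fam;
	mac[5] = vrid;
}

/* Return true and report family and VRID if mac has the VRRP layout. */
static bool vrrp_mac_parse(const uint8_t mac[ETH_ALEN],
			   enum vrrp_family *fam, uint8_t *vrid)
{
	if (memcmp(mac, vrrp_mac_prefix, sizeof(vrrp_mac_prefix)) != 0)
		return false;
	if (mac[4] != VRRP_IPV4 && mac[4] != VRRP_IPV6)
		return false;
	*fam = (enum vrrp_family)mac[4];
	*vrid = mac[5];
	return true;
}
```

      For example, vrrp_mac_build(mac, VRRP_IPV4, 7) yields
      00-00-5E-00-01-07, and vrrp_mac_parse() on that address reports
      VRRP_IPV4 with VRID 7.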
      
      However, in certain cases it is possible to skip this FDB lookup and
      send such packets directly to the router. This is accomplished by adding
      these special MAC addresses to the RIF cache. If the cache is hit, the
      packet will skip the L2 lookup and ingress the router with the RIF
      specified in the cache entry.
      
      1. https://tools.ietf.org/html/rfc5798#section-7.3
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3a49540
    • Ido Schimmel's avatar
      mlxsw: spectrum: Add VRRP traps · 11566d34
      Ido Schimmel authored
      Virtual Router Redundancy Protocol packets are used to communicate the
      state of the Master router associated with the virtual router ID (VRID).
      
      These are link-local multicast packets sent with IP protocol 112 that
      are trapped in the router block in the ASIC.
      
      Add a trap for these packets and mark the trapped packets to prevent
      them from potentially being re-flooded by the bridge driver.
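
      As a rough illustration of what such a trap matches, the sketch
      below classifies a raw IPv4 header in user space. It is not the
      ASIC trap configuration or driver code. The function name is
      hypothetical, and the 224.0.0.18 destination is taken from RFC 5798
      rather than from this commit message; only protocol number 112 is
      stated above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VRRP_IP_PROTO 112	/* IP protocol number used by VRRP */

/*
 * Return true if the buffer starts with an IPv4 header that carries
 * VRRP (protocol 112) and is addressed to 224.0.0.18 (assumption:
 * the IPv4 VRRP multicast group from RFC 5798).
 */
static bool ipv4_is_vrrp_advert(const uint8_t *hdr, size_t len)
{
	if (len < 20)			/* minimum IPv4 header size */
		return false;
	if ((hdr[0] >> 4) != 4)		/* version field */
		return false;
	if (hdr[9] != VRRP_IP_PROTO)	/* protocol field */
		return false;
	/* destination address occupies bytes 16..19 */
	return hdr[16] == 224 && hdr[17] == 0 &&
	       hdr[18] == 0 && hdr[19] == 18;
}
```

      A TCP packet (protocol 6) sent to the same group, or a buffer
      shorter than 20 bytes, is rejected by this check.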
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11566d34
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Direct macvlans' MACs to router · 2db99378
      Ido Schimmel authored
      An IP packet received on a netdev with a macvlan upper whose MAC matches
      the packet's destination MAC will be re-injected to the Rx path as if it
      were received by the macvlan, and will then undergo an L3 lookup.
      
      Reflect this functionality to the ASIC by programming FDB entries that
      will direct MACs of macvlan uppers to the router.
      
      In a similar fashion to router interfaces (RIFs) that are programmed
      upon the addition of the first IP address on an interface and destroyed
      upon the removal of the last IP address, the FDB entries for the macvlan
      are added and destroyed based on the addition of the first and removal
      of the last IP address on the macvlan.
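
      The first-address / last-address lifecycle described above can be
      modelled with a simple counter. This is only a sketch under the
      assumption that address additions and removals arrive as discrete
      events; the struct and function names are hypothetical and the
      driver may track this state differently.

```c
#include <stdbool.h>

/* Hypothetical per-macvlan state: address count and FDB status. */
struct macvlan_fdb_model {
	unsigned int nr_addrs;
	bool fdb_programmed;
};

/* Called when an IP address is added on the macvlan. */
static void macvlan_addr_added(struct macvlan_fdb_model *m)
{
	/* The first address triggers programming of the FDB entry. */
	if (m->nr_addrs++ == 0)
		m->fdb_programmed = true;
}

/* Called when an IP address is removed from the macvlan. */
static void macvlan_addr_removed(struct macvlan_fdb_model *m)
{
	if (m->nr_addrs == 0)	/* nothing to remove, avoid underflow */
		return;
	/* Removing the last address destroys the FDB entry. */
	if (--m->nr_addrs == 0)
		m->fdb_programmed = false;
}
```

      With two addresses configured, removing one leaves the entry in
      place; only the removal of the second one clears it.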
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2db99378
    • Ido Schimmel's avatar
      mlxsw: spectrum: Enable macvlan upper devices · c5516185
      Ido Schimmel authored
      In order to allow more unicast MAC addresses (e.g., VRRP virtual MAC) to
      be directed to the router we need to enable macvlan uppers on top of
      mlxsw netdevs.
      
      Allow macvlan upper devices on top of mlxsw netdevs and sanitize
      configurations that can't work. For example, a macvlan can't be enslaved
      to a bridge as without ACLs the device doesn't take the destination MAC
      into account when classifying a packet to a bridge instance (i.e., a
      FID).
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5516185
    • Yafang Shao's avatar
      tcp: remove redundant rcv_nxt update · ff0432e5
      Yafang Shao authored
      tcp_rcv_nxt_update() is already executed in tcp_data_queue().
      This line is redundant.
      
      See below:
      	tcp_queue_rcv
      		tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
      	tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq); <<<< redundant
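
      Why the second call has no effect can be seen in a simplified
      user-space model. This is a sketch that assumes the helper advances
      rcv_nxt and accounts the delta in a byte counter; the struct and
      function names are hypothetical stand-ins, not the kernel
      definitions.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the receive-side socket state. */
struct tcp_sock_model {
	uint32_t rcv_nxt;		/* next expected sequence number */
	uint64_t bytes_received;	/* running byte counter */
};

/* Advance rcv_nxt to seq and account for the bytes in between. */
static void rcv_nxt_update_model(struct tcp_sock_model *tp, uint32_t seq)
{
	uint32_t delta = seq - tp->rcv_nxt;	/* wraps modulo 2^32 */

	tp->bytes_received += delta;
	tp->rcv_nxt = seq;
}
```

      Calling rcv_nxt_update_model() twice with the same end_seq yields
      a delta of zero on the second call, so neither field changes.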
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff0432e5