04 Jul, 2018 (40 commits)
    • net/sched: Make etf report drops on error_queue · 4b15c707
      Jesus Sanchez-Palencia authored
      Use the socket error queue for reporting dropped packets if the
      socket has enabled that feature through the SO_TXTIME API.
      
      Packets are dropped either on enqueue() if they aren't accepted by the
      qdisc or on dequeue() if the system misses their deadline. Those are
      reported as different errors so applications can react accordingly.
      
      Userspace can retrieve the errors through the socket error queue and the
      corresponding cmsg interfaces. A struct sock_extended_err* is used for
      returning the error data, and the packet's timestamp can be recovered by
      combining the ee_data and ee_info fields, e.g.:
      
          ((__u64) serr->ee_data << 32) + serr->ee_info
      
      This feature is disabled by default and must be explicitly enabled by
      applications. Enabling it can bring some overhead for the Tx cycles
      of the application.
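      
      As a reference, here is a minimal userspace sketch of draining these
      reports, assuming error reporting was requested through the SO_TXTIME
      flags (the helper name and buffer size below are illustrative):
      
      #include <stdio.h>
      #include <sys/socket.h>
      #include <linux/errqueue.h>
      
      /* Sketch only: drain SO_TXTIME drop reports from the error queue. */
      static void drain_txtime_errors(int fd)
      {
              char control[256];
              struct msghdr msg = {};
              struct cmsghdr *cm;
      
              for (;;) {
                      msg.msg_control = control;
                      msg.msg_controllen = sizeof(control);
                      if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
                              break;
      
                      for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                              struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
                              __u64 txtime;
      
                              if (serr->ee_origin != SO_EE_ORIGIN_TXTIME)
                                      continue;
      
                              /* Recover the packet's txtime as described above. */
                              txtime = ((__u64)serr->ee_data << 32) + serr->ee_info;
      
                              if (serr->ee_code == SO_EE_CODE_TXTIME_INVALID_PARAM)
                                      fprintf(stderr, "txtime %llu: rejected on enqueue\n",
                                              (unsigned long long)txtime);
                              else if (serr->ee_code == SO_EE_CODE_TXTIME_MISSED)
                                      fprintf(stderr, "txtime %llu: deadline missed\n",
                                              (unsigned long long)txtime);
                      }
              }
      }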
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b15c707
    • igb: Add support for ETF offload · 3048cf84
      Jesus Sanchez-Palencia authored
      Implement HW offload support for SO_TXTIME through igb's Launchtime
      feature. This is done by extending igb_setup_tc() so it supports
      TC_SETUP_QDISC_ETF and configuring i210 so time based transmit
      arbitration is enabled.
      
      The FQTSS transmission mode added before is extended so strict
      priority (SP) queues wait for stream reservation (SR) ones.
      igb_config_tx_modes() is extended so it can support enabling/disabling
      Launchtime following the previous approach used for the credit-based
      shaper (CBS).
      
      As with the previous flow, FQTSS transmission mode is enabled automatically
      by the driver once Launchtime (or CBS, as before) is enabled.
      Similarly, it's automatically disabled when the feature is disabled
      for the last queue that had it set up.
      
      The driver just consumes the transmit times from the skbuffs directly,
      so no special handling is done in case an 'invalid' time is provided.
      We assume this has been handled by the ETF qdisc already.
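      
      As a rough, hedged sketch of the dispatch this describes (the helper name
      and internals below are illustrative assumptions, not the actual igb code):
      
      static int example_offload_txtime(void *priv, struct tc_etf_qopt_offload *qopt);
      
      /* Sketch: ndo_setup_tc() dispatch for an ETF offload request.
       * example_offload_txtime() stands in for the per-queue Launchtime
       * enable/disable plus the FQTSS transmission mode bookkeeping. */
      static int example_setup_tc(struct net_device *dev,
                                  enum tc_setup_type type, void *type_data)
      {
              switch (type) {
              case TC_SETUP_QDISC_ETF: {
                      struct tc_etf_qopt_offload *qopt = type_data;
      
                      /* qopt->queue selects the Tx queue; qopt->enable says
                       * whether Launchtime is being turned on or off there. */
                      return example_offload_txtime(netdev_priv(dev), qopt);
              }
              default:
                      return -EOPNOTSUPP;
              }
      }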
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3048cf84
    • igb: Only call skb_tx_timestamp after descriptors are ready · 1b9231e7
      Jesus Sanchez-Palencia authored
      Currently, skb_tx_timestamp() is being called before the Tx
      descriptors are prepared in igb_xmit_frame_ring(), which happens
      during either the igb_tso() or igb_tx_csum() calls.
      
      Given that now the skb->tstamp might be used to carry the timestamp
      for SO_TXTIME, we must only call skb_tx_timestamp() after the
      information has been copied into the Tx descriptors.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b9231e7
    • igb: Refactor igb_offload_cbs() · 8080e6ab
      Jesus Sanchez-Palencia authored
      Split the code into a separate function (igb_offload_apply()) that will be
      used by the ETF offload implementation.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8080e6ab
    • igb: Only change Tx arbitration when CBS is on · 0364a0d0
      Jesus Sanchez-Palencia authored
      Currently the data transmission arbitration algorithm - DataTranARB
      field on TQAVCTRL reg - is always set to CBS when the Tx mode is
      changed from legacy to 'Qav' mode.
      
      Make that configuration a bit more granular in preparation for the
      upcoming Launchtime enabling patches, since CBS and Launchtime can be
      enabled separately. That is achieved by moving the DataTranARB setup
      to igb_config_tx_modes() instead.
      
      Similarly, when disabling CBS we must check if it has been disabled
      for all queues, and clear the DataTranARB accordingly.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0364a0d0
    • igb: Refactor igb_configure_cbs() · 91db3642
      Jesus Sanchez-Palencia authored
      Make this function retrieve what it needs from the Tx ring being
      addressed, since it already relies on what was previously saved in it.
      Also, since this function will be used by the upcoming Launchtime
      patches rename it to better reflect its intention. Note that
      Launchtime is not part of what 802.1Qav specifies, but the i210
      datasheet refers to this set of functionality as "Qav Transmission
      Mode".
      
      Here we also perform a small refactor of is_any_cbs_enabled(), and add
      further documentation to igb_setup_tx_mode().
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91db3642
    • net/sched: Add HW offloading capability to ETF · 88cab771
      Jesus Sanchez-Palencia authored
      Add infrastructure so the etf qdisc supports HW offload of time-based
      transmission.
      
      For HW offload, the time-sorted list is still used, so packets are
      always dequeued in order of txtime.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
      	   clockid CLOCK_REALTIME
      
      In this example, the Qdisc will use HW offload for the control of the
      transmission time through the network adapter. The hrtimer used for
      packet scheduling inside the qdisc will use the clockid CLOCK_REALTIME
      as its reference, and packets leave the Qdisc "delta" (100000) nanoseconds
      before their transmission time. Because this uses HW offload and
      dynamic clocks are not supported by the hrtimer, the system clock
      and the PHC clock must be synchronized for this mode to behave as
      expected.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88cab771
    • net/sched: Introduce the ETF Qdisc · 25db26a9
      Vinicius Costa Gomes authored
      The ETF (Earliest TxTime First) qdisc uses the information added
      earlier in this series (the socket option SO_TXTIME and the new
      role of sk_buff->tstamp) to schedule packet transmission based
      on absolute time.
      
      For some workloads, just bandwidth enforcement is not enough, and
      precise control of the transmission of packets is necessary.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
                 clockid CLOCK_TAI
      
      In this example, the Qdisc will provide SW best-effort control of the
      transmission time to the network adapter. The timestamp in the
      socket will be in reference to the clockid CLOCK_TAI, and packets
      will leave the qdisc "delta" (100000) nanoseconds before their transmission
      time.
      
      The ETF qdisc will buffer packets sorted by their txtime. It will drop
      packets on enqueue() if their skbuff clockid does not match the clock
      reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
      if it expires while being enqueued.
      
      The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
      will dequeue a packet as soon as possible and change the skb timestamp
      to 'now' during etf_dequeue().
      
      Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
      to be configured, but it's been decided that usage of CLOCK_TAI should
      be enforced until we decide to allow for other clockids to be used.
      The rationale here is that PTP times are usually in the TAI scale, thus
      no other clocks should be necessary. For now, the qdisc will return
      EINVAL if any clocks other than CLOCK_TAI are used.
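      
      For reference, a hedged sketch of how an application would stamp a packet
      with a CLOCK_TAI txtime for this qdisc (SO_TXTIME must already be enabled
      on the socket with clockid CLOCK_TAI; the 500 us offset and helper name
      are arbitrary):
      
      #include <string.h>
      #include <time.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <linux/net_tstamp.h>
      
      /* Sketch only: send one packet scheduled ~500us in the future. */
      static ssize_t send_at(int fd, const void *buf, size_t len,
                             const struct sockaddr *addr, socklen_t alen)
      {
              char control[CMSG_SPACE(sizeof(__u64))] = {};
              struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
              struct msghdr msg = {
                      .msg_name = (void *)addr,
                      .msg_namelen = alen,
                      .msg_iov = &iov,
                      .msg_iovlen = 1,
                      .msg_control = control,
                      .msg_controllen = sizeof(control),
              };
              struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
              struct timespec now;
              __u64 txtime;
      
              clock_gettime(CLOCK_TAI, &now);
              txtime = now.tv_sec * 1000000000ULL + now.tv_nsec + 500000;
      
              /* The txtime travels in an SCM_TXTIME cmsg, in nanoseconds. */
              cm->cmsg_level = SOL_SOCKET;
              cm->cmsg_type = SCM_TXTIME;
              cm->cmsg_len = CMSG_LEN(sizeof(txtime));
              memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));
      
              return sendmsg(fd, &msg, 0);
      }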
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25db26a9
    • net/sched: Allow creating a Qdisc watchdog with other clocks · 860b642b
      Vinicius Costa Gomes authored
      This adds qdisc_watchdog_init_clockid(), which allows a clockid to be
      passed; this lets other time references be used when scheduling the
      Qdisc to run.
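      
      A hedged sketch of how a Qdisc would use the new helper (the surrounding
      struct and function are illustrative, not taken from an actual qdisc):
      
      /* Illustrative private data for a qdisc that wants a non-default clock. */
      struct example_sched_data {
              struct qdisc_watchdog watchdog;
      };
      
      /* Called from the qdisc's init path (sketch only). */
      static void example_setup_watchdog(struct Qdisc *sch)
      {
              struct example_sched_data *q = qdisc_priv(sch);
      
              /* Same role as qdisc_watchdog_init(), but the hrtimer that
               * reschedules the qdisc now runs on CLOCK_TAI instead of the
               * default clock. */
              qdisc_watchdog_init_clockid(&q->watchdog, sch, CLOCK_TAI);
      }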
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      860b642b
    • net: packet: Hook into time based transmission. · 3d0ba8c0
      Richard Cochran authored
      For raw layer-2 packets, copy the desired future transmit time from
      the CMSG cookie into the skb.
      Signed-off-by: Richard Cochran <rcochran@linutronix.de>
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d0ba8c0
    • net: ipv6: Hook into time based transmission · a818f75e
      Jesus Sanchez-Palencia authored
      Add a struct sockcm_cookie parameter to ip6_setup_cork() so
      we can easily re-use the transmit_time field from struct inet_cork
      for most paths, by copying the timestamp from the CMSG cookie.
      This is later copied into the skb during __ip6_make_skb().
      
      For the raw fast path, also pass the sockcm_cookie as a parameter
      so we can just perform the copy at rawv6_send_hdrinc() directly.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a818f75e
    • net: ipv4: Hook into time based transmission · bc969a97
      Jesus Sanchez-Palencia authored
      Add a transmit_time field to struct inet_cork, then copy the
      timestamp from the CMSG cookie at ip_setup_cork() so we can
      safely copy it into the skb later during __ip_make_skb().
      
      For the raw fast path, just perform the copy at raw_send_hdrinc().
      Signed-off-by: Richard Cochran <rcochran@linutronix.de>
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc969a97
    • net: Add a new socket option for a future transmit time. · 80b14dee
      Richard Cochran authored
      This patch introduces SO_TXTIME. User space enables this option in
      order to pass a desired future transmit time in a CMSG when calling
      sendmsg(2). The argument to this socket option is an 8-byte struct
      provided by the uapi header net_tstamp.h, defined as:
      
      struct sock_txtime {
      	clockid_t 	clockid;
      	u32		flags;
      };
      
      Note that the new fields were added to struct sock by filling a 2-byte
      hole found in the struct. For that reason, neither the struct size nor
      the number of cachelines was altered.
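      
      A minimal usage sketch from userspace (the flag shown is optional and
      enables the error-queue reporting described earlier in this series; names
      come from the updated net_tstamp.h, and availability of SO_TXTIME in libc
      headers is assumed):
      
      #include <time.h>
      #include <sys/socket.h>
      #include <linux/net_tstamp.h>
      
      /* Sketch only: enable SO_TXTIME so that txtimes passed later via the
       * SCM_TXTIME cmsg are interpreted against CLOCK_TAI. */
      static int enable_txtime(int fd)
      {
              struct sock_txtime so_txtime = {
                      .clockid = CLOCK_TAI,
                      .flags = SOF_TXTIME_REPORT_ERRORS,
              };
      
              return setsockopt(fd, SOL_SOCKET, SO_TXTIME,
                                &so_txtime, sizeof(so_txtime));
      }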
      Signed-off-by: Richard Cochran <rcochran@linutronix.de>
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80b14dee
    • net: Clear skb->tstamp only on the forwarding path · c47d8c2f
      Jesus Sanchez-Palencia authored
      This is done in preparation for the upcoming time based transmission
      patchset. Now that skb->tstamp will be used to hold the packet's txtime,
      we must ensure that it is cleared when traversing namespaces.
      Also, doing that from skb_scrub_packet() before the early return would
      break our feature when tunnels are used.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c47d8c2f
    • isdn: mark expected switch fall-throughs · d287c502
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      Warning level 2 was used: -Wimplicit-fallthrough=2
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d287c502
    • net: usb: asix: allow optionally getting mac address from device tree · 03fc5d4f
      Marcel Ziswiler authored
      For embedded use, where e.g. AX88772B chips may be used without external
      EEPROMs, the boot loader may choose to pass the MAC address to be used
      via device tree. Therefore, allow optionally getting the MAC
      address from device tree data, e.g. as follows (excerpt from a T30 based
      board; local-mac-address to be filled in by the boot loader):
      
      /* EHCI instance 1: USB2_DP/N -> AX88772B */
      usb@7d004000 {
      	status = "okay";
      	#address-cells = <1>;
      	#size-cells = <0>;
      	asix@1 {
      		reg = <1>;
      		local-mac-address = [00 00 00 00 00 00];
      	};
      };
      Signed-off-by: Marcel Ziswiler <marcel.ziswiler@toradex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03fc5d4f
    • net: sched: act_pedit: fix possible memory leak in tcf_pedit_init() · 30e99ed6
      Wei Yongjun authored
      'keys_ex' is allocated by tcf_pedit_keys_ex_parse() in tcf_pedit_init(),
      but not all of the error handling paths free it, which may cause a memory
      leak. This patch fixes it.
      
      Fixes: 71d0ed70 ("net/act_pedit: Support using offset relative to the conventional network headers")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30e99ed6
    • Merge branch 'bridge-iproute2-isolated-port-and-selftests' · 7184e7e7
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      bridge: iproute2 isolated port and selftests
      
      Add support to iproute2 for port isolation config and selftests for it.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7184e7e7
    • selftests: forwarding: test for bridge port isolation · a14e9faf
      Nikolay Aleksandrov authored
      This test checks if the bridge port isolation feature works as expected
      by performing ping/ping6 tests between hosts that are isolated (should
      not work) and between isolated and non-isolated hosts (should work).
      The same test is performed for flooding from and to isolated and
      non-isolated ports.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a14e9faf
    • selftests: forwarding: lib: extract ping and ping6 so they can be reused · 967450c5
      Nikolay Aleksandrov authored
      Extract the ping and ping6 command execution so the return value can be
      checked by the caller; this is needed for the port isolation tests, which
      are intended to fail.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      967450c5
    • Merge branch 'vhost_net-Avoid-vq-kicks-during-busyloop' · f744c4bb
      David S. Miller authored
      Toshiaki Makita says:
      
      ====================
      vhost_net: Avoid vq kicks during busyloop
      
      Under heavy load vhost tx busypoll tends not to suppress vq kicks, which
      causes poor guest tx performance. The detailed scenario is described in
      the commit log of patch 2.
      Rx does not seem to have that serious a problem, but for consistency I made a
      similar change on rx to avoid rx wakeups (patch 3).
      Additionally, patch 4 avoids rx kicks under heavy load during
      busypoll.
      
      Tx performance is greatly improved by this change. I don't see notable
      performance change on rx with this series though.
      
      Performance numbers (tx):
      
      - Bulk transfer from guest to external physical server.
          [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]
      - Set 10us busypoll.
      - Guest disables checksum and TSO because of host XDP.
      - Measured single flow Mbps by netperf, and kicks by perf kvm stat
        (EPT_MISCONFIG event).
      
                                  Before              After
                                Mbps  kicks/s      Mbps  kicks/s
      UDP_STREAM 1472byte              247758                 27
                      Send   3645.37            6958.10
                      Recv   3588.56            6958.10
                    1byte                9865                 37
                      Send      4.34               5.43
                      Recv      4.17               5.26
      TCP_STREAM             8801.03    45794   9592.77     2884
      
      v2:
      - Split patches into 3 parts (renaming variables, tx-kick fix, rx-wakeup
        fix).
      - Avoid rx-kicks too (patch 4).
      - Don't memorize endtime as it is not needed for now.
      ====================
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f744c4bb
    • vhost_net: Avoid rx vring kicks during busyloop · 6369fec5
      Toshiaki Makita authored
      We may run out of available rx ring descriptors under heavy load, but
      busypoll does not detect it, so busypoll may exit prematurely. Avoid this
      by checking for rx ring full during busypoll.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6369fec5
    • vhost_net: Avoid rx queue wake-ups during busypoll · be294a51
      Toshiaki Makita authored
      We may run handle_rx() while rx work is queued. For example, a packet can
      push the rx work during the window before handle_rx() calls
      vhost_net_disable_vq().
      In that case busypoll immediately exits due to the vhost_has_work()
      condition and enables the vq again. This can lead to more unnecessary rx
      wake-ups, so poll the rx work instead of enabling the vq.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be294a51
    • vhost_net: Avoid tx vring kicks during busyloop · 027b1760
      Toshiaki Makita authored
      Under heavy load vhost busypoll may run without suppressing
      notification. For example, the tx zerocopy callback can push tx work while
      handle_tx() is running; busyloop then exits due to the vhost_has_work()
      condition and enables notification, but immediately reenters handle_tx()
      because the pushed work was tx. In this case handle_tx() tries to
      disable notification again, but when using event_idx it by design
      cannot. Then busyloop will run without suppressing notification.
      Another example is the case where handle_tx() tries to enable
      notification but the avail idx has advanced, so it disables it again.
      This case also leads to the same situation with event_idx.
      
      The problem is that once we enter this situation, busyloop does not work
      under heavy load for a considerable amount of time, because notification
      is likely to happen during busyloop and handle_tx() immediately enables
      notification after the notification happens. Specifically, busyloop detects
      the notification via vhost_has_work() and then handle_tx() calls
      vhost_enable_notify(). Because the detected work was the tx work, it
      enters handle_tx(), and enters busyloop without suppression again.
      This is likely to be repeated, so with event_idx we are almost never able
      to suppress notification in this case.
      
      To fix this, poll the work instead of enabling notification when
      busypoll is interrupted by something. IMHO vhost_has_work() is more of an
      interruption than a signal to completely cancel the busypoll, so
      let's run busypoll after the necessary work is done.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      027b1760
    • vhost_net: Rename local variables in vhost_net_rx_peek_head_len · 28b9b33b
      Toshiaki Makita authored
      So we can easily see which variable is for which, tx or rx.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28b9b33b
    • net:sched: add action inheritdsfield to skbedit · e7e3728b
      Qiaobin Fu authored
      The new action inheritdsfield copies the DS field of
      IPv4 and IPv6 packets into skb->priority. This enables
      later classification of packets based on the DS field.
      
      v5:
      *Update the drop counter for TC_ACT_SHOT
      
      v4:
      *Do not allow setting flags other than the expected ones.
      
      *Allow dumping the pure flags.
      
      v3:
      *Use optional flags, so that it won't break old versions of tc.
      
      *Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.
      
      v2:
      *Fix the style issue
      
      *Move the code from skbmod to skbedit
      
      Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
      Reviewed-by: Michel Machado <michel@digirati.com.br>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7e3728b
    • Merge branch 'More-mirror-to-gretap-tests-with-bridge-in-UL' · f145b0a7
      David S. Miller authored
      Petr Machata says:
      
      ====================
      More mirror-to-gretap tests with bridge in UL
      
      This patchset adds two more tests where the mirror-to-gretap has a
      bridge in underlay packet path, without a VLAN above or below that
      bridge.
      
      In patch #1, a non-VLAN-filtering bridge is tested.
      
      In patch #2, a VLAN-filtering bridge is tested.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f145b0a7
    • selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q · 239e754a
      Petr Machata authored
      Test for "tc action mirred egress mirror" that mirrors to gretap when
      the underlay route points at a VLAN-aware bridge (802.1q).
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      239e754a
    • selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d · 35c31d5c
      Petr Machata authored
      Test for "tc action mirred egress mirror" that mirrors to gretap when
      the underlay route points at a VLAN-unaware bridge (802.1d).
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35c31d5c
    • Merge branch 'Handle-multiple-received-packets-at-each-stage' · 2d1b1385
      David S. Miller authored
      Edward Cree says:
      
      ====================
      Handle multiple received packets at each stage
      
      This patch series adds the capability for the network stack to receive a
       list of packets and process them as a unit, rather than handling each
       packet singly in sequence.  This is done by factoring out the existing
       datapath code at each layer and wrapping it in list handling code.
      
      The motivation for this change is twofold:
      * Instruction cache locality.  Currently, running the entire network
        stack receive path on a packet involves more code than will fit in the
        lowest-level icache, meaning that when the next packet is handled, the
        code has to be reloaded from more distant caches.  By handling packets
        in "row-major order", we ensure that the code at each layer is hot for
        most of the list.  (There is a corresponding downside in _data_ cache
        locality, since we are now touching every packet at every layer, but in
        practice there is easily enough room in dcache to hold one cacheline of
        each of the 64 packets in a NAPI poll.)
      * Reduction of indirect calls.  Owing to Spectre mitigations, indirect
        function calls are now more expensive than ever; they are also heavily
        used in the network stack's architecture (see [1]).  By replacing 64
        indirect calls to the next-layer per-packet function with a single
        indirect call to the next-layer list function, we can save CPU cycles.
      
      Drivers pass an SKB list to the stack at the end of the NAPI poll; this
       gives a natural batch size (the NAPI poll weight) and avoids waiting at
       the software level for further packets to make a larger batch (which
       would add latency).  It also means that the batch size is automatically
       tuned by the existing interrupt moderation mechanism.
      The stack then runs each layer of processing over all the packets in the
       list before proceeding to the next layer.  Where the 'next layer' (or
       the context in which it must run) differs among the packets, the stack
       splits the list; this 'late demux' means that packets which differ only
       in later headers (e.g. same L2/L3 but different L4) can traverse the
       early part of the stack together.
      Also, where the next layer is not (yet) list-aware, the stack can revert
       to calling the rest of the stack in a loop; this allows gradual/creeping
       listification, with no 'flag day' patch needed to listify everything.
      
      Patches 1-2 simply place received packets on a list during the event
       processing loop on the sfc EF10 architecture, then call the normal stack
       for each packet singly at the end of the NAPI poll.  (Analogues of patch
       #2 for other NIC drivers should be fairly straightforward.)
      Patches 3-9 extend the list processing as far as the IP receive handler.
      
      Patches 1-2 alone give about a 10% improvement in packet rate in the
       baseline test; adding patches 3-9 raises this to around 25%.
      
      Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
       packets and a single core to handle interrupts on the RX side; this was
       in order to measure as simply as possible the packet rate handled by a
       single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
       setup was tuned for maximum reproducibility, rather than raw performance.
       Full details and more results (both with and without retpolines) from a
       previous version of the patch series are presented in [2].
      
      The baseline test uses four streams, and multiple RXQs all bound to a
       single CPU (the netperf binary is bound to a neighbouring CPU).  These
       tests were run with retpolines.
      net-next: 6.91 Mb/s (datum)
       after 9: 8.46 Mb/s (+22.5%)
      Note however that these results are not robust; changes in the parameters
       of the test sometimes shrink the gain to single-digit percentages.  For
       instance, when using only a single RXQ, only a 4% gain was seen.
      
      One test variation was the use of software filtering/firewall rules.
       Adding a single iptables rule (UDP port drop on a port range not matching
       the test traffic), thus making the netfilter hook have work to do,
       reduced baseline performance but showed a similar gain from the patches:
      net-next: 5.02 Mb/s (datum)
       after 9: 6.78 Mb/s (+35.1%)
      
      Similarly, testing with a set of TC flower filters (kindly supplied by
       Cong Wang) gave the following:
      net-next: 6.83 Mb/s (datum)
       after 9: 8.86 Mb/s (+29.7%)
      
      These data suggest that the batching approach remains effective in the
       presence of software switching rules, and perhaps even improves the
       performance of those rules by allowing them and their codepaths to stay
       in cache between packets.
      
      Changes from v3:
      * Fixed build error when CONFIG_NETFILTER=n (thanks kbuild).
      
      Changes from v2:
      * Used standard list handling (and skb->list) instead of the skb-queue
        functions (that use skb->next, skb->prev).
        - As part of this, changed from a "dequeue, process, enqueue" model to
          using list_for_each_safe, list_del, and (new) list_cut_before.
      * Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
        Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
      * Removed patches to Generic XDP, since they were producing no benefit.
        I may revisit them later.
      * Removed RFC tags.
      
      Changes from v1:
      * Rebased across 2 years' net-next movement (surprisingly straightforward).
        - Added Generic XDP handling to netif_receive_skb_list_internal()
        - Dealt with changes to PFMEMALLOC setting APIs
      * General cleanup of code and comments.
      * Skipped function calls for empty lists at various points in the stack
        (patch #9).
      * Added listified Generic XDP handling (patches 10-12), though it doesn't
        seem to help (see above).
      * Extended testing to cover software firewalls / netfilter etc.
      
      [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
      [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d1b1385
    • net: don't bother calling list RX functions on empty lists · b9f463d6
      Edward Cree authored
      Generally the check should be very cheap, as the sk_buff_head is in cache.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9f463d6
    • net: ipv4: listify ip_rcv_finish · 5fa12739
      Edward Cree authored
      ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
       demux or route lookup.  The last step, calling dst_input(skb), is left to
       the caller; in the listified case, we split to form sublists with a common
       dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
      The next step in listification would thus be to add a list_input() method
       to struct dst_entry.
      
      Early demux is an indirect call based on iph->protocol; this is another
       opportunity for listification which is not taken here (it would require
       slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fa12739
    • net: ipv4: listified version of ip_rcv · 17266ee9
      Edward Cree authored
      Also involved adding a way to run a netfilter hook over a list of packets.
       Rather than attempting to make netfilter know about lists (which would be
       a major project in itself) we just let it call the regular okfn (in this
       case ip_rcv_finish()) for any packets it steals, and have it give us back
       a list of packets it's synchronously accepted (which normally NF_HOOK
       would automatically call okfn() on, but we want to be able to potentially
       pass the list to a listified version of okfn().)
      The netfilter hooks themselves are indirect calls that still happen per-
       packet (see nf_hook_entry_hookfn()), but again, changing that can be left
       for future work.
      
      There is potential for out-of-order receives if the netfilter hook ends up
       synchronously stealing packets, as they will be processed before any
       accepts earlier in the list.  However, it was already possible for an
       asynchronous accept to cause out-of-order receives, so presumably this is
       considered OK.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17266ee9
    • net: core: propagate SKB lists through packet_type lookup · 88eb1944
      Edward Cree authored
      __netif_receive_skb_core() does a depressingly large amount of per-packet
       work that can't easily be listified, because the another_round loop
       makes it nontrivial to slice it up into smaller functions.
      Fortunately, most of that work disappears in the fast path:
       * Hardware devices generally don't have an rx_handler
       * Unless you're tcpdumping or something, there is usually only one ptype
       * VLAN processing comes before the protocol ptype lookup, so doesn't force
         a pt_prev deliver
       so normally, __netif_receive_skb_core() will run straight through and pass
       back the one ptype found in ptype_base[hash of skb->protocol].
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88eb1944
    • net: core: another layer of lists, around PF_MEMALLOC skb handling · 4ce0017a
      Edward Cree authored
      First example of a layer splitting the list (rather than merely taking
       individual packets off it).
      Involves a new list.h function, list_cut_before(), which is like
       list_cut_position() but cuts on the other side of the given entry.
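      
      A hedged sketch of the new helper's semantics (the surrounding function
      and helpers are illustrative, not the actual netif_receive code):
      
      static void example_deliver_batch(struct list_head *batch);
      
      /* Sketch: list_cut_before() moves everything on "head" that precedes
       * "skb" onto a new sublist; "skb" itself becomes the first entry left
       * on "head". Useful for peeling off a leading run of packets that can
       * be handled together (e.g. the non-PF_MEMALLOC ones). */
      static void example_split_before(struct list_head *head, struct sk_buff *skb)
      {
              LIST_HEAD(sublist);
      
              list_cut_before(&sublist, head, &skb->list);
              example_deliver_batch(&sublist);
      }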
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ce0017a
    • net: core: Another step of skb receive list processing · 7da517a3
      Edward Cree authored
      netif_receive_skb_list_internal() now processes a list and hands it
       on to the next function.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7da517a3
    • sfc: batch up RX delivery · e090bfb9
      Edward Cree authored
      Improves packet rate of 1-byte UDP receives by up to 10%.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e090bfb9
    • net: core: trivial netif_receive_skb_list() entry point · f6ad8c1b
      Edward Cree authored
      Just calls netif_receive_skb() in a loop.
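      
      From a driver's point of view the new entry point looks roughly like this
      (a sketch; the ring-walking helper is an illustrative placeholder):
      
      static struct sk_buff *example_next_rx_skb(struct napi_struct *napi);
      
      /* Sketch: collect the skbs from one NAPI poll on a list and hand the
       * whole batch to the stack instead of calling netif_receive_skb()
       * once per packet. */
      static int example_napi_poll(struct napi_struct *napi, int budget)
      {
              LIST_HEAD(rx_list);
              struct sk_buff *skb;
              int done = 0;
      
              while (done < budget && (skb = example_next_rx_skb(napi))) {
                      list_add_tail(&skb->list, &rx_list);
                      done++;
              }
      
              netif_receive_skb_list(&rx_list);
      
              if (done < budget)
                      napi_complete_done(napi, done);
              return done;
      }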
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6ad8c1b
    • Merge branch 'sctp-fully-support-for-dscp-and-flowlabel-per-transport' · 2bdea157
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: fully support for dscp and flowlabel per transport
      
      Now dscp and flowlabel are set from the sock when sending packets,
      but since sctp supports multi-homing, it also supports dscp and flowlabel
      per transport, as described in section 8.1.12 of RFC6458.
      
      v1->v2:
        - define ip_queue_xmit as inline in net/ip.h, instead of exporting
          it in Patch 1/5 according to David's suggestion.
        - fix the param len check in sctp_s/getsockopt_peer_addr_params()
          in Patch 3/5 to guarantee that an old app built with old kernel
          headers could work on the newer kernel per Marcelo's point.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2bdea157