• Tom Herbert's avatar
    rfs: Receive Flow Steering · fec5e652
    Tom Herbert authored
    This patch implements receive flow steering (RFS).  RFS steers
    received packets for layer 3 and 4 processing to the CPU where
    the application for the corresponding flow is running.  RFS is an
    extension of Receive Packet Steering (RPS).
    
    The basic idea of RFS is that when an application calls recvmsg
    (or sendmsg) the application's running CPU is stored in a hash
    table that is indexed by the connection's rxhash which is stored in
    the socket structure.  The rxhash is passed in skb's received on
    the connection from netif_receive_skb.  For each received packet,
    the associated rxhash is used to look up the CPU in the hash table,
    if a valid CPU is set then the packet is steered to that CPU using
    the RPS mechanisms.
    
    The convolution of the simple approach is that it would potentially
    allow OOO packets.  If threads are thrashing around CPUs or multiple
    threads are trying to read from the same sockets, a quickly changing
    CPU value in the hash table could cause rampant OOO packets--
    we consider this a non-starter.
    
    To avoid OOO packets, this solution implements two types of hash
    tables: rps_sock_flow_table and rps_dev_flow_table.
    
    rps_sock_table is a global hash table.  Each entry is just a CPU
    number and it is populated in recvmsg and sendmsg as described above.
    This table contains the "desired" CPUs for flows.
    
    rps_dev_flow_table is specific to each device queue.  Each entry
    contains a CPU and a tail queue counter.  The CPU is the "current"
    CPU for a matching flow.  The tail queue counter holds the value
    of a tail queue counter for the associated CPU's backlog queue at
    the time of last enqueue for a flow matching the entry.
    
    Each backlog queue has a queue head counter which is incremented
    on dequeue, and so a queue tail counter is computed as queue head
    count + queue length.  When a packet is enqueued on a backlog queue,
    the current value of the queue tail counter is saved in the hash
    entry of the rps_dev_flow_table.
    
    And now the trick: when selecting the CPU for RPS (get_rps_cpu)
    the rps_sock_flow table and the rps_dev_flow table for the RX queue
    are consulted.  When the desired CPU for the flow (found in the
    rps_sock_flow table) does not match the current CPU (found in the
    rps_dev_flow table), the current CPU is changed to the desired CPU
    if one of the following is true:
    
    - The current CPU is unset (equal to RPS_NO_CPU)
    - Current CPU is offline
    - The current CPU's queue head counter >= queue tail counter in the
    rps_dev_flow table.  This checks if the queue tail has advanced
    beyond the last packet that was enqueued using this table entry.
    This guarantees that all packets queued using this entry have been
    dequeued, thus preserving in order delivery.
    
    Making each queue have its own rps_dev_flow table has two advantages:
    1) the tail queue counters will be written on each receive, so
    keeping the table local to interrupting CPU s good for locality.  2)
    this allows lockless access to the table-- the CPU number and queue
    tail counter need to be accessed together under mutual exclusion
    from netif_receive_skb, we assume that this is only called from
    device napi_poll which is non-reentrant.
    
    This patch implements RFS for TCP and connected UDP sockets.
    It should be usable for other flow oriented protocols.
    
    There are two configuration parameters for RFS.  The
    "rps_flow_entries" kernel init parameter sets the number of
    entries in the rps_sock_flow_table, the per rxqueue sysfs entry
    "rps_flow_cnt" contains the number of entries in the rps_dev_flow
    table for the rxqueue.  Both are rounded to power of two.
    
    The obvious benefit of RFS (over just RPS) is that it achieves
    CPU locality between the receive processing for a flow and the
    applications processing; this can result in increased performance
    (higher pps, lower latency).
    
    The benefits of RFS are dependent on cache hierarchy, application
    load, and other factors.  On simple benchmarks, we don't necessarily
    see improvement and sometimes see degradation.  However, for more
    complex benchmarks and for applications where cache pressure is
    much higher this technique seems to perform very well.
    
    Below are some benchmark results which show the potential benfit of
    this patch.  The netperf test has 500 instances of netperf TCP_RR
    test with 1 byte req. and resp.  The RPC test is an request/response
    test similar in structure to netperf RR test ith 100 threads on
    each host, but does more work in userspace that netperf.
    
    e1000e on 8 core Intel
       No RFS or RPS		104K tps at 30% CPU
       No RFS (best RPS config):    290K tps at 63% CPU
       RFS				303K tps at 61% CPU
    
    RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
      No RFS/RPS	103K	48%	757/900/3185		4472.35
      RPS only:	174K	73%	415/993/2468		491.66
      RFS		223K	73%	379/651/1382		315.61
    Signed-off-by: default avatarTom Herbert <therbert@google.com>
    Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    fec5e652
netdevice.h 69.6 KB