1. 14 Mar, 2016 5 commits
  2. 12 Mar, 2016 1 commit
  3. 11 Mar, 2016 2 commits
  4. 10 Mar, 2016 5 commits
  5. 09 Mar, 2016 18 commits
    • David S. Miller's avatar
      Merge branch 'kcm' · 9531ab65
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      kcm: Kernel Connection Multiplexor (KCM)
      
      Kernel Connection Multiplexor (KCM) is a facility that provides a
      message based interface over TCP for generic application protocols.
      The motivation for this is based on the observation that although
      TCP is byte stream transport protocol with no concept of message
      boundaries, a common use case is to implement a framed application
      layer protocol running over TCP. To date, most TCP stacks offer
      byte stream API for applications, which places the burden of message
      delineation, message I/O operation atomicity, and load balancing
      in the application. With KCM an application can efficiently send
      and receive application protocol messages over TCP using a
      datagram interface.
      
      In order to delineate message in a TCP stream for receive in KCM, the
      kernel implements a message parser. For this we chose to employ BPF
      which is applied to the TCP stream. BPF code parses application layer
      messages and returns a message length. Nearly all binary application
      protocols are parsable in this manner, so KCM should be applicable
      across a wide range of applications. Other than message length
      determination in receive, KCM does not require any other application
      specific awareness. KCM does not implement any other application
      protocol semantics-- these are are provided in userspace or could be
      implemented in a kernel module layered above KCM.
      
      KCM implements an NxM multiplexor in the kernel as diagrammed below:
      
      +------------+   +------------+   +------------+   +------------+
      | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
      +------------+   +------------+   +------------+   +------------+
            |                 |               |                |
            +-----------+     |               |     +----------+
                        |     |               |     |
                     +----------------------------------+
                     |           Multiplexor            |
                     +----------------------------------+
                       |   |           |           |  |
             +---------+   |           |           |  ------------+
             |             |           |           |              |
      +----------+  +----------+  +----------+  +----------+ +----------+
      |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
      +----------+  +----------+  +----------+  +----------+ +----------+
            |              |           |            |             |
      +----------+  +----------+  +----------+  +----------+ +----------+
      | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
      +----------+  +----------+  +----------+  +----------+ +----------+
      
      The KCM sockets provide the datagram interface to applications,
      Psocks are the state for each attached TCP connection (i.e. where
      message delineation is performed on receive).
      
      A description of the APIs and design can be found in the included
      Documentation/networking/kcm.txt.
      
      In this patch set:
      
        - Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
          indicate that more messages will be sent on the socket. The stack
          may batch messages up if it is beneficial for transmission.
        - In sendmmsg, set MSG_BATCH in all sub messages except for the last
          one.
        - In order to allow sendmmsg to contain multiple messages with
          SOCK_SEQPAKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
        - Add KCM module
          - This supports SOCK_DGRAM and SOCK_SEQPACKET.
        - KCM documentation
      
      v2:
        - Added splice and page operations.
        - Assemble receive messages in place on TCP socket (don't have a
          separate assembly queue.
        - Based on above, enforce maxmimum receive message to be the size
          of the recceive socket buffer.
        - Support message assembly timeout. Use the timeout value in
          sk_rcvtimeo on the TCP socket.
        - Tested some with a couple of other production applications,
          see ~5% improvement in application latency.
      
      Testing:
      
      Dave Watson has integrated KCM into Thrift and we intend to put these
      changes into open source. Example of this is in:
      
      https://github.com/djwatson/fbthrift/commit/
      dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc
      
      Some initial KCM Thrift benchmark numbers (comment from Dave)
      
      Thrift by default ties a single connection to a single thread.  KCM is
      instead able to load balance multiple connections across multiple epoll
      loops easily.
      
      A test sending ~5k bytes of data to a kcm thrift server, dropping the
      bytes on recv:
      
      QPS     Latency / std dev Latency
        without KCM
          70336     209/123
        with KCM
          70353     191/124
      
      A test sending a small request, then doing work in the epoll thread,
      before serving more requests:
      
      QPS     Latency / std dev Latency
      without KCM
          14282     559/602
      with KCM
          23192     344/234
      
      At the high end, there's definitely some additional kernel overhead:
      
      Cranking the pipelining way up, with lots of small requests
      
      QPS     Latency / std dev Latency
      without KCM
         1863429     127/119
      with KCM
         1337713     192/241
      
      ---
      
      So for a "realistic" workload, KCM performs pretty well (second case).
      Under extreme conditions of highest tps we still have some work to do.
      In its nature a multiplexor will spread work between CPUs which is
      logically good for load balancing but coan conflict with the goal
      promoting affinity. Batching messages on both send and receive are
      the means to recoup performance.
      
      Future support:
      
       - Integration with TLS (TLS-in-kernel is a separate initiative).
       - Page operations/splice support
       - Unconnected KCM sockets. Will be able to attach sockets to different
         destinations, AF_KCM addresses with be used in sendmsg and recvmsg
         to indicate destination
       - Explore more utility in performing BPF inline with a TCP data stream
         (setting SO_MARK, rxhash for messages being sent received on
         KCM sockets).
       - Performance work
         - Diagnose performance issues under high message load
      
      FAQ (Questions posted on LWN)
      
      Q: Why do this in the kernel?
      
      A: Because the kernel is good at scheduling threads and steering packets
         to threads. KCM fits well into this model since it allows the unit
         of work for scheduling and steering to be the application layer
         messages themselves. KCM should be thought of as generic application
         protocol acceleration. It to the philosophy that the kernel provides
         generic and extensible interfaces.
      
      Q: How can adding code in the path yield better performance?
      
      A: It is true that for just sending receiving a single message there
         would be some performance loss since the code path is longer (for
         instance comparing netperf to KCM). But for real production
         applications performance takes on many dynamics. Parallelism, context
         switching, affinity, granularity of locking, and load balancing are
         all relevant. The theory of KCM is that by an application-centric
         interface, the kernel can provide better support for these
         performance characteristics.
      
      Q: Why not use an existing message-oriented protocol such as RUDP,
         DCCP, SCTP, RDS, and others?
      
      A: Because that would entail using a completely new transport protocol.
         Deploying a new protocol at scale is either a huge undertaking or
         fundamentally infeasible. This is true in either the Internet and in
         the data center due in a large part to protocol ossification.
         Besides, KCM we want KCM to work existing, well deployed application
         protocols that we couldn't change even if we wanted to (e.g. http/2).
      
         KCM simply defines a new interface method, it does not redefine any
         aspect of the transport protocol nor application protocol, nor set
         any new requirements on these. Neither does KCM attempt to implement
         any application protocol logic other than message deliniation in the
         stream. These are fundamental requirement of KCM.
      
      Q: How does this affect TCP?
      
      A: It doesn't, not in the slightest. The use of KCM can be one-sided,
         KCM has no effect on the wire.
      
      Q: Why force TCP into doing something it's not designed for?
      
      A: TCP is defined as transport protocol and there is no standard that
         says the API into TCP must be stream based sockets, or for that
         matter sockets at all (or even that TCP needs to be implemented in a
         kernel). KCM is not inconsistent with the design of TCP just because
         to makes an message based interface over TCP, if it were then every
         application protocol sending messages over TCP would also be! :-)
      
      Q: What about the problem of a connections with very slow rate of
         incoming data? As a result your application can get storms of very
         short reads. And it actually happens a lot with connection from
         mobile devices and it is a problem for servers handling a lot of
         connections.
      
      A: The storm of short reads will occur regardless of whether KCM is used
         or not. KCM does have one advantage in this scenario though, it will
         only wake up the application when a full message has been received,
         not for each packet that makes up part of a bigger messages. If a
         bunch of small messages are received, the application can receive
         messages in batches using recvmmsg.
      
      Q: Why not just use DPDK, or at least provide KCM like functionality in
         DPDK?
      
      A: DPDK, or more generally OS bypass presumably with a TCP stack in
         userland, presents a different model of load balancing than that of
         KCM (and the kernel). KCM implements load balancing of messages
         across the threads of an application, whereas DPDK load balances
         based on queues which are more static and coarse-grained since
         multiple connections are bound to queues. DPDK works best when
         processing of packets is silo'ed in a thread on the CPU processing
         a queue, and packet processing (for both the stack and application)
         is fairly uniform. KCM works well for applications where the amount
         of work to process messages varies an application work is commonly
         delegated to worker threads often on different CPUs.
      
         The message based interface over TCP is something that could be
         provide by a DPDK or OS bypass library.
      
      Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web
         sockets?
      
      A: Yes. KCM is most appropriate for message based protocols over TCP
         where is easy to deduce the message length (e.g. a length field)
         and the protocol implements its own message ordering semantics.
         Fortunately this encompasses many modern protocols.
      
      Q: How is memory limited and controlled?
      
      A: In v2 all data for messages is now kept in socket buffers, either
         those for TCP or KCM, so socket buffer limits are applicable.
         This includes receive messages assembly which is now done ont teh
         TCP socket buffer instead of a separate queue-- this has the
         consequence that the TCP socket buffer limit provides an
         enforceable maxmimum message size.
      
         Additionally, a timeout may be set for messages assembly. The
         value used for this is taken from sk_rcvtimeo of the TCP socket.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9531ab65
    • Tom Herbert's avatar
      kcm: Add description in Documentation · 10016594
      Tom Herbert authored
      Add kcm.txt to desribe KCM and interfaces.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      10016594
    • Tom Herbert's avatar
      kcm: Add receive message timeout · 29152a34
      Tom Herbert authored
      This patch adds receive timeout for message assembly on the attached TCP
      sockets. The timeout is set when a new messages is started and the whole
      message has not been received by TCP (not in the receive queue). If the
      completely message is subsequently received the timer is cancelled, if the
      timer expires the RX side is aborted.
      
      The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
      set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
      to KCM.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      29152a34
    • Tom Herbert's avatar
      kcm: Add memory limit for receive message construction · 7ced95ef
      Tom Herbert authored
      Message assembly is performed on the TCP socket. This is logically
      equivalent of an application that performs a peek on the socket to find
      out how much memory is needed for a receive buffer. The receive socket
      buffer also provides the maximum message size which is checked.
      
      The receive algorithm is something like:
      
         1) Receive the first skbuf for a message (or skbufs if multiple are
            needed to determine message length).
         2) Check the message length against the number of bytes in the TCP
            receive queue (tcp_inq()).
      	- If all the bytes of the message are in the queue (incluing the
      	  skbuf received), then proceed with message assembly (it should
      	  complete with the tcp_read_sock)
              - Else, mark the psock with the number of bytes needed to
      	  complete the message.
         3) In TCP data ready function, if the psock indicates that we are
            waiting for the rest of the bytes of a messages, check the number
            of queued bytes against that.
              - If there are still not enough bytes for the message, just
      	  return
              - Else, clear the waiting bytes and proceed to receive the
      	  skbufs.  The message should now be received in one
      	  tcp_read_sock
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7ced95ef
    • Tom Herbert's avatar
      kcm: Sendpage support · f29698fc
      Tom Herbert authored
      Implement kcm_sendpage. Set in sendpage to kcm_sendpage in both
      dgram and seqpacket ops.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f29698fc
    • Tom Herbert's avatar
      kcm: Splice support · 91687355
      Tom Herbert authored
      Implement kcm_splice_read. This is supported only for seqpacket.
      Add kcm_seqpacket_ops and set splice read to kcm_splice_read.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91687355
    • Tom Herbert's avatar
      kcm: Add statistics and proc interfaces · cd6e111b
      Tom Herbert authored
      This patch adds various counters for KCM. These include counters for
      messages and bytes received or sent, as well as counters for number of
      attached/unattached TCP sockets and other error or edge events.
      
      The statistics are exposed via a proc interface. /proc/net/kcm provides
      statistics per KCM socket and per psock (attached TCP sockets).
      /proc/net/kcm_stats provides aggregate statistics.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cd6e111b
    • Tom Herbert's avatar
      kcm: Kernel Connection Multiplexor module · ab7ac4eb
      Tom Herbert authored
      This module implements the Kernel Connection Multiplexor.
      
      Kernel Connection Multiplexor (KCM) is a facility that provides a
      message based interface over TCP for generic application protocols.
      With KCM an application can efficiently send and receive application
      protocol messages over TCP using datagram sockets.
      
      For more information see the included Documentation/networking/kcm.txt
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ab7ac4eb
    • Tom Herbert's avatar
      tcp: Add tcp_inq to get available receive bytes on socket · 473bd239
      Tom Herbert authored
      Create a common kernel function to get the number of bytes available
      on a TCP socket. This is based on code in INQ getsockopt and we now call
      the function for that getsockopt.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      473bd239
    • Tom Herbert's avatar
      net: Walk fragments in __skb_splice_bits · fa9835e5
      Tom Herbert authored
      Add walking of fragments in __skb_splice_bits.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fa9835e5
    • Tom Herbert's avatar
      net: Add MSG_BATCH flag · f092276d
      Tom Herbert authored
      Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
      indicate that more messages will follow (i.e. a batch of messages is
      being sent). This is similar to MSG_MORE except that the following
      messages are not merged into one packet, they are sent individually.
      sendmmsg is updated so that each contained message except for the
      last one is marked as MSG_BATCH.
      
      MSG_BATCH is a performance optimization in cases where a socket
      implementation can benefit by transmitting packets in a batch.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f092276d
    • Tom Herbert's avatar
      net: Allow MSG_EOR in each msghdr of sendmmsg · 28a94d8f
      Tom Herbert authored
      This patch allows setting MSG_EOR in each individual msghdr passed
      in sendmmsg. This allows a sendmmsg to send multiple messages when
      using SOCK_SEQPACKET.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28a94d8f
    • Tom Herbert's avatar
      net: Make sock_alloc exportable · f4a00aac
      Tom Herbert authored
      Export it for cases where we want to create sockets by hand.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f4a00aac
    • Tom Herbert's avatar
      rcu: Add list_next_or_null_rcu · ff3c44e6
      Tom Herbert authored
      This is a convenience function that returns the next entry in an RCU
      list or NULL if at the end of the list.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ff3c44e6
    • Alexei Starovoitov's avatar
      samples/bpf: add map performance test · 26e90931
      Alexei Starovoitov authored
      performance tests for hash map and per-cpu hash map
      with and without pre-allocation
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26e90931
    • Alexei Starovoitov's avatar
      samples/bpf: stress test bpf_get_stackid · 7dcc42b6
      Alexei Starovoitov authored
      increase stress by also calling bpf_get_stackid() from
      various *spin* functions
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7dcc42b6
    • Alexei Starovoitov's avatar
      samples/bpf: add bpf map stress test · 9d8b612d
      Alexei Starovoitov authored
      this test calls bpf programs from different contexts:
      from inside of slub, from rcu, from pretty much everywhere,
      since it kprobes all spin_lock functions.
      It stresses the bpf hash and percpu map pre-allocation,
      deallocation logic and call_rcu mechanisms.
      User space part adding more stress by walking and deleting map elements.
      
      Note that due to nature bpf_load.c the earlier kprobe+bpf programs are
      already active while loader loads new programs, creates new kprobes and
      attaches them.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9d8b612d
    • Daniel Borkmann's avatar
      ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET · e28e87ed
      Daniel Borkmann authored
      Helpers like ip_tunnel_info_opts_{get,set}() are only available if
      CONFIG_INET is set, thus add an empty definition into the header for
      the !CONFIG_INET case, where already other empty inline helpers are
      defined.
      
      This avoids ifdef kludge inside filter.c, but also vxlan and geneve
      themself where this facility can only be used with, depend on INET
      being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
      set in flags.
      
      Fixes: 14ca0751 ("bpf: support for access to tunnel options")
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e28e87ed
  6. 08 Mar, 2016 9 commits
    • David S. Miller's avatar
      Merge branch 'bpf-map-prealloc' · f14b488d
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: map pre-alloc
      
      v1->v2:
      . fix few issues spotted by Daniel
      . converted stackmap into pre-allocation as well
      . added a workaround for lockdep false positive
      . added pcpu_freelist_populate to be used by hashmap and stackmap
      
      this path set switches bpf hash map to use pre-allocation by default
      and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases
      where full map pre-allocation is too memory expensive.
      
      Some time back Daniel Wagner reported crashes when bpf hash map is
      used to compute time intervals between preempt_disable->preempt_enable
      and recently Tom Zanussi reported a dead lock in iovisor/bcc/funccount
      tool if it's used to count the number of invocations of kernel
      '*spin*' functions. Both problems are due to the recursive use of
      slub and can only be solved by pre-allocating all map elements.
      
      A lot of different solutions were considered. Many implemented,
      but at the end pre-allocation seems to be the only feasible answer.
      As far as pre-allocation goes it also was implemented 4 different ways:
      - simple free-list with single lock
      - percpu_ida with optimizations
      - blk-mq-tag variant customized for bpf use case
      - percpu_freelist
      For bpf style of alloc/free patterns percpu_freelist is the best
      and implemented in this patch set.
      Detailed performance numbers in patch 3.
      Patch 2 introduces percpu_freelist
      Patch 1 fixes simple deadlocks due to missing recursion checks
      Patch 5: converts stackmap to pre-allocation
      Patches 6-9: prepare test infra
      Patch 10: stress test for hash map infra. It attaches to spin_lock
      functions and bpf_map_update/delete are called from different contexts
      Patch 11: stress for bpf_get_stackid
      Patch 12: map performance test
      Reported-by: default avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Reported-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f14b488d
    • Alexei Starovoitov's avatar
      samples/bpf: test both pre-alloc and normal maps · c3f85cff
      Alexei Starovoitov authored
      extend test coveraged to include pre-allocated and run-time alloc maps
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3f85cff
    • Alexei Starovoitov's avatar
      samples/bpf: add map_flags to bpf loader · 89b97607
      Alexei Starovoitov authored
      note old loader is compatible with new kernel.
      map_flags are optional
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89b97607
    • Alexei Starovoitov's avatar
      samples/bpf: move ksym_search() into library · 3622e7e4
      Alexei Starovoitov authored
      move ksym search from offwaketime into library to be reused
      in other tests
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3622e7e4
    • Alexei Starovoitov's avatar
      samples/bpf: make map creation more verbose · 618ec9a7
      Alexei Starovoitov authored
      map creation is typically the first one to fail when rlimits are
      too low, not enough memory, etc
      Make this failure scenario more verbose
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      618ec9a7
    • Alexei Starovoitov's avatar
      bpf: convert stackmap to pre-allocation · 557c0c6e
      Alexei Starovoitov authored
      It was observed that calling bpf_get_stackid() from a kprobe inside
      slub or from spin_unlock causes similar deadlock as with hashmap,
      therefore convert stackmap to use pre-allocated memory.
      
      The call_rcu is no longer feasible mechanism, since delayed freeing
      causes bpf_get_stackid() to fail unpredictably when number of actual
      stacks is significantly less than user requested max_entries.
      Since elements are no longer freed into slub, we can push elements into
      freelist immediately and let them be recycled.
      However the very unlikley race between user space map_lookup() and
      program-side recycling is possible:
           cpu0                          cpu1
           ----                          ----
      user does lookup(stackidX)
      starts copying ips into buffer
                                         delete(stackidX)
                                         calls bpf_get_stackid()
      				   which recyles the element and
                                         overwrites with new stack trace
      
      To avoid user space seeing a partial stack trace consisting of two
      merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
      to preserve consistent stack trace delivery to user space.
      Now we can move memset(,0) of left-over element value from critical
      path of bpf_get_stackid() into slow-path of user space lookup.
      Also disallow lookup() from bpf program, since it's useless and
      program shouldn't be messing with collected stack trace.
      
      Note that similar race between user space lookup and kernel side updates
      is also present in hashmap, but it's not a new race. bpf programs were
      always allowed to modify hash and array map elements while user space
      is copying them.
      
      Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      557c0c6e
    • Alexei Starovoitov's avatar
    • Alexei Starovoitov's avatar
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov authored
      If kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following dead lock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      and deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      At the end pre-allocation of all map elements turned out to be the simplest
      solution and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing of per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      Turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is very common pattern used
      in many of iovisor/bcc/tools, so there is additional benefit of
      pre-allocation, since such use cases are must faster.
      
      Since all hash map elements are now pre-allocated we can remove
      atomic increment of htab->count and save few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simlified and customized
      for bpf use case. bt w/smp_align is using cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks continuously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist
      
      hlist+spinlock is the simplest free list with single spinlock.
      As expeceted it has very bad scaling in SMP.
      
      kmalloc is existing implementation which is still available via
      BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
      in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but saves memory, so in cases where map->max_entries can be large
      and number of map update/delete per second is low, it may make
      sense to use it.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6c905981
    • Alexei Starovoitov's avatar
      bpf: introduce percpu_freelist · e19494ed
      Alexei Starovoitov authored
      Introduce simple percpu_freelist to keep single list of elements
      spread across per-cpu singly linked lists.
      
      /* push element into the list */
      void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
      
      /* pop element from the list */
      struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
      
      The object is pushed to the current cpu list.
      Pop first trying to get the object from the current cpu list,
      if it's empty goes to the neigbour cpu list.
      
      For bpf program usage pattern the collision rate is very low,
      since programs push and pop the objects typically on the same cpu.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e19494ed