- 15 Jun, 2016 40 commits
-
-
David S. Miller authored
Michael S. Tsirkin says: ==================== skb_array: array based FIFO for skbs This is in response to the proposal by Jason to make tun rx packet queue lockless using a circular buffer. My testing seems to show that at least for the common usecase in networking, which isn't lockless, circular buffer with indices does not perform that well, because each index access causes a cache line to bounce between CPUs, and index access causes stalls due to the dependency. By comparison, an array of pointers where NULL means invalid and !NULL means valid, can be updated without messing up barriers at all and does not have this issue. On the flip side, cache pressure may be caused by using large queues. tun has a queue of 1000 entries by default and that's 8K. At this point I'm not sure this can be solved efficiently. The correct solution might be sizing the queues appropriately. Here's an implementation of this idea: it can be used more or less whenever sk_buff_head can be used, except you need to know the queue size in advance. As this might be useful outside of networking, I implemented a generic array of void pointers, with a type-safe wrapper for skbs. It remains to be seen whether resizing is required, in case it is I included patches implementing resizing by holding both the consumer and the producer locks. I think this code works fine without any extra memory barriers since we always read and write the same location, so the accesses can not be reordered. Multiple writes of the same value into memory would mess things up for us, I don't think compilers would do it though. But if people feel it's better to be safe wrt compiler optimizations, specifying queue as volatile would probably do it in a cleaner way than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts? The only issue is with calls within a loop using the __ptr_ring_XXX accessors - in theory compiler could hoist accesses out of the loop. Following volatile-considered-harmful.txt I merely documented that callers that busy-poll should invoke cpu_relax(). Most people will use the external skb_array_XXX APIs with a spinlock, so this should not be an issue for them. Eric Dumazet suggested adding an extra pointer to skb for when we have a single outstanding packet. I could not figure out a way to implement this without a shared consumer/producer lock though, which would cause cache line bounces by itself. Jesper, Jason, I know that both of you tested this, please post Tested-by tags for whatever was tested. changes since v7 fix typos noticed by Jesper Brouer changes since v6 resize implemented. peek/full calls are no longer lockless replaced _FIELD macros with _CALL which invoke a function on the pointer rather than just returning a value destroy now scans the array and frees all queued skbs changes since v5 implemented a generic ptr_ring api, and made skb_array a type-safe wrapper apis for taking the spinlock in different contexts following expected usecase in tun changes since v4 (v3 was never posted) documentation dropped SKB_ARRAY_MIN_SIZE heuristic unit test (in userspace, included as patch 2) changes since v2: fixed integer overflow pointed out by Eric. added some comments. changes since v1: fixed bug pointed out by Eric. ==================== Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
Update skb_array after ptr_ring API changes. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
This adds ring resize support. Seems to be necessary as users such as tun allow userspace control over queue size. If resize is used, this costs us ability to peek at queue without consumer lock - should not be a big deal as peek and consumer are usually run on the same CPU. If ring is made bigger, ring contents is preserved. If ring is made smaller, extra pointers are passed to an optional destructor callback. Cleanup function also gains destructor callback such that all pointers in queue can be cleaned up. This changes some APIs but we don't have any users yet, so it won't break bisect. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
A simple array based FIFO of pointers. Intended for net stack so uses skbs for type safety. Implemented as a set of wrappers around ptr_ring. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
Add ringtest based unit test for ptr ring. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
A simple array based FIFO of pointers. Intended for net stack which commonly has a single consumer/producer. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
WANG Cong authored
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
David Ahern says: ==================== net: vrf: Handle ipv6 multicast and link-local addresses IPv6 multicast and link-local addresses require special handling by the VRF driver. Rather than using the VRF device index and full FIB lookups, packets to/from these addresses should use direct FIB lookups based on the VRF device table. Multicast routes do not make sense for the L3 master device directly. Accordingly, do not add mcast routes for the device, and the VRF driver should fail attempts to send packets to ipv6 mcast addresses on the device (e.g, ping6 ff02::1%<vrf> should fail) With this change connections into and out of a VRF enslaved device work for multicast and link-local addresses (icmp, tcp, and udp). e.g., 1. packets into VM with VRF config: ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1 ping6 -c3 ff02::1%br1 ssh -6 fe80::e0:f9ff:fe1c:b974%br1 2. packets going out a VRF enslaved device: ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1 ping6 -c3 ff02::1%eth1 ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
IPv6 multicast and link-local addresses require special handling by the VRF driver: 1. Rather than using the VRF device index and full FIB lookups, packets to/from these addresses should use direct FIB lookups based on the VRF device table. 2. fail sends/receives on a VRF device to/from a multicast address (e.g, make ping6 ff02::1%<vrf> fail) 3. move the setting of the flow oif to the first dst lookup and revert the change in icmpv6_echo_reply made in ca254490 ("net: Add VRF support to IPv6 stack"). Linklocal/mcast addresses require use of the skb->dev. With this change connections into and out of a VRF enslaved device work for multicast and link-local addresses work (icmp, tcp, and udp) e.g., 1. packets into VM with VRF config: ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1 ping6 -c3 ff02::1%br1 ssh -6 fe80::e0:f9ff:fe1c:b974%br1 2. packets going out a VRF enslaved device: ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1 ping6 -c3 ff02::1%eth1 ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
L3 master devices are virtual devices similar to the loopback device. Link local and multicast routes for these devices do not make sense. The ipv6 addrconf code already skips adding a linklocal address; do the same for the mcast route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Allow drivers to pass flow arg to functions where the arg is not const and allow the driver to make updates as needed (eg., setting oif). Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Ursula Braun says: ==================== s390: af_iucv patches here are improvements for af_iucv relaxing the pressure to allocate big contiguous kernel buffers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eugene Crosser authored
When an inbound message is bigger than a page, allocate a paged SKB, and subsequently use IUCV receive primitive with IPBUFLST flag. This relaxes the pressure to allocate big contiguous kernel buffers. Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eugene Crosser authored
Before introducing paged skbs in the receive path, get rid of the function `iucv_fragment_skb()` that replaces one large linear skb with several smaller linear skbs. Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eugene Crosser authored
When an outbound message is bigger than a page, allocate and fill a paged SKB, and subsequently use IUCV send primitive with IPBUFLST flag. This relaxes the pressure to allocate big contiguous kernel buffers. Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Shiyan authored
Add device tree binding documentation details for Cirrus Logic CS8900/CS8920 ethernet chip. Signed-off-by: Alexander Shiyan <shc_work@mail.ru> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Shiyan authored
Add DT support to the Cirrus Logic CS89x0 driver. Signed-off-by: Alexander Shiyan <shc_work@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
-
WANG Cong authored
This function is just ->init(), rename it to make it obvious. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
WANG Cong authored
These should be gone when we removed CONFIG_NET_CLS_POLICE. We can not totally remove them since they are exposed to userspace. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Sowmini Varadhan says: ==================== RDS: multiple connection paths for scaling Today RDS-over-TCP is implemented by demux-ing multiple PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, port]) over a single TCP socket between the 2 IP addresses involved. This has the limitation that it ends up funneling multiple RDS flows over a single TCP flow, thus the rds/tcp connection is (a) upper-bounded to the single-flow bandwidth, (b) suffers from head-of-line blocking for the RDS sockets. Better throughput (for a fixed small packet size, MTU) can be achieved by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp connection. RDS sockets will be attached to a path based on some hash (e.g., of local address and RDS port number) and packets for that RDS socket will be sent over the attached path using TCP to segment/reassemble RDS datagrams on that path. The table below, generated using a prototype that implements mprds, shows that this is significant for scaling to 40G. Packet sizes used were: 8K byte req, 256 byte resp. MTU: 1500. The parameters for RDS-concurrency used below are described in the rds-stress(1) man page- the number listed is proportional to the number of threads at which max throughput was attained. ------------------------------------------------------------------- RDS-concurrency Num of tx+rx K/s (iops) throughput (-t N -d N) TCP paths ------------------------------------------------------------------- 16 1 600K - 700K 4 Gbps 28 8 5000K - 6000K 32 Gbps ------------------------------------------------------------------- FAQ: what is the relation between mprds and mptcp? mprds is orthogonal to mptcp. Whereas mptcp creates sub-flows for a single TCP connection, mprds parallelizes tx/rx at the RDS layer. MPRDS with N paths will allow N datagrams to be sent in parallel; each path will continue to send one datagram at a time, with sender and receiver keeping track of the retransmit and dgram-assembly state based on the RDS header. If desired, mptcp can additionally be used to speed up each TCP path. That acceleration is orthogonal to the parallelization benefits of mprds. This patch series lays down the foundational data-structures to support mprds in the kernel. It implements the changes to split up the rds_connection structure into a common (to all paths) part, and a per-path rds_conn_path. All I/O workqs are driven from the rds_conn_path. Note that this patchset does not (yet) actually enable multipathing for any of the transports; all transports will continue to use a single path with the refactored data-structures. A subsequent patchset will add the changes to the rds-tcp module to actually use mprds in rds-tcp. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Refactor rds_conn_destroy() so that the per-path dismantling is done in rds_conn_path_destroy, and then iterate as needed over rds_conn_path_destroy(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
This commit changes rds_conn_shutdown to take a rds_conn_path * argument, allowing it to shutdown paths other than c_path[0] for MP-capable transports. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Add a for() loop in __rds_conn_create to initialize all the conn_paths, in preparate for MP capable transports. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
rds_conn_path_error() is the MP-aware analog of rds_conn_error, to be used by multipath-capable callers. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
This commit updates the callbacks related to the rds-info command so that they walk through all the rds_conn_path structures and report the requested info. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
rds_conn_path_connect_if_down() works on the rds_conn_path that it is passed. Callers who are not t_m_capable may continue calling rds_conn_connect_if_down, which will invoke rds_conn_path_connect_if_down() with the default c_path[0]. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
This commit allows rds_send_pong() callers to send back the rds pong message on some path other than c_path[0] by passing in a struct rds_conn_path * argument. It also removes the last dependency on the #defines in rds_single.h from send.c Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Explicitly set up rds_conn_path, either from i_conn_path (for MP capable transpots) or as c_path[0], and use this in rds_send_drop_to() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Pass a struct rds_conn_path to rds_send_xmit so that MP capable transports can transmit packets on something other than c_path[0]. The eventual goal for MP capable transports is to hash the rds socket to a path based on the bound local address/port, and use this path as the argument to rds_send_xmit() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Pass the rds_conn_path to rds_send_queue_rm, and use it to initialize the i_conn_path field in struct rds_incoming. This commit also makes rds_send_queue_rm() MP capable, because it now takes locks specific to the rds_conn_path passed in, instead of defaulting to the c_path[0] based defines from rds_single_path.h Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
The only caller of rds_send_get_message() was rds_iw_send_cq_comp_handler() which was removed as part of commit dcdede04 ("RDS: Drop stale iWARP RDMA transport"), so remove rds_send_get_message() for the same reason. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
rds_send_path_drop_acked() is the path-specific version of rds_send_drop_acked() to be invoked by MP capable callers. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
rds_send_path_reset() is the path specific version of rds_send_reset() intended for MP capable callers. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
t_mp_capable transports can use rds_inc_path_init to initialize all fields in struct rds_incoming, including the i_conn_path. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Transports that are t_mp_capable should set the rds_conn_path on which the datagram was recived in the ->i_conn_path field of struct rds_incoming. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
The t_mp_capable bit will be used in the core rds module to support multipathing logic when the transport supports it. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
In preparation for multipath RDS, split the rds_connection structure into a base structure, and a per-path struct rds_conn_path. The base structure tracks information and locks common to all paths. The workqs for send/recv/shutdown etc are tracked per rds_conn_path. Thus the workq callbacks now work with rds_conn_path. This commit allows for one rds_conn_path per rds_connection, and will be extended into multiple conn_paths in subsequent commits. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
Make sure that dctcp_get_info() returns only the size of the info->dctcp struct that it zeroes out and fills in. Previously it had been returning the size of the enclosing tcp_cc_info union, sizeof(*info). There is no problem yet, but that union that may one day be larger than struct tcp_dctcp_info, in which case the TCP_CC_INFO code might accidentally copy uninitialized bytes from the stack. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Wei Yongjun authored
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Merge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Rename rxrpc source files Here's the next part of the AF_RXRPC rewrite. In this set I rename some of the files in the net/rxrpc/ directory and adjust the Makefile and ar-internal.h to reflect the changes. The aim is twofold: (1) Remove the "ar-" prefix on those files that have it as it's not really useful, especially now that I'm building rxkad in. (2) To aid splitting the local, peer, connection and call handling code into separate files for object and event handling in future patches by making it easier to come up with new filenames. There are two commits: (1) The first commit does a bunch of renames of .c files and alters the Makefile. ar-internal.h isn't renamed at this time to avoid having to change the contents of the files being renamed. (2) The second commit changes the section label comments in ar-internal.h to reflect the changed filenames and reorders the file so that the sections are back in filename order. The patches can be found here also: http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite Tagged thusly: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git rxrpc-rewrite-20160613 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-