Merge branch 'Handle-multiple-received-packets-at-each-stage'

Edward Cree says: ==================== Handle multiple received packets at each stage This patch series adds the capability for the network stack to receive a list of packets and process them as a unit, rather than handling each packet singly in sequence. This is done by factoring out the existing datapath code at each layer and wrapping it in list handling code. The motivation for this change is twofold: * Instruction cache locality. Currently, running the entire network stack receive path on a packet involves more code than will fit in the lowest-level icache, meaning that when the next packet is handled, the code has to be reloaded from more distant caches. By handling packets in "row-major order", we ensure that the code at each layer is hot for most of the list. (There is a corresponding downside in _data_ cache locality, since we are now touching every packet at every layer, but in practice there is easily enough room in dcache to hold one cacheline of each of the 64 packets in a NAPI poll.) * Reduction of indirect calls. Owing to Spectre mitigations, indirect function calls are now more expensive than ever; they are also heavily used in the network stack's architecture (see [1]). By replacing 64 indirect calls to the next-layer per-packet function with a single indirect call to the next-layer list function, we can save CPU cycles. Drivers pass an SKB list to the stack at the end of the NAPI poll; this gives a natural batch size (the NAPI poll weight) and avoids waiting at the software level for further packets to make a larger batch (which would add latency). It also means that the batch size is automatically tuned by the existing interrupt moderation mechanism. The stack then runs each layer of processing over all the packets in the list before proceeding to the next layer. Where the 'next layer' (or the context in which it must run) differs among the packets, the stack splits the list; this 'late demux' means that packets which differ only in later headers (e.g. same L2/L3 but different L4) can traverse the early part of the stack together. Also, where the next layer is not (yet) list-aware, the stack can revert to calling the rest of the stack in a loop; this allows gradual/creeping listification, with no 'flag day' patch needed to listify everything. Patches 1-2 simply place received packets on a list during the event processing loop on the sfc EF10 architecture, then call the normal stack for each packet singly at the end of the NAPI poll. (Analogues of patch #2 for other NIC drivers should be fairly straightforward.) Patches 3-9 extend the list processing as far as the IP receive handler. Patches 1-2 alone give about a 10% improvement in packet rate in the baseline test; adding patches 3-9 raises this to around 25%. Performance measurements were made with NetPerf UDP_STREAM, using 1-byte packets and a single core to handle interrupts on the RX side; this was in order to measure as simply as possible the packet rate handled by a single core. Figures are in Mbit/s; divide by 8 to obtain Mpps. The setup was tuned for maximum reproducibility, rather than raw performance. Full details and more results (both with and without retpolines) from a previous version of the patch series are presented in [2]. The baseline test uses four streams, and multiple RXQs all bound to a single CPU (the netperf binary is bound to a neighbouring CPU). These tests were run with retpolines. net-next: 6.91 Mb/s (datum) after 9: 8.46 Mb/s (+22.5%) Note however that these results are not robust; changes in the parameters of the test sometimes shrink the gain to single-digit percentages. For instance, when using only a single RXQ, only a 4% gain was seen. One test variation was the use of software filtering/firewall rules. Adding a single iptables rule (UDP port drop on a port range not matching the test traffic), thus making the netfilter hook have work to do, reduced baseline performance but showed a similar gain from the patches: net-next: 5.02 Mb/s (datum) after 9: 6.78 Mb/s (+35.1%) Similarly, testing with a set of TC flower filters (kindly supplied by Cong Wang) gave the following: net-next: 6.83 Mb/s (datum) after 9: 8.86 Mb/s (+29.7%) These data suggest that the batching approach remains effective in the presence of software switching rules, and perhaps even improves the performance of those rules by allowing them and their codepaths to stay in cache between packets. Changes from v3: * Fixed build error when CONFIG_NETFILTER=n (thanks kbuild). Changes from v2: * Used standard list handling (and skb->list) instead of the skb-queue functions (that use skb->next, skb->prev). - As part of this, changed from a "dequeue, process, enqueue" model to using list_for_each_safe, list_del, and (new) list_cut_before. * Altered __netif_receive_skb_core() changes in patch 6 as per Willem de Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming). * Removed patches to Generic XDP, since they were producing no benefit. I may revisit them later. * Removed RFC tags. Changes from v1: * Rebased across 2 years' net-next movement (surprisingly straightforward). - Added Generic XDP handling to netif_receive_skb_list_internal() - Dealt with changes to PFMEMALLOC setting APIs * General cleanup of code and comments. * Skipped function calls for empty lists at various points in the stack (patch #9). * Added listified Generic XDP handling (patches 10-12), though it doesn't seem to help (see above). * Extended testing to cover software firewalls / netfilter etc. [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Handle-multiple-received-packets-at-each-stage'
Edward Cree says: ==================== Handle multiple received packets at each stage This patch series adds the capability for the network stack to receive a list of packets and process them as a unit, rather than handling each packet singly in sequence. This is done by factoring out the existing datapath code at each layer and wrapping it in list handling code. The motivation for this change is twofold: * Instruction cache locality. Currently, running the entire network stack receive path on a packet involves more code than will fit in the lowest-level icache, meaning that when the next packet is handled, the code has to be reloaded from more distant caches. By handling packets in "row-major order", we ensure that the code at each layer is hot for most of the list. (There is a corresponding downside in _data_ cache locality, since we are now touching every packet at every layer, but in practice there is easily enough room in dcache to hold one cacheline of each of the 64 packets in a NAPI poll.) * Reduction of indirect calls. Owing to Spectre mitigations, indirect function calls are now more expensive than ever; they are also heavily used in the network stack's architecture (see [1]). By replacing 64 indirect calls to the next-layer per-packet function with a single indirect call to the next-layer list function, we can save CPU cycles. Drivers pass an SKB list to the stack at the end of the NAPI poll; this gives a natural batch size (the NAPI poll weight) and avoids waiting at the software level for further packets to make a larger batch (which would add latency). It also means that the batch size is automatically tuned by the existing interrupt moderation mechanism. The stack then runs each layer of processing over all the packets in the list before proceeding to the next layer. Where the 'next layer' (or the context in which it must run) differs among the packets, the stack splits the list; this 'late demux' means that packets which differ only in later headers (e.g. same L2/L3 but different L4) can traverse the early part of the stack together. Also, where the next layer is not (yet) list-aware, the stack can revert to calling the rest of the stack in a loop; this allows gradual/creeping listification, with no 'flag day' patch needed to listify everything. Patches 1-2 simply place received packets on a list during the event processing loop on the sfc EF10 architecture, then call the normal stack for each packet singly at the end of the NAPI poll. (Analogues of patch #2 for other NIC drivers should be fairly straightforward.) Patches 3-9 extend the list processing as far as the IP receive handler. Patches 1-2 alone give about a 10% improvement in packet rate in the baseline test; adding patches 3-9 raises this to around 25%. Performance measurements were made with NetPerf UDP_STREAM, using 1-byte packets and a single core to handle interrupts on the RX side; this was in order to measure as simply as possible the packet rate handled by a single core. Figures are in Mbit/s; divide by 8 to obtain Mpps. The setup was tuned for maximum reproducibility, rather than raw performance. Full details and more results (both with and without retpolines) from a previous version of the patch series are presented in [2]. The baseline test uses four streams, and multiple RXQs all bound to a single CPU (the netperf binary is bound to a neighbouring CPU). These tests were run with retpolines. net-next: 6.91 Mb/s (datum) after 9: 8.46 Mb/s (+22.5%) Note however that these results are not robust; changes in the parameters of the test sometimes shrink the gain to single-digit percentages. For instance, when using only a single RXQ, only a 4% gain was seen. One test variation was the use of software filtering/firewall rules. Adding a single iptables rule (UDP port drop on a port range not matching the test traffic), thus making the netfilter hook have work to do, reduced baseline performance but showed a similar gain from the patches: net-next: 5.02 Mb/s (datum) after 9: 6.78 Mb/s (+35.1%) Similarly, testing with a set of TC flower filters (kindly supplied by Cong Wang) gave the following: net-next: 6.83 Mb/s (datum) after 9: 8.86 Mb/s (+29.7%) These data suggest that the batching approach remains effective in the presence of software switching rules, and perhaps even improves the performance of those rules by allowing them and their codepaths to stay in cache between packets. Changes from v3: * Fixed build error when CONFIG_NETFILTER=n (thanks kbuild). Changes from v2: * Used standard list handling (and skb->list) instead of the skb-queue functions (that use skb->next, skb->prev). - As part of this, changed from a "dequeue, process, enqueue" model to using list_for_each_safe, list_del, and (new) list_cut_before. * Altered __netif_receive_skb_core() changes in patch 6 as per Willem de Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming). * Removed patches to Generic XDP, since they were producing no benefit. I may revisit them later. * Removed RFC tags. Changes from v1: * Rebased across 2 years' net-next movement (surprisingly straightforward). - Added Generic XDP handling to netif_receive_skb_list_internal() - Dealt with changes to PFMEMALLOC setting APIs * General cleanup of code and comments. * Skipped function calls for empty lists at various points in the stack (patch #9). * Added listified Generic XDP handling (patches 10-12), though it doesn't seem to help (see above). * Extended testing to cover software firewalls / netfilter etc. [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2d1b1385 · David S. Miller · 2bdea157 · b9f463d6 · 2d1b1385 · 2d1b1385
Commit 2d1b1385 authored Jul 04, 2018 by David S. Miller
11 changed files
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -264,11 +264,17 @@ static int efx_check_disabled(struct efx_nic *efx)
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
 	struct efx_tx_queue *tx_queue;
+	struct list_head rx_list;
 	int spent;
 	if (unlikely(!channel->enabled))
 		return 0;
+	/* Prepare the batch receive list */
+	EFX_WARN_ON_PARANOID(channel->rx_list != NULL);
+	INIT_LIST_HEAD(&rx_list);
+	channel->rx_list = &rx_list;
 	efx_for_each_channel_tx_queue(tx_queue, channel) {
 		tx_queue->pkts_compl = 0;
 		tx_queue->bytes_compl = 0;
@@ -291,6 +297,10 @@ static int efx_process_channel(struct efx_channel *channel, int budget)
 		}
 	}
+	/* Receive any packets we queued up */
+	netif_receive_skb_list(channel->rx_list);
+	channel->rx_list = NULL;
 	return spent;
 }
@@ -555,6 +565,8 @@ static int efx_probe_channel(struct efx_channel *channel)
 			goto fail;
 	}
+	channel->rx_list = NULL;
 	return 0;
 fail:

--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -448,6 +448,7 @@ enum efx_sync_events_state {
 *	__efx_rx_packet(), or zero if there is none
 * @rx_pkt_index: Ring index of first buffer for next packet to be delivered
 *	by __efx_rx_packet(), if @rx_pkt_n_frags != 0
+ * @rx_list: list of SKBs from current RX, awaiting processing
 * @rx_queue: RX queue for this channel
 * @tx_queue: TX queues for this channel
 * @sync_events_state: Current state of sync events on this channel
@@ -500,6 +501,8 @@ struct efx_channel {
 	unsigned int rx_pkt_n_frags;
 	unsigned int rx_pkt_index;
+	struct list_head *rx_list;
 	struct efx_rx_queue rx_queue;
 	struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];

--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -634,6 +634,11 @@ static void efx_rx_deliver(struct efx_channel *channel, u8 *eh,
 			return;
 	/* Pass the packet up */
+	if (channel->rx_list != NULL)
+		/* Add to list, will pass up later */
+		list_add_tail(&skb->list, channel->rx_list);
+	else
+		/* No list, so pass it up now */
 		netif_receive_skb(skb);
 }

--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -285,6 +285,36 @@ static inline void list_cut_position(struct list_head *list,
 		__list_cut_position(list, head, entry);
 }
+/**
+ * list_cut_before - cut a list into two, before given entry
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *
+ * This helper moves the initial part of @head, up to but
+ * excluding @entry, from @head to @list.  You should pass
+ * in @entry an element you know is on @head.  @list should
+ * be an empty list or a list you do not care about losing
+ * its data.
+ * If @entry == @head, all entries on @head are moved to
+ * @list.
+ */
+static inline void list_cut_before(struct list_head *list,
+				   struct list_head *head,
+				   struct list_head *entry)
+{
+	if (head->next == entry) {
+		INIT_LIST_HEAD(list);
+		return;
+	}
+	list->next = head->next;
+	list->next->prev = list;
+	list->prev = entry->prev;
+	list->prev->next = list;
+	head->next = entry;
+	entry->prev = head;
+}
 static inline void __list_splice(const struct list_head *list,
 				 struct list_head *prev,
 				 struct list_head *next)

--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2297,6 +2297,9 @@ struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
+	void			(*list_func) (struct list_head *,
+					      struct packet_type *,
+					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
 					    struct sock *sk);
 	void			*af_packet_priv;
@@ -3477,6 +3480,7 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
+void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);

--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -288,6 +288,20 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct
 	return ret;
 }
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	struct sk_buff *skb, *next;
+	list_for_each_entry_safe(skb, next, head, list) {
+		int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
+		if (ret != 1)
+			list_del(&skb->list);
+	}
+}
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
 		  unsigned int len);
@@ -369,6 +383,14 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	return okfn(net, sk, skb);
 }
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	/* nothing to do */
+}
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 			  struct sock *sk, struct sk_buff *skb,
 			  struct net_device *indev, struct net_device *outdev,

--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -138,6 +138,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);

--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -223,6 +223,13 @@ DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_entry,
 	TP_ARGS(skb)
 );
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_list_entry,
+	TP_PROTO(const struct sk_buff *skb),
+	TP_ARGS(skb)
+);
 DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
 	TP_PROTO(const struct sk_buff *skb),

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4608,7 +4608,8 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
 	return 0;
 }
-static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
+				    struct packet_type **ppt_prev)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
@@ -4738,8 +4739,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	if (pt_prev) {
 		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 			goto drop;
-		else
+		*ppt_prev = pt_prev;
-			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
 drop:
 		if (!deliver_exact)
@@ -4757,6 +4757,18 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	return ret;
 }
+static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)
+{
+	struct net_device *orig_dev = skb->dev;
+	struct packet_type *pt_prev = NULL;
+	int ret;
+	ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+	if (pt_prev)
+		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	return ret;
+}
 /**
 *	netif_receive_skb_core - special purpose version of netif_receive_skb
 *	@skb: buffer to process
@@ -4777,13 +4789,67 @@ int netif_receive_skb_core(struct sk_buff *skb)
 	int ret;
 	rcu_read_lock();
-	ret = __netif_receive_skb_core(skb, false);
+	ret = __netif_receive_skb_one_core(skb, false);
 	rcu_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
+static inline void __netif_receive_skb_list_ptype(struct list_head *head,
+						  struct packet_type *pt_prev,
+						  struct net_device *orig_dev)
+{
+	struct sk_buff *skb, *next;
+	if (!pt_prev)
+		return;
+	if (list_empty(head))
+		return;
+	if (pt_prev->list_func != NULL)
+		pt_prev->list_func(head, pt_prev, orig_dev);
+	else
+		list_for_each_entry_safe(skb, next, head, list)
+			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+{
+	/* Fast-path assumptions:
+	 * - There is no RX handler.
+	 * - Only one packet_type matches.
+	 * If either of these fails, we will end up doing some per-packet
+	 * processing in-line, then handling the 'last ptype' for the whole
+	 * sublist.  This can't cause out-of-order delivery to any single ptype,
+	 * because the 'last ptype' must be constant across the sublist, and all
+	 * other ptypes are handled per-packet.
+	 */
+	/* Current (common) ptype of sublist */
+	struct packet_type *pt_curr = NULL;
+	/* Current (common) orig_dev of sublist */
+	struct net_device *od_curr = NULL;
+	struct list_head sublist;
+	struct sk_buff *skb, *next;
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *orig_dev = skb->dev;
+		struct packet_type *pt_prev = NULL;
+		__netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+		if (pt_curr != pt_prev || od_curr != orig_dev) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+			/* start new sublist */
+			pt_curr = pt_prev;
+			od_curr = orig_dev;
+		}
+	}
+	/* dispatch final sublist */
+	__netif_receive_skb_list_ptype(head, pt_curr, od_curr);
+}
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	int ret;
@@ -4801,14 +4867,44 @@ static int __netif_receive_skb(struct sk_buff *skb)
 		 * context down to all allocation sites.
 		 */
 		noreclaim_flag = memalloc_noreclaim_save();
-		ret = __netif_receive_skb_core(skb, true);
+		ret = __netif_receive_skb_one_core(skb, true);
 		memalloc_noreclaim_restore(noreclaim_flag);
 	} else
-		ret = __netif_receive_skb_core(skb, false);
+		ret = __netif_receive_skb_one_core(skb, false);
 	return ret;
 }
+static void __netif_receive_skb_list(struct list_head *head)
+{
+	unsigned long noreclaim_flag = 0;
+	struct sk_buff *skb, *next;
+	bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
+	list_for_each_entry_safe(skb, next, head, list) {
+		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
+			struct list_head sublist;
+			/* Handle the previous sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			pfmemalloc = !pfmemalloc;
+			/* See comments in __netif_receive_skb */
+			if (pfmemalloc)
+				noreclaim_flag = memalloc_noreclaim_save();
+			else
+				memalloc_noreclaim_restore(noreclaim_flag);
+		}
+	}
+	/* Handle the remaining sublist */
+	if (!list_empty(head))
+		__netif_receive_skb_list_core(head, pfmemalloc);
+	/* Restore pflags */
+	if (pfmemalloc)
+		memalloc_noreclaim_restore(noreclaim_flag);
+}
 static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
@@ -4883,6 +4979,50 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	return ret;
 }
+static void netif_receive_skb_list_internal(struct list_head *head)
+{
+	struct bpf_prog *xdp_prog = NULL;
+	struct sk_buff *skb, *next;
+	list_for_each_entry_safe(skb, next, head, list) {
+		net_timestamp_check(netdev_tstamp_prequeue, skb);
+		if (skb_defer_rx_timestamp(skb))
+			/* Handled, remove from list */
+			list_del(&skb->list);
+	}
+	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		preempt_disable();
+		rcu_read_lock();
+		list_for_each_entry_safe(skb, next, head, list) {
+			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+			if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
+				/* Dropped, remove from list */
+				list_del(&skb->list);
+		}
+		rcu_read_unlock();
+		preempt_enable();
+	}
+	rcu_read_lock();
+#ifdef CONFIG_RPS
+	if (static_key_false(&rps_needed)) {
+		list_for_each_entry_safe(skb, next, head, list) {
+			struct rps_dev_flow voidflow, *rflow = &voidflow;
+			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+			if (cpu >= 0) {
+				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+				/* Handled, remove from list */
+				list_del(&skb->list);
+			}
+		}
+	}
+#endif
+	__netif_receive_skb_list(head);
+	rcu_read_unlock();
+}
 /**
 *	netif_receive_skb - process receive buffer from network
 *	@skb: buffer to process
@@ -4906,6 +5046,28 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
+/**
+ *	netif_receive_skb_list - process many receive buffers from network
+ *	@head: list of skbs to process.
+ *
+ *	Since return value of netif_receive_skb() is normally ignored, and
+ *	wouldn't be meaningful for a list, this function returns void.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ */
+void netif_receive_skb_list(struct list_head *head)
+{
+	struct sk_buff *skb;
+	if (list_empty(head))
+		return;
+	list_for_each_entry(skb, head, list)
+		trace_netif_receive_skb_list_entry(skb);
+	netif_receive_skb_list_internal(head);
+}
+EXPORT_SYMBOL(netif_receive_skb_list);
 DEFINE_PER_CPU(struct work_struct, flush_works);
 /* Network device is going away, flush any packets still pending */

--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1882,6 +1882,7 @@ fs_initcall(ipv4_offload_init);
 static struct packet_type ip_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
 	.func = ip_rcv,
+	.list_func = ip_list_rcv,
 };
 static int __init inet_init(void)

--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -307,7 +307,8 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	return true;
 }
-static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int ip_rcv_finish_core(struct net *net, struct sock *sk,
+			      struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	int (*edemux)(struct sk_buff *skb);
@@ -393,7 +394,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
-	return dst_input(skb);
+	return NET_RX_SUCCESS;
 drop:
 	kfree_skb(skb);
@@ -405,13 +406,21 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	goto drop;
 }
+static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret = ip_rcv_finish_core(net, sk, skb);
+	if (ret != NET_RX_DROP)
+		ret = dst_input(skb);
+	return ret;
+}
 /*
 * 	Main IP Receive routine.
 */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
 	const struct iphdr *iph;
-	struct net *net;
 	u32 len;
 	/* When the interface is in promisc. mode, drop all the crap
@@ -421,7 +430,6 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
-	net = dev_net(dev);
 	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
 	skb = skb_share_check(skb, GFP_ATOMIC);
@@ -489,9 +497,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
-	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+	return skb;
-		       net, NULL, skb, dev, NULL,
-		       ip_rcv_finish);
 csum_error:
 	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
@@ -500,5 +506,95 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 drop:
 	kfree_skb(skb);
 out:
+	return NULL;
+}
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+	   struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+	skb = ip_rcv_core(skb, net);
+	if (skb == NULL)
 		return NET_RX_DROP;
+	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+		       net, NULL, skb, dev, NULL,
+		       ip_rcv_finish);
+}
+static void ip_sublist_rcv_finish(struct list_head *head)
+{
+	struct sk_buff *skb, *next;
+	list_for_each_entry_safe(skb, next, head, list)
+		dst_input(skb);
+}
+static void ip_list_rcv_finish(struct net *net, struct sock *sk,
+			       struct list_head *head)
+{
+	struct dst_entry *curr_dst = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct dst_entry *dst;
+		if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
+			continue;
+		dst = skb_dst(skb);
+		if (curr_dst != dst) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv_finish(&sublist);
+			/* start new sublist */
+			curr_dst = dst;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv_finish(head);
+}
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
+{
+	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+		     head, dev, NULL, ip_rcv_finish);
+	ip_list_rcv_finish(net, NULL, head);
+}
+/* Receive a list of IP packets */
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev)
+{
+	struct net_device *curr_dev = NULL;
+	struct net *curr_net = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *dev = skb->dev;
+		struct net *net = dev_net(dev);
+		skb = ip_rcv_core(skb, net);
+		if (skb == NULL)
+			continue;
+		if (curr_dev != dev || curr_net != net) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv(&sublist, dev, net);
+			/* start new sublist */
+			curr_dev = dev;
+			curr_net = net;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv(head, curr_dev, curr_net);
 }