Commit c2d548ca authored by Joshua Hay, committed by Tony Nguyen

idpf: add TX splitq napi poll support

Add support to handle the interrupts for the TX completion queue and
process the various completion types.

In the flow scheduling mode, the driver processes primarily buffer
completions as well as descriptor completions occasionally. This mode
supports out of order TX completions. To do so, HW generates one buffer
completion per packet. Each of those completions contains the unique tag
provided during the TX encoding which is used to locate the packet either
on the TX buffer ring or in a hash table. The hash table is used to track
TX buffer information so the descriptor(s) for a given packet can be
reused while the driver is still waiting on the buffer completion(s).

Packets end up in the hash table in one of 2 ways: 1) a packet was
stashed during descriptor completion cleaning, or 2) because an out of
order buffer completion was processed. A descriptor completion arrives
only every so often and is primarily used to guarantee the TX descriptor
ring can be reused without having to wait on the individual buffer
completions. E.g. a descriptor completion for N+16 guarantees HW read all
of the descriptors for packets N through N+15, therefore all of the
buffers for packets N through N+15 are stashed into the hash table and the
descriptors can be reused for more TX packets. Similarly, a packet can be
stashed in the hash table because an out of order buffer completion was
processed. E.g. processing a buffer completion for packet N+3 implies that
HW read all of the descriptors for packets N through N+3 and they can be
reused. However, the HW did not do the DMA yet. The buffers for packets N
through N+2 cannot be freed, so they are stashed in the hash table.
In either case, the buffer completions will eventually be processed for
all of the stashed packets, and all of the buffers will be cleaned from
the hash table.

In queue based scheduling mode, the driver processes primarily descriptor
completions and cleans the TX ring the conventional way.

Finally, the driver triggers a TX queue drain after sending the disable
queues virtchnl message. When the HW completes the queue draining, it
sends the driver a queue marker packet completion. The driver determines
when all TX queues have been drained and proceeds with the disable flow.

With this, the driver can send TX packets and clean up the resources
properly.
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Co-developed-by: Alan Brady <alan.brady@intel.com>
Signed-off-by: Alan Brady <alan.brady@intel.com>
Co-developed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Madhu Chittim <madhu.chittim@intel.com>
Co-developed-by: Phani Burra <phani.r.burra@intel.com>
Signed-off-by: Phani Burra <phani.r.burra@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Co-developed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
parent 6818c4d5
......@@ -14,6 +14,7 @@ struct idpf_vport_max_q;
#include <linux/etherdevice.h>
#include <linux/pci.h>
#include <linux/bitfield.h>
#include <linux/dim.h>
#include "virtchnl2.h"
#include "idpf_lan_txrx.h"
......@@ -41,6 +42,8 @@ struct idpf_vport_max_q;
/* available message levels */
#define IDPF_AVAIL_NETIF_M (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK)
#define IDPF_DIM_PROFILE_SLOTS 5
#define IDPF_VIRTCHNL_VERSION_MAJOR VIRTCHNL2_VERSION_MAJOR_2
#define IDPF_VIRTCHNL_VERSION_MINOR VIRTCHNL2_VERSION_MINOR_0
......@@ -254,12 +257,24 @@ enum idpf_vport_vc_state {
extern const char * const idpf_vport_vc_state_str[];
/**
 * enum idpf_vport_flags - Vport flags
 * @IDPF_VPORT_SW_MARKER: Indicates that processing of the TX pipe drain
 *			  software marker packets is done; cleared by the
 *			  waiter in idpf_wait_for_marker_event()
 * @IDPF_VPORT_FLAGS_NBITS: Must be last
 */
enum idpf_vport_flags {
	IDPF_VPORT_SW_MARKER,
	IDPF_VPORT_FLAGS_NBITS,
};
/**
* struct idpf_vport - Handle for netdevices and queue resources
* @num_txq: Number of allocated TX queues
* @num_complq: Number of allocated completion queues
* @txq_desc_count: TX queue descriptor count
* @complq_desc_count: Completion queue descriptor count
* @compln_clean_budget: Work budget for completion clean
* @num_txq_grp: Number of TX queue groups
* @txq_grps: Array of TX queue groups
* @txq_model: Split queue or single queue queuing model
......@@ -280,6 +295,7 @@ extern const char * const idpf_vport_vc_state_str[];
* @adapter: back pointer to associated adapter
* @netdev: Associated net_device. Each vport should have one and only one
* associated netdev.
* @flags: See enum idpf_vport_flags
* @vport_type: Default SRIOV, SIOV, etc.
* @vport_id: Device given vport identifier
* @idx: Software index in adapter vports struct
......@@ -290,10 +306,12 @@ extern const char * const idpf_vport_vc_state_str[];
* @q_vector_idxs: Starting index of queue vectors
* @max_mtu: device given max possible MTU
* @default_mac_addr: device will give a default MAC to use
* @tx_itr_profile: TX profiles for Dynamic Interrupt Moderation
* @link_up: True if link is up
* @vc_msg: Virtchnl message buffer
* @vc_state: Virtchnl message state
* @vchnl_wq: Wait queue for virtchnl messages
* @sw_marker_wq: workqueue for marker packets
* @vc_buf_lock: Lock to protect virtchnl buffer
*/
struct idpf_vport {
......@@ -301,6 +319,7 @@ struct idpf_vport {
u16 num_complq;
u32 txq_desc_count;
u32 complq_desc_count;
u32 compln_clean_budget;
u16 num_txq_grp;
struct idpf_txq_group *txq_grps;
u32 txq_model;
......@@ -319,6 +338,7 @@ struct idpf_vport {
struct idpf_adapter *adapter;
struct net_device *netdev;
DECLARE_BITMAP(flags, IDPF_VPORT_FLAGS_NBITS);
u16 vport_type;
u32 vport_id;
u16 idx;
......@@ -330,6 +350,7 @@ struct idpf_vport {
u16 *q_vector_idxs;
u16 max_mtu;
u8 default_mac_addr[ETH_ALEN];
u16 tx_itr_profile[IDPF_DIM_PROFILE_SLOTS];
bool link_up;
......@@ -337,6 +358,7 @@ struct idpf_vport {
DECLARE_BITMAP(vc_state, IDPF_VC_NBITS);
wait_queue_head_t vchnl_wq;
wait_queue_head_t sw_marker_wq;
struct mutex vc_buf_lock;
};
......
......@@ -56,6 +56,14 @@ enum idpf_rss_hash {
BIT_ULL(IDPF_HASH_NONF_UNICAST_IPV6_UDP) | \
BIT_ULL(IDPF_HASH_NONF_MULTICAST_IPV6_UDP))
/* For idpf_splitq_base_tx_compl_desc */
/* Generation bit of the completion descriptor (bit 15) */
#define IDPF_TXD_COMPLQ_GEN_S 15
#define IDPF_TXD_COMPLQ_GEN_M BIT_ULL(IDPF_TXD_COMPLQ_GEN_S)
/* Completion type field, bits 13:11 — holds one of IDPF_TXD_COMPLT_* */
#define IDPF_TXD_COMPLQ_COMPL_TYPE_S 11
#define IDPF_TXD_COMPLQ_COMPL_TYPE_M GENMASK_ULL(13, 11)
/* Originating TX queue id, bits 9:0 */
#define IDPF_TXD_COMPLQ_QID_S 0
#define IDPF_TXD_COMPLQ_QID_M GENMASK_ULL(9, 0)
#define IDPF_TXD_CTX_QW1_MSS_S 50
#define IDPF_TXD_CTX_QW1_MSS_M GENMASK_ULL(63, 50)
#define IDPF_TXD_CTX_QW1_TSO_LEN_S 30
......@@ -75,6 +83,14 @@ enum idpf_rss_hash {
#define IDPF_TXD_QW1_DTYPE_S 0
#define IDPF_TXD_QW1_DTYPE_M GENMASK_ULL(3, 0)
/* TX Completion Descriptor Completion Types
 * (values of the IDPF_TXD_COMPLQ_COMPL_TYPE field)
 */
#define IDPF_TXD_COMPLT_ITR_FLUSH 0
/* Descriptor completion type 1 is reserved */
#define IDPF_TXD_COMPLT_RS 2
/* Descriptor completion type 3 is reserved */
#define IDPF_TXD_COMPLT_RE 4
/* HW-generated marker signalling a queue drain has finished */
#define IDPF_TXD_COMPLT_SW_MARKER 5
enum idpf_tx_desc_dtype_value {
IDPF_TX_DESC_DTYPE_DATA = 0,
IDPF_TX_DESC_DTYPE_CTX = 1,
......
......@@ -929,6 +929,7 @@ static struct idpf_vport *idpf_vport_alloc(struct idpf_adapter *adapter,
vport->idx = idx;
vport->adapter = adapter;
vport->compln_clean_budget = IDPF_TX_COMPLQ_CLEAN_BUDGET;
vport->default_vport = adapter->num_alloc_vports <
idpf_get_default_vports(adapter);
......@@ -1241,6 +1242,7 @@ void idpf_init_task(struct work_struct *work)
index = vport->idx;
vport_config = adapter->vport_config[index];
init_waitqueue_head(&vport->sw_marker_wq);
init_waitqueue_head(&vport->vchnl_wq);
mutex_init(&vport->vc_buf_lock);
......
This diff is collapsed.
......@@ -15,6 +15,9 @@
#define IDPF_MIN_TXQ_COMPLQ_DESC 256
#define IDPF_MAX_QIDS 256
#define IDPF_MIN_TX_DESC_NEEDED (MAX_SKB_FRAGS + 6)
#define IDPF_TX_WAKE_THRESH ((u16)IDPF_MIN_TX_DESC_NEEDED * 2)
#define MIN_SUPPORT_TXDID (\
VIRTCHNL2_TXDID_FLEX_FLOW_SCHED |\
VIRTCHNL2_TXDID_FLEX_TSO_CTX)
......@@ -79,6 +82,9 @@
#define IDPF_SPLITQ_RX_BUF_DESC(rxq, i) \
(&(((struct virtchnl2_splitq_rx_buf_desc *)((rxq)->desc_ring))[i]))
#define IDPF_SPLITQ_TX_COMPLQ_DESC(txcq, i) \
(&(((struct idpf_splitq_tx_compl_desc *)((txcq)->desc_ring))[i]))
#define IDPF_FLEX_TX_DESC(txq, i) \
(&(((union idpf_tx_flex_desc *)((txq)->desc_ring))[i]))
#define IDPF_FLEX_TX_CTX_DESC(txq, i) \
......@@ -155,7 +161,8 @@ struct idpf_tx_buf {
};
/**
 * struct idpf_tx_stash - TX buffer info saved while awaiting its completion
 * @hlist: hash table node; the table is keyed by the unique completion tag
 *	   assigned at TX encode time so an out-of-order buffer completion
 *	   can locate the buffer
 * @buf: buffer bookkeeping to clean once the buffer completion arrives
 */
struct idpf_tx_stash {
	struct hlist_node hlist;
	struct idpf_tx_buf buf;
};
/**
......@@ -209,6 +216,7 @@ struct idpf_tx_splitq_params {
struct idpf_tx_offload_params offload;
};
#define IDPF_TX_COMPLQ_CLEAN_BUDGET 256
#define IDPF_TX_MIN_PKT_LEN 17
#define IDPF_TX_DESCS_FOR_SKB_DATA_PTR 1
#define IDPF_TX_DESCS_PER_CACHE_LINE (L1_CACHE_BYTES / \
......@@ -362,12 +370,16 @@ struct idpf_rx_ptype_decoded {
* @__IDPF_RFLQ_GEN_CHK: Refill queues are SW only, so Q_GEN acts as the HW bit
* and RFLGQ_GEN is the SW bit.
* @__IDPF_Q_FLOW_SCH_EN: Enable flow scheduling
* @__IDPF_Q_SW_MARKER: Used to indicate TX queue marker completions
* @__IDPF_Q_POLL_MODE: Enable poll mode
* @__IDPF_Q_FLAGS_NBITS: Must be last
*/
enum idpf_queue_flags_t {
	__IDPF_Q_GEN_CHK,
	__IDPF_RFLQ_GEN_CHK,
	__IDPF_Q_FLOW_SCH_EN,
	__IDPF_Q_SW_MARKER,	/* TXQ is waiting on a HW marker completion */
	__IDPF_Q_POLL_MODE,	/* set while queue IRQs are disabled (queue drain) */
	__IDPF_Q_FLAGS_NBITS,	/* must be last: sizes the flags bitmap */
};
......@@ -418,6 +430,7 @@ struct idpf_intr_reg {
* @intr_reg: See struct idpf_intr_reg
* @num_txq: Number of TX queues
* @tx: Array of TX queues to service
* @tx_dim: Data for TX net_dim algorithm
* @tx_itr_value: TX interrupt throttling rate
* @tx_intr_mode: Dynamic ITR or not
* @tx_itr_idx: TX ITR index
......@@ -428,6 +441,7 @@ struct idpf_intr_reg {
* @rx_itr_idx: RX ITR index
* @num_bufq: Number of buffer queues
* @bufq: Array of buffer queues to service
* @total_events: Number of interrupts processed
* @name: Queue vector name
*/
struct idpf_q_vector {
......@@ -439,6 +453,7 @@ struct idpf_q_vector {
u16 num_txq;
struct idpf_queue **tx;
struct dim tx_dim;
u16 tx_itr_value;
bool tx_intr_mode;
u32 tx_itr_idx;
......@@ -452,6 +467,7 @@ struct idpf_q_vector {
u16 num_bufq;
struct idpf_queue **bufq;
u16 total_events;
char *name;
};
......@@ -460,6 +476,8 @@ struct idpf_rx_queue_stats {
};
struct idpf_tx_queue_stats {
u64_stats_t packets;
u64_stats_t bytes;
u64_stats_t lso_pkts;
u64_stats_t linearize;
u64_stats_t q_busy;
......@@ -467,6 +485,11 @@ struct idpf_tx_queue_stats {
u64_stats_t dma_map_errs;
};
/**
 * struct idpf_cleaned_stats - counters gathered during one TX clean pass
 * @packets: number of packets cleaned
 * @bytes: number of bytes cleaned
 */
struct idpf_cleaned_stats {
	u32 packets;
	u32 bytes;
};
union idpf_queue_stats {
struct idpf_rx_queue_stats rx;
struct idpf_tx_queue_stats tx;
......@@ -474,9 +497,16 @@ union idpf_queue_stats {
#define IDPF_ITR_DYNAMIC 1
#define IDPF_ITR_20K 0x0032
#define IDPF_ITR_GRAN_S 1 /* Assume ITR granularity is 2us */
#define IDPF_ITR_MASK 0x1FFE /* ITR register value alignment mask */
#define ITR_REG_ALIGN(setting) ((setting) & IDPF_ITR_MASK)
#define IDPF_ITR_IS_DYNAMIC(itr_mode) (itr_mode)
#define IDPF_ITR_TX_DEF IDPF_ITR_20K
#define IDPF_ITR_RX_DEF IDPF_ITR_20K
/* Index used for 'No ITR' update in DYN_CTL register */
#define IDPF_NO_ITR_UPDATE_IDX 3
#define IDPF_ITR_IDX_SPACING(spacing, dflt) (spacing ? spacing : dflt)
#define IDPF_DIM_DEFAULT_PROFILE_IX 1
/**
* struct idpf_queue
......@@ -512,6 +542,15 @@ union idpf_queue_stats {
* @flags: See enum idpf_queue_flags_t
* @q_stats: See union idpf_queue_stats
* @stats_sync: See struct u64_stats_sync
* @cleaned_bytes: Splitq only, TXQ only: When a TX completion is received on
* the TX completion queue, it can be for any TXQ associated
* with that completion queue. This means we can clean up to
* N TXQs during a single call to clean the completion queue.
* cleaned_bytes|pkts tracks the clean stats per TXQ during
* that single call to clean the completion queue. By doing so,
* we can update BQL with aggregate cleaned stats for each TXQ
* only once at the end of the cleaning routine.
* @cleaned_pkts: Number of packets cleaned for the above said case
* @rx_hsplit_en: RX headsplit enable
* @rx_hbuf_size: Header buffer size
* @rx_buf_size: Buffer size
......@@ -587,6 +626,9 @@ struct idpf_queue {
union idpf_queue_stats q_stats;
struct u64_stats_sync stats_sync;
u32 cleaned_bytes;
u16 cleaned_pkts;
bool rx_hsplit_en;
u16 rx_hbuf_size;
u16 rx_buf_size;
......
......@@ -645,6 +645,36 @@ static int idpf_wait_for_event(struct idpf_adapter *adapter,
IDPF_WAIT_FOR_EVENT_TIMEO);
}
/**
 * idpf_wait_for_marker_event - wait for software marker response
 * @vport: virtual port data structure
 *
 * Returns 0 on success, negative on failure.
 **/
static int idpf_wait_for_marker_event(struct idpf_vport *vport)
{
	int drained;
	int q;

	/* Tell the clean path of every TX queue to watch for the HW marker
	 * completion that signals the queue has been drained.
	 */
	for (q = 0; q < vport->num_txq; q++)
		set_bit(__IDPF_Q_SW_MARKER, vport->txqs[q]->flags);

	drained = wait_event_timeout(vport->sw_marker_wq,
				     test_and_clear_bit(IDPF_VPORT_SW_MARKER,
							vport->flags),
				     msecs_to_jiffies(500));

	/* Drain finished (or timed out): take the queues back out of the
	 * poll mode that idpf_send_disable_queues_msg() put them in.
	 */
	for (q = 0; q < vport->num_txq; q++)
		clear_bit(__IDPF_Q_POLL_MODE, vport->txqs[q]->flags);

	if (!drained) {
		dev_warn(&vport->adapter->pdev->dev, "Failed to receive marker packets\n");

		return -ETIMEDOUT;
	}

	return 0;
}
/**
* idpf_send_ver_msg - send virtchnl version message
* @adapter: Driver specific private structure
......@@ -1936,7 +1966,23 @@ int idpf_send_enable_queues_msg(struct idpf_vport *vport)
*/
int idpf_send_disable_queues_msg(struct idpf_vport *vport)
{
	int ret, q;

	ret = idpf_send_ena_dis_queues_msg(vport, VIRTCHNL2_OP_DISABLE_QUEUES);
	if (ret)
		return ret;

	/* Interrupts go away once the disable-queues virtchnl message has
	 * been sent, so flip every TX queue into poll mode...
	 */
	for (q = 0; q < vport->num_txq; q++)
		set_bit(__IDPF_Q_POLL_MODE, vport->txqs[q]->flags);

	/* ...and kick each vector's napi so the marker packets get seen */
	for (q = 0; q < vport->num_q_vectors; q++)
		napi_schedule(&vport->q_vectors[q].napi);

	return idpf_wait_for_marker_event(vport);
}
/**
......@@ -2813,6 +2859,7 @@ void idpf_vport_init(struct idpf_vport *vport, struct idpf_vport_max_q *max_q)
struct idpf_adapter *adapter = vport->adapter;
struct virtchnl2_create_vport *vport_msg;
struct idpf_vport_config *vport_config;
u16 tx_itr[] = {2, 8, 64, 128, 256};
struct idpf_rss_data *rss_data;
u16 idx = vport->idx;
......@@ -2837,6 +2884,9 @@ void idpf_vport_init(struct idpf_vport *vport, struct idpf_vport_max_q *max_q)
ether_addr_copy(vport->default_mac_addr, vport_msg->default_mac_addr);
vport->max_mtu = le16_to_cpu(vport_msg->max_mtu) - IDPF_PACKET_HDR_PAD;
/* Initialize Tx profiles for Dynamic Interrupt Moderation */
memcpy(vport->tx_itr_profile, tx_itr, IDPF_DIM_PROFILE_SLOTS);
idpf_vport_init_num_qs(vport, vport_msg);
idpf_vport_calc_num_q_desc(vport);
idpf_vport_calc_num_q_groups(vport);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment