Commits · f845a027de66d1efeb8cb22020e2c50baebdc441 · Kirill Smelkov / linux

12 Sep, 2024 24 commits

net: ethernet: oa_tc6: enable open alliance tc6 data communication · f845a027

Parthiban Veerasooran authored Sep 09, 2024

Enabling Configuration Synchronization bit (SYNC) in the Configuration
Register #0 enables data communication in the MAC-PHY. The state of this
bit is reflected in the data footer SYNC bit.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-9-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f845a027

net: phy: microchip_t1s: add c45 direct access in LAN865x internal PHY · 18a91876

Parthiban Veerasooran authored Sep 09, 2024

This patch adds c45 registers direct access support in Microchip's
LAN865x internal PHY.

OPEN Alliance 10BASE-T1x compliance MAC-PHYs will have both C22 and C45
registers space. If the PHY is discovered via C22 bus protocol it assumes
it uses C22 protocol and always uses C22 registers indirect access to
access C45 registers. This is because, we don't have a clean separation
between C22/C45 register space and C22/C45 MDIO bus protocols. Resulting,
PHY C45 registers direct access can't be used which can save multiple SPI
bus access. To support this feature, set .read_mmd/.write_mmd in the PHY
driver to call .read_c45/.write_c45 in the OPEN Alliance framework
drivers/net/ethernet/oa_tc6.c
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-8-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

18a91876

net: ethernet: oa_tc6: implement internal PHY initialization · 8f9bf857

Parthiban Veerasooran authored Sep 09, 2024

Internal PHY is initialized as per the PHY register capability supported
by the MAC-PHY. Direct PHY Register Access Capability indicates if PHY
registers are directly accessible within the SPI register memory space.
Indirect PHY Register Access Capability indicates if PHY registers are
indirectly accessible through the MDIO/MDC registers MDIOACCn defined in
OPEN Alliance specification. Currently the direct register access is only
supported.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-7-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8f9bf857

net: ethernet: oa_tc6: implement error interrupts unmasking · 86c03a0f

Parthiban Veerasooran authored Sep 09, 2024

This will unmask the following error interrupts from the MAC-PHY.
  tx protocol error
  rx buffer overflow error
  loss of framing error
  header error
The MAC-PHY will signal an error by setting the EXST bit in the receive
data footer which will then allow the host to read the STATUS0 register
to find the source of the error.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-6-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

86c03a0f

net: ethernet: oa_tc6: implement software reset · 1f9c4eed

Parthiban Veerasooran authored Sep 09, 2024

Reset complete bit is set when the MAC-PHY reset completes and ready for
configuration. Additionally reset complete bit in the STS0 register has
to be written by one upon reset complete to clear the interrupt.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-5-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1f9c4eed

net: ethernet: oa_tc6: implement register read operation · 375d1e02

Parthiban Veerasooran authored Sep 09, 2024

Implement register read operation according to the control communication
specified in the OPEN Alliance 10BASE-T1x MACPHY Serial Interface
document. Control read commands are used by the SPI host to read
registers within the MAC-PHY. Each control read commands are composed of
a 32 bits control command header.

The MAC-PHY ignores all data from the SPI host following the control
header for the remainder of the control read command. Control read
commands can read either a single register or multiple consecutive
registers. When multiple consecutive registers are read, the address is
automatically post-incremented by the MAC-PHY. Reading any unimplemented
or undefined registers shall return zero.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-4-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

375d1e02

net: ethernet: oa_tc6: implement register write operation · aa58bec0

Parthiban Veerasooran authored Sep 09, 2024

Implement register write operation according to the control communication
specified in the OPEN Alliance 10BASE-T1x MACPHY Serial Interface
document. Control write commands are used by the SPI host to write
registers within the MAC-PHY. Each control write commands are composed of
a 32 bits control command header followed by register write data.

The MAC-PHY ignores the final 32 bits of data from the SPI host at the
end of the control write command. The write command and data is also
echoed from the MAC-PHY back to the SPI host to enable the SPI host to
identify which register write failed in the case of any bus errors.
Control write commands can write either a single register or multiple
consecutive registers. When multiple consecutive registers are written,
the address is automatically post-incremented by the MAC-PHY. Writing to
any unimplemented or undefined registers shall be ignored and yield no
effect.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-3-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

aa58bec0

Documentation: networking: add OPEN Alliance 10BASE-T1x MAC-PHY serial interface · b3e33f2c

Parthiban Veerasooran authored Sep 09, 2024

The IEEE 802.3cg project defines two 10 Mbit/s PHYs operating over a
single pair of conductors. The 10BASE-T1L (Clause 146) is a long reach
PHY supporting full duplex point-to-point operation over 1 km of single
balanced pair of conductors. The 10BASE-T1S (Clause 147) is a short reach
PHY supporting full / half duplex point-to-point operation over 15 m of
single balanced pair of conductors, or half duplex multidrop bus
operation over 25 m of single balanced pair of conductors.

Furthermore, the IEEE 802.3cg project defines the new Physical Layer
Collision Avoidance (PLCA) Reconciliation Sublayer (Clause 148) meant to
provide improved determinism to the CSMA/CD media access method. PLCA
works in conjunction with the 10BASE-T1S PHY operating in multidrop mode.

The aforementioned PHYs are intended to cover the low-speed / low-cost
applications in industrial and automotive environment. The large number
of pins (16) required by the MII interface, which is specified by the
IEEE 802.3 in Clause 22, is one of the major cost factors that need to be
addressed to fulfil this objective.

The MAC-PHY solution integrates an IEEE Clause 4 MAC and a 10BASE-T1x PHY
exposing a low pin count Serial Peripheral Interface (SPI) to the host
microcontroller. This also enables the addition of Ethernet functionality
to existing low-end microcontrollers which do not integrate a MAC
controller.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Link: https://patch.msgid.link/20240909082514.262942-2-Parthiban.Veerasooran@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b3e33f2c

Merge branch 'device-memory-tcp' · e331673a

Jakub Kicinski authored Sep 11, 2024

Mina Almasry says:

====================
Device Memory TCP

Device memory TCP (devmem TCP) is a proposal for transferring data
to and/or from device memory efficiently, without bouncing the data
to a host memory buffer.

* Problem:

A large amount of data transfers have device memory as the source
and/or destination. Accelerators drastically increased the volume
of such transfers. Some examples include:

- ML accelerators transferring large amounts of training data from storage
into GPU/TPU memory. In some cases ML training setup time can be as long
as 50% of TPU compute time, improving data transfer throughput &
efficiency can help improving GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different
hosts, exchange data among them.

- Distributed raw block storage applications transfer large amounts of
data with remote SSDs, much of this data does not require host
processing.

Today, the majority of the Device-to-Device data transfers the network
are implemented as the following low level operations: Device-to-Host
copy, Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers,
and can put significant strains on system resources, such as host memory
bandwidth, PCIe bandwidth, etc. One important reason behind the current
state is the kernel’s lack of semantics to express device to network
transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive
and from device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP
stack normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through
the root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, networking stack (skbs, drivers,
and page pool) operate on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.

** part 3: page pool support

We piggy back on page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.

** part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes
throughput the networking stack to correctly handle skbs with unreadable
frags.

** Part 5: recvmsg() APIs

We define user APIs for the user to send and receive device memory.

Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP require the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using device memory
for the packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering support & RSS
support, i.e. the NIC's ability to steer flows into certain rx queues.
This allows the sysadmin to enable devmem TCP on a subset of the rx
queues, and steer devmem TCP traffic onto these queues and non devmem
TCP elsewhere.

The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would
suffice. I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that show a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.

** Test Setup

Kernel: net-next with this series and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.
====================

Link: https://patch.msgid.link/20240910171458.219195-1-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e331673a

netdev: add dmabuf introspection · d0caf987

Mina Almasry authored Sep 10, 2024

Add dmabuf information to page_pool stats:

$ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get
...
 {'dmabuf': 10,
  'id': 456,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 455,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 454,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 453,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 452,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 451,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 450,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 449,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},

And queue stats:

$ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get
...
{'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'},
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-14-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d0caf987

selftests: add ncdevmem, netcat for devmem TCP · 85585b4b

Mina Almasry authored Sep 10, 2024

ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it
sends and receives data using the devmem TCP APIs. It uses udmabuf as
the dmabuf provider. It is compatible with a regular netcat running on
a peer, or a ncdevmem running on a peer.

In addition to normal netcat support, ncdevmem has a validation mode,
where it sends a specific pattern and validates this pattern on the
receiver side to ensure data integrity.
Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20240910171458.219195-13-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

85585b4b

net: add devmem TCP documentation · 09d1db26

Mina Almasry authored Sep 10, 2024

Add documentation outlining the usage and details of devmem TCP.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20240910171458.219195-12-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

09d1db26

net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags · 678f6e28

Mina Almasry authored Sep 10, 2024

Add an interface for the user to notify the kernel that it is done
reading the devmem dmabuf frags returned as cmsg. The kernel will
drop the reference on the frags to make them available for reuse.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240910171458.219195-11-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

678f6e28

tcp: RX path for devmem TCP · 8f0b3cc9

Mina Almasry authored Sep 10, 2024

In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvmsg_devmem() for custom handling.

tcp_recvmsg_devmem() copies any data in the skb header to the linear
buffer, and returns a cmsg to the user indicating the number of bytes
returned in the linear buffer.

tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
and returns to the user a cmsg_devmem indicating the location of the
data in the dmabuf device memory. cmsg_devmem contains this information:

1. the offset into the dmabuf where the payload starts. 'frag_offset'.
2. the size of the frag. 'frag_size'.
3. an opaque token 'frag_token' to return to the kernel when the buffer
is to be released.

The pages awaiting freeing are stored in the newly added
sk->sk_user_frags, and each page passed to userspace is get_page()'d.
This reference is dropped once the userspace indicates that it is
done reading this page.  All pages are released when the socket is
destroyed.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8f0b3cc9

net: add support for skbs with unreadable frags · 65249feb

Mina Almasry authored Sep 10, 2024

For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and unaccessible to the host. We expect there to be no mixing and
matching of device memory frags (unaccessible) with host memory frags
(accessible) in the same skb.

Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.

__skb_fill_netmem_desc() now checks frags added to skbs for net_iov,
and marks the skb as skb->devmem accordingly.

Add checks through the network stack to avoid accessing the frags of
devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-9-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

65249feb

net: support non paged skb frags · 9f6b619e

Mina Almasry authored Sep 10, 2024

Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-8-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9f6b619e

memory-provider: dmabuf devmem memory provider · 0f921404

Mina Almasry authored Sep 10, 2024

Implement a memory provider that allocates dmabuf devmem in the form of
net_iov.

The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the net_iov.

The provider obtains a reference on the netdev_dmabuf_binding which
guarantees the binding and the underlying mapping remains alive until
the provider is destroyed.

Usage of PP_FLAG_DMA_MAP is required for this memory provide such that
the page_pool can provide the driver with the dma-addrs of the devmem.

Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order !=
0.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-7-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0f921404

page_pool: devmem support · 8ab79ed5

Mina Almasry authored Sep 10, 2024

Convert netmem to be a union of struct page and struct netmem. Overload
the LSB of struct netmem* to indicate that it's a net_iov, otherwise
it's a page.

Currently these entries in struct page are rented by the page_pool and
used exclusively by the net stack:

struct {
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long _pp_mapping_pad;
	unsigned long dma_addr;
	atomic_long_t pp_ref_count;
};

Mirror these (and only these) entries into struct net_iov and implement
netmem helpers that can access these common fields regardless of
whether the underlying type is page or net_iov.

Implement checks for net_iov in netmem helpers which delegate to mm
APIs, to ensure net_iov are never passed to the mm stack.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-6-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8ab79ed5

netdev: netdevice devmem allocator · 28c5c74e

Mina Almasry authored Sep 10, 2024

Implement netdev devmem allocator. The allocator takes a given struct
netdev_dmabuf_binding as input and allocates net_iov from that
binding.

The allocation simply delegates to the binding's genpool for the
allocation logic and wraps the returned memory region in a net_iov
struct.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-5-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

28c5c74e

netdev: support binding dma-buf to netdevice · 170aafe3

Mina Almasry authored Sep 10, 2024

Add a netdev_dmabuf_binding struct which represents the
dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
rx queues on the netdevice. On the binding, the dma_buf_attach
& dma_buf_map_attachment will occur. The entries in the sg_table from
mapping will be inserted into a genpool to make it ready
for allocation.

The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
holds the dma-buf offset of the base of the chunk and the dma_addr of
the chunk. Both are needed to use allocations that come from this chunk.

We create a new type that represents an allocation from the genpool:
net_iov. We setup the net_iov allocation size in the
genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
allocated by the page pool and given to the drivers.

The user can unbind the dmabuf from the netdevice by closing the netlink
socket that established the binding. We do this so that the binding is
automatically unbound even if the userspace process crashes.

The binding and unbinding leaves an indicator in struct netdev_rx_queue
that the given queue is bound, and the binding is actuated by resetting
the rx queue using the queue API.

The netdev_dmabuf_binding struct is refcounted, and releases its
resources only when all the refs are released.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # excluding netlink
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-4-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

170aafe3

net: netdev netlink api to bind dma-buf to a net device · 3efd7ab4

Mina Almasry authored Sep 10, 2024

API takes the dma-buf fd as input, and binds it to the netdevice. The
user can specify the rx queues to bind the dma-buf to.
Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-3-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3efd7ab4

netdev: add netdev_rx_queue_restart() · 7c88f865

Mina Almasry authored Sep 10, 2024

Add netdev_rx_queue_restart(), which resets an rx queue using the
queue API recently merged[1].

The queue API was merged to enable the core net stack to reset individual
rx queues to actuate changes in the rx queue's configuration. In later
patches in this series, we will use netdev_rx_queue_restart() to reset
rx queues after binding or unbinding dmabuf configuration, which will
cause reallocation of the page_pool to repopulate its memory using the
new configuration.

[1] https://lore.kernel.org/netdev/20240430231420.699177-1-shailend@google.com/T/Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-2-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

7c88f865

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 24b8c193

Jakub Kicinski authored Sep 11, 2024

Tony Nguyen says:

====================
idpf: XDP chapter II: convert Tx completion to libeth

Alexander Lobakin says:

XDP for idpf is currently 5 chapters:
* convert Rx to libeth;
* convert Tx completion to libeth (this);
* generic XDP and XSk code changes;
* actual XDP for idpf via libeth_xdp;
* XSk for idpf (^).

Part II does the following:
* adds generic libeth Tx completion routines;
* converts idpf to use generic libeth Tx comp routines;
* fixes Tx queue timeouts and robustifies Tx completion in general;
* fixes Tx event/descriptor flushes (writebacks).

Most idpf patches again remove more lines than adds.
Generic Tx completion helpers and structs are needed as libeth_xdp
(Ch. III) makes use of them. WB_ON_ITR is needed since XDPSQs don't
want to work without it at all. Tx queue timeouts fixes are needed
since without them, it's way easier to catch a Tx timeout event when
WB_ON_ITR is enabled.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: enable WB_ON_ITR
  idpf: fix netdev Tx queue stop/wake
  idpf: refactor Tx completion routines
  netdevice: add netdev_tx_reset_subqueue() shorthand
  idpf: convert to libeth Tx buffer completion
  libeth: add Tx buffer completion helpers
====================

Link: https://patch.msgid.link/20240909205323.3110312-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

24b8c193

net: phy: microchip_t1: Cable Diagnostics for lan887x · b2c8a506

Divya Koppera authored Sep 09, 2024

Add support for cable diagnostics in lan887x PHY.
Using this we can diagnose connected/open/short wires and
also length where cable fault is occurred.
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240909114339.3446-1-divya.koppera@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b2c8a506

11 Sep, 2024 16 commits

net: ethtool: phy: Check the req_info.pdn field for GET commands · fce1e9f8

Maxime Chevallier authored Sep 10, 2024

When processing the netlink GET requests to get PHY info, the req_info.pdn
pointer is NULL when no PHY matches the requested parameters, such as when
the phy_index is invalid, or there's simply no PHY attached to the
interface.

Therefore, check the req_info.pdn pointer for NULL instead of
dereferencing it.
Suggested-by: Eric Dumazet <edumazet@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iKRW0WpGAh1tKqY345D8WkYCPm3Y9ym--Si42JZrQAu1g@mail.gmail.com/T/#mfced87d607d18ea32b3b4934dfa18d7b36669285
Fixes: 17194be4 ("net: ethtool: Introduce a command to list PHYs on an interface")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240910174636.857352-1-maxime.chevallier@bootlin.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fce1e9f8

net: gianfar: fix NVMEM mac address · b2d95440

Rosen Penev authored Sep 10, 2024

If nvmem loads after the ethernet driver, mac address assignments will
not take effect. of_get_ethdev_address returns EPROBE_DEFER in such a
case so we need to handle that to avoid eth_hw_addr_random.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240910220913.14101-1-rosenp@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b2d95440

sfc: Add X4 PF support · cf06766f

Jonathan Cooper authored Sep 10, 2024

Add X4 series. Most functionality is the same as previous
EF10 nics but enough is different to warrant a new nic type struct
and revision; for example legacy interrupts and SRIOV are
not supported.

Most removed features will be re-added later as new implementations.
Signed-off-by: Jonathan Cooper <jonathan.s.cooper@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Link: https://patch.msgid.link/20240910153014.12803-1-jonathan.s.cooper@amd.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

cf06766f

qlcnic: make read-only const array key static · af647fe2

Colin Ian King authored Sep 10, 2024

Don't populate the const read-only array key on the stack at
run time, instead make it static.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240910120635.115266-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

af647fe2

Merge branch 'mptcp-fallback-to-tcp-after-3-mpc-drop-cache' · 9ee92621

Jakub Kicinski authored Sep 11, 2024

Matthieu Baerts says:

====================
mptcp: fallback to TCP after 3 MPC drop + cache

The SYN + MPTCP_CAPABLE packets could be explicitly dropped by firewalls
somewhere in the network, e.g. if they decide to drop packets based on
the TCP options, instead of stripping them off.

The idea of this series is to fallback to TCP after 3 SYN+MPC drop
(patch 2). If the connection succeeds after the fallback, it very likely
means a blackhole has been detected. In this case (patch 3), MPTCP can
be disabled for a certain period of time, 1h by default. If after this
period, MPTCP is still blocked, the period is doubled. This technique is
inspired by the one used by TCP FastOpen.

This should help applications which want to use MPTCP by default on the
client side if available.
====================

Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-0-da7ebb4cd2a3@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9ee92621

mptcp: disable active MPTCP in case of blackhole · 27069e7c

Matthieu Baerts (NGI0) authored Sep 09, 2024

An MPTCP firewall blackhole can be detected if the following SYN
retransmission after a fallback to "plain" TCP is accepted.

In case of blackhole, a similar technique to the one in place with TFO
is now used: MPTCP can be disabled for a certain period of time, 1h by
default. This time period will grow exponentially when more blackhole
issues get detected right after MPTCP is re-enabled and will reset to
the initial value when the blackhole issue goes away.

The blackhole period can be modified thanks to a new sysctl knob:
blackhole_timeout. Two new MIB counters help understanding what's
happening:

- 'Blackhole', incremented when a blackhole is detected.
- 'MPCapableSYNTXDisabled', incremented when an MPTCP connection
  directly falls back to TCP during the blackhole period.

Because the technique is inspired by the one used by TFO, an important
part of the new code is similar to what can find in tcp_fastopen.c, with
some adaptations to the MPTCP case.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/57Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-3-da7ebb4cd2a3@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

27069e7c

mptcp: fallback to TCP after SYN+MPC drops · 6982826f

Matthieu Baerts (NGI0) authored Sep 09, 2024

Some middleboxes might be nasty with MPTCP, and decide to drop packets
with MPTCP options, instead of just dropping the MPTCP options (or
letting them pass...).

In this case, it sounds better to fallback to "plain" TCP after 2
retransmissions, and try again.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/477Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-2-da7ebb4cd2a3@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6982826f

mptcp: export mptcp_subflow_early_fallback() · 65b02260

Matthieu Baerts (NGI0) authored Sep 09, 2024

This helper will be used outside protocol.h in the following commit.

While at it, also add a 'pr_fallback()' debug print, to help identifying
fallbacks.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-1-da7ebb4cd2a3@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

65b02260

Merge branch 'net-hsr-use-the-seqnr-lock-for-frames-received-via-interlink-port' · 8b5d2e5c

Jakub Kicinski authored Sep 11, 2024

Sebastian Andrzej Siewior says:

====================
net: hsr: Use the seqnr lock for frames received via interlink port.

This is follow-up to the thread at
https://lore.kernel.org/all/20240904133725.1073963-1-edumazet@google.com/
====================

Link: https://patch.msgid.link/20240906132816.657485-1-bigeasy@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8b5d2e5c

net: hsr: Remove interlink_sequence_nr. · 35e24f28

Eric Dumazet authored Sep 06, 2024

Remove interlink_sequence_nr which is unused.

[ bigeasy: split out from Eric's patch ].
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240906132816.657485-3-bigeasy@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

35e24f28

net: hsr: Use the seqnr lock for frames received via interlink port. · 430d67bd

Sebastian Andrzej Siewior authored Sep 06, 2024

syzbot reported that the seqnr_lock is not acquire for frames received
over the interlink port. In the interlink case a new seqnr is generated
and assigned to the frame.
Frames, which are received over the slave port have already a sequence
number assigned so the lock is not required.

Acquire the hsr_priv::seqnr_lock during in the invocation of
hsr_forward_skb() if a packet has been received from the interlink port.

Reported-by: syzbot+3d602af7549af539274e@syzkaller.appspotmail.com
Closes: https://groups.google.com/g/syzkaller-bugs/c/KppVvGviGg4/m/EItSdCZdBAAJ
Fixes: 5055cccf ("net: hsr: Provide RedBox support (HSR-SAN)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lukasz Majewski <lukma@denx.de>
Tested-by: Lukasz Majewski <lukma@denx.de>
Link: https://patch.msgid.link/20240906132816.657485-2-bigeasy@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

430d67bd

Merge tag 'wireless-next-2024-09-11' of... · a18c097e

Jakub Kicinski authored Sep 11, 2024

Merge tag 'wireless-next-2024-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.12

The last -next "new features" pull request for v6.12. The stack now
supports DFS on MLO but otherwise nothing really standing out.

Major changes:

cfg80211/mac80211
 * EHT rate support in AQL airtime
 * DFS support for MLO

rtw89
 * complete BT-coexistence code for RTL8852BT
 * RTL8922A WoWLAN net-detect support

* tag 'wireless-next-2024-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (105 commits)
  wifi: brcmfmac: cfg80211: Convert comma to semicolon
  wifi: rsi: Remove an unused field in struct rsi_debugfs
  wifi: libertas: Cleanup unused declarations
  wifi: wilc1000: Convert using devm_clk_get_optional_enabled() in wilc_bus_probe()
  wifi: wilc1000: Convert using devm_clk_get_optional_enabled() in wilc_sdio_probe()
  wifi: wilc1000: fix potential RCU dereference issue in wilc_parse_join_bss_param
  wifi: mwifiex: Fix memcpy() field-spanning write warning in mwifiex_cmd_802_11_scan_ext()
  wifi: mac80211: use two-phase skb reclamation in ieee80211_do_stop()
  wifi: cfg80211: fix two more possible UBSAN-detected off-by-one errors
  wifi: cfg80211: fix kernel-doc for per-link data
  wifi: mt76: mt7925: replace chan config with extend txpower config for clc
  wifi: mt76: mt7925: fix a potential array-index-out-of-bounds issue for clc
  wifi: mt76: mt7615: check devm_kasprintf() returned value
  wifi: mt76: mt7925: convert comma to semicolon
  wifi: mt76: mt7925: fix a potential association failure upon resuming
  wifi: mt76: Avoid multiple -Wflex-array-member-not-at-end warnings
  wifi: mt76: mt7921: Check devm_kasprintf() returned value
  wifi: mt76: mt7915: check devm_kasprintf() returned value
  wifi: mt76: mt7915: avoid long MCU command timeouts during SER
  wifi: mt76: mt7996: fix uninitialized TLV data
  ...
====================

Link: https://patch.msgid.link/20240911084147.A205DC4AF0F@smtp.kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a18c097e

Merge branch 'lan743x-phylink' · bf73478b

David S. Miller authored Sep 11, 2024

Raju Lakkaraju says:

====================
Add support to PHYLINK for LAN743x/PCI11x1x chips

This is the follow-up patch series of
https://lkml.iu.edu/hypermail/linux/kernel/2310.2/02078.html

Divide the PHYLINK adaptation and SFP modifications into two separate patch
series.

The current patch series focuses on transitioning the LAN743x driver's PHY
support from phylib to phylink.

Tested on PCI11010 Rev-1 Evaluation board

Change List:
============
V5 -> V6:
  - Remove the lan743x_find_max_speed( ) function. Not require
  - Add EEE enable check before calling lan743x_mac_eee_enable( ) function
V4 -> V5:
  - Remove the fixed_phy_unregister( ) function. Not require
  - Remove the "phydev->eee_enabled" check to update the MAC EEE
    enable/disable
  - Call lan743x_mac_eee_enable() with true after update tx_lpi_timer.
  - Add phy_support_eee() to initialize the EEE flags
V3 -> V4:
  - Add fixed-link patch along with this series.
    Note: Note: This code was developed by Mr.Russell King
    Ref:
    https://lore.kernel.org/netdev/LV8PR11MB8700C786F5F1C274C73036CC9F8E2@LV8PR11MB8700.namprd11.prod.outlook.com/T/#me943adf54f1ea082edf294aba448fa003a116815
  - Change phylink fixed-link function header's string from "Returns" to
    "Returns:"
  - Remove the EEE private variable from LAN743x adapter strcture and fix the
    EEE's set/get functions
  - set the individual caps (i.e. _RGMII, _RGMII_ID, _RGMII_RXID and
    __RGMII_TXID) replace with phy_interface_set_rgmii( ) function
  - Change lan743x_set_eee( ) to lan743x_mac_eee_enable( )

V2 -> V3:
  - Remove the unwanted parens in each of these if() sub-blocks
  - Replace "to_net_dev(config->dev)" with "netdev".
  - Add GMII_ID/RGMII_TXID/RGMII_RXID in supported_interfaces
  - Fix the lan743x_phy_handle_exists( ) return type

V1 -> V2:
  - Fix the Russell King's comments i.e. remove the speed, duplex update in
    lan743x_phylink_mac_config( )
  - pre-March 2020 legacy support has been removed

V0 -> V1:
  - Integrate with Synopsys DesignWare XPCS drivers
  - Based on external review comments,
  - Changes made to SGMII interface support only 1G/100M/10M bps speed
  - Changes made to 2500Base-X interface support only 2.5Gbps speed
  - Add check for not is_sgmii_en with is_sfp_support_en support
  - Change the "pci11x1x_strap_get_status" function return type from void to
    int
  - Add ethtool phylink wol, eee, pause get/set functions
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bf73478b

net: lan743x: Add support to ethtool phylink get and set settings · f95f28d7

Raju Lakkaraju authored Sep 06, 2024

Add support to ethtool phylink functions:
  - get/set settings like speed, duplex etc
  - get/set the wake-on-lan (WOL)
  - get/set the energy-efficient ethernet (EEE)
  - get/set the pause
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f95f28d7

net: lan743x: Migrate phylib to phylink · a5f199a8

Raju Lakkaraju authored Sep 06, 2024

Migrate phy support from phylib to phylink.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5f199a8

net: lan743x: Create separate Link Speed Duplex state function · 92b740a4

Raju Lakkaraju authored Sep 06, 2024

Create separate Link Speed Duplex (LSD) update state function from
lan743x_sgmii_config () to use as subroutine.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92b740a4