Commit 03db3a2d authored by Matan Barak, committed by Doug Ledford

IB/core: Add RoCE GID table management

RoCE GIDs are based on IP addresses configured on Ethernet net-devices
which relate to the RDMA (RoCE) device port.

Currently, each of the low-level drivers that support RoCE (ocrdma,
mlx4) manages its own RoCE port GID table. As there's nothing which is
essentially vendor specific, we generalize that, and enhance the RDMA
core GID cache to do this job.

In order to populate the GID table, we listen for events:

(a) netdev up/down/change_addr events - if a netdev is built onto
    our RoCE device, we need to add/delete its IPs. This involves
    adding all GIDs related to this ndev, adding default GIDs, etc.

(b) inet events - add new GIDs (according to the IP addresses)
    to the table.

For programming the port RoCE GID table, providers must implement
the add_gid and del_gid callbacks.
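
As a rough illustration of the provider side, the following user-space sketch
shows a pair of callbacks that program a GID entry and keep per-entry state in
a context pointer. The names toy_add_gid, toy_del_gid and struct toy_hw_gid are
hypothetical, and the signatures are simplified relative to the add_gid/del_gid
callbacks in this patch (no ib_device or ib_gid_attr arguments).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-entry state a provider might keep for a GID. */
struct toy_hw_gid {
	unsigned int index;
	uint8_t raw[16];
};

/* Program one GID entry; extra state is handed back via *context. */
static int toy_add_gid(unsigned int port, unsigned int index,
		       const uint8_t gid[16], void **context)
{
	struct toy_hw_gid *hw;

	(void)port;		/* a real provider would select the port here */
	assert(gid && context);
	hw = malloc(sizeof(*hw));
	if (!hw)
		return -ENOMEM;
	hw->index = index;
	memcpy(hw->raw, gid, sizeof(hw->raw));
	*context = hw;
	return 0;
}

/* Remove one GID entry and free whatever add stored in *context. */
static int toy_del_gid(unsigned int port, unsigned int index, void **context)
{
	(void)port;
	(void)index;
	assert(context);
	free(*context);
	*context = NULL;	/* harmless here; the core clears it as well */
	return 0;
}
```

A caller would pass the address of a per-entry pointer to toy_add_gid and hand
the same pointer back to toy_del_gid when the entry is removed.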

RoCE GID management requires us to state the associated net_device
alongside the GID. This information is necessary in order to manage
the GID table. For example, when a net_device is removed, its
associated GIDs need to be removed as well.
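
A minimal user-space sketch of why the net_device is stored next to each GID:
removing a device then reduces to a scan over the table. The toy_* names and
the flat array are assumptions made for illustration, not the data structures
used by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device. */
struct toy_netdev {
	const char *name;
};

struct toy_gid_entry {
	uint8_t gid[16];
	const struct toy_netdev *ndev;	/* NULL marks a free slot */
};

/* Store gid in the first free slot; returns its index or -1 if full. */
static int toy_add_gid(struct toy_gid_entry *tbl, size_t len,
		       const uint8_t gid[16], const struct toy_netdev *ndev)
{
	assert(tbl && gid && ndev);
	for (size_t i = 0; i < len; i++) {
		if (!tbl[i].ndev) {
			memcpy(tbl[i].gid, gid, 16);
			tbl[i].ndev = ndev;
			return (int)i;
		}
	}
	return -1;
}

/* Drop every entry owned by ndev; returns how many were removed. */
static size_t toy_del_all_netdev_gids(struct toy_gid_entry *tbl, size_t len,
				      const struct toy_netdev *ndev)
{
	size_t removed = 0;

	assert(tbl && ndev);
	for (size_t i = 0; i < len; i++) {
		if (tbl[i].ndev == ndev) {
			memset(&tbl[i], 0, sizeof(tbl[i]));
			removed++;
		}
	}
	return removed;
}
```

Entries that belong to other net devices are left untouched by the scan.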

RoCE mandates generating a default GID for each port, based on the
related net-device's IPv6 link local. In contrast to the GID based on
the regular IPv6 link-local (as we generate GID per IP address),
the default GID is also available when the net device is down (in
order to support loopback).
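
For illustration, a default GID of this kind can be sketched in user space as
the fe80::/64 prefix followed by a modified EUI-64 interface identifier built
from the port's MAC address. This assumes the link-local address is the
MAC-derived one; struct toy_gid and toy_make_default_gid are hypothetical names
that do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte union ib_gid. */
struct toy_gid {
	uint8_t raw[16];
};

/*
 * Fill gid with fe80::/64 plus the modified EUI-64 interface ID:
 * the MAC with ff:fe inserted in the middle and the
 * universal/local bit of the first octet flipped.
 */
static void toy_make_default_gid(const uint8_t mac[6], struct toy_gid *gid)
{
	assert(mac && gid);
	memset(gid->raw, 0, sizeof(gid->raw));
	gid->raw[0] = 0xfe;
	gid->raw[1] = 0x80;
	gid->raw[8] = mac[0] ^ 0x02;
	gid->raw[9] = mac[1];
	gid->raw[10] = mac[2];
	gid->raw[11] = 0xff;
	gid->raw[12] = 0xfe;
	gid->raw[13] = mac[3];
	gid->raw[14] = mac[4];
	gid->raw[15] = mac[5];
}
```

Because the value depends only on the MAC address, it can be computed even
while the net device is down.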

Locking is done as follows:
The patch modifies the GID table code both for new RoCE drivers
implementing the add_gid/del_gid callbacks and for current RoCE and
IB drivers that do not. The flows for updating the table are
different, so the locking requirements are different too.

While updating the RoCE GID table, protection against multiple writers
is achieved via mutex_lock(&table->lock). Since writing to a table
requires us to find an entry (possibly a free entry) in the table and
then modify it, this mutex protects both the find_gid and write_gid,
ensuring the atomicity of the action.
Each entry in the GID cache is protected by an rwlock. In RoCE, writing
(which usually results from a netdev notifier) involves invoking the
vendor's add_gid and del_gid callbacks, which could sleep.
Therefore, an invalid flag is added for each entry, so that an entry
can be marked as being updated while the callback runs. Updates for
RoCE are done via a workqueue, thus sleeping is permitted.
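
The writer flow described above can be sketched in user space roughly as
follows. This is a simplified model under stated assumptions, not the code in
cache.c: pthread mutexes stand in for both the table mutex and the per-entry
rwlock, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_TABLE_LEN 4

/* Hypothetical provider hook; in the kernel this may sleep. */
typedef int (*toy_hw_add_fn)(unsigned int index, const uint8_t gid[16]);

struct toy_entry {
	pthread_mutex_t lock;	/* stands in for the per-entry rwlock */
	bool used;
	bool invalid;		/* true while the provider hook runs */
	uint8_t gid[16];
};

struct toy_table {
	pthread_mutex_t lock;	/* serializes writers across find + write */
	struct toy_entry e[TOY_TABLE_LEN];
};

static void toy_table_init(struct toy_table *t)
{
	memset(t, 0, sizeof(*t));
	pthread_mutex_init(&t->lock, NULL);
	for (int i = 0; i < TOY_TABLE_LEN; i++)
		pthread_mutex_init(&t->e[i].lock, NULL);
}

/* Caller holds t->lock. gid == NULL searches for a free slot. */
static int toy_find(struct toy_table *t, const uint8_t *gid)
{
	for (int i = 0; i < TOY_TABLE_LEN; i++) {
		if (gid ? (t->e[i].used && !memcmp(t->e[i].gid, gid, 16))
			: !t->e[i].used)
			return i;
	}
	return -1;
}

static int toy_table_add(struct toy_table *t, const uint8_t gid[16],
			 toy_hw_add_fn hw_add)
{
	int ix, ret = 0;

	assert(t && gid && hw_add);
	pthread_mutex_lock(&t->lock);
	if (toy_find(t, gid) >= 0)
		goto out;		/* already present */
	ix = toy_find(t, NULL);
	if (ix < 0) {
		ret = -1;		/* table full */
		goto out;
	}
	/* Flag the entry, then drop the entry lock around the hook. */
	pthread_mutex_lock(&t->e[ix].lock);
	t->e[ix].invalid = true;
	pthread_mutex_unlock(&t->e[ix].lock);

	ret = hw_add((unsigned int)ix, gid);

	pthread_mutex_lock(&t->e[ix].lock);
	if (!ret) {
		memcpy(t->e[ix].gid, gid, 16);
		t->e[ix].used = true;
	}
	t->e[ix].invalid = false;
	pthread_mutex_unlock(&t->e[ix].lock);
out:
	pthread_mutex_unlock(&t->lock);
	return ret;
}

/* Reader path: takes only the entry lock and skips invalid entries. */
static int toy_table_read(struct toy_table *t, int ix, uint8_t out[16])
{
	int ret = -1;

	pthread_mutex_lock(&t->e[ix].lock);
	if (t->e[ix].used && !t->e[ix].invalid) {
		memcpy(out, t->e[ix].gid, 16);
		ret = 0;
	}
	pthread_mutex_unlock(&t->e[ix].lock);
	return ret;
}

/* Trivial hook used for demonstration: always succeeds. */
static int toy_hw_add_ok(unsigned int index, const uint8_t gid[16])
{
	(void)index;
	(void)gid;
	return 0;
}
```

The table mutex keeps the find-then-write sequence atomic with respect to other
writers, while readers only ever take the short per-entry lock and never wait
on the provider hook.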

In IB, updates are done under write_lock_irq(&device->cache.lock), thus
write_gid isn't allowed to sleep and add_gid/del_gid are not called.

When passing net-device into/out-of the GID cache, the device
is always passed held (dev_hold).

The code uses a single work item for updating all RDMA devices,
following a netdev or inet notifier.

The patch moves the cache from being a client (which was incorrect,
as the cache is part of the IB infrastructure) to being explicitly
initialized/freed when a device is registered/removed.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
parent 55aeed06
@@ -9,7 +9,8 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o netlink.o
+				device.o fmr_pool.o cache.o netlink.o \
+				roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
......
This diff is collapsed.
@@ -43,9 +43,58 @@ int ib_device_register_sysfs(struct ib_device *device,
 					u8, struct kobject *));
 void ib_device_unregister_sysfs(struct ib_device *device);
-int ib_cache_setup(void);
+void ib_cache_setup(void);
 void ib_cache_cleanup(void);
 int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
 			    struct ib_qp_attr *qp_attr, int *qp_attr_mask);
+typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
+	      struct net_device *idev, void *cookie);
+typedef int (*roce_netdev_filter)(struct ib_device *device, u8 port,
+	     struct net_device *idev, void *cookie);
+void ib_enum_roce_netdev(struct ib_device *ib_dev,
+			 roce_netdev_filter filter,
+			 void *filter_cookie,
+			 roce_netdev_callback cb,
+			 void *cookie);
+void ib_enum_all_roce_netdevs(roce_netdev_filter filter,
+			      void *filter_cookie,
+			      roce_netdev_callback cb,
+			      void *cookie);
+int ib_cache_gid_find_by_port(struct ib_device *ib_dev,
+			      const union ib_gid *gid,
+			      u8 port, struct net_device *ndev,
+			      u16 *index);
+enum ib_cache_gid_default_mode {
+	IB_CACHE_GID_DEFAULT_MODE_SET,
+	IB_CACHE_GID_DEFAULT_MODE_DELETE
+};
+void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u8 port,
+				  struct net_device *ndev,
+				  enum ib_cache_gid_default_mode mode);
+int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
+		     union ib_gid *gid, struct ib_gid_attr *attr);
+int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
+		     union ib_gid *gid, struct ib_gid_attr *attr);
+int ib_cache_gid_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+				     struct net_device *ndev);
+int roce_gid_mgmt_init(void);
+void roce_gid_mgmt_cleanup(void);
+int roce_rescan_device(struct ib_device *ib_dev);
+int ib_cache_setup_one(struct ib_device *device);
+void ib_cache_cleanup_one(struct ib_device *device);
+void ib_cache_release_one(struct ib_device *device);
 #endif /* _CORE_PRIV_H */
@@ -40,6 +40,8 @@
 #include <linux/mutex.h>
 #include <linux/netdevice.h>
 #include <rdma/rdma_netlink.h>
+#include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
 #include "core_priv.h"
@@ -169,6 +171,7 @@ static void ib_device_release(struct device *device)
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
+	ib_cache_release_one(dev);
 	kfree(dev->port_immutable);
 	kfree(dev);
 }
@@ -342,10 +345,17 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
+	ret = ib_cache_setup_one(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
+		goto out;
+	}
 	ret = ib_device_register_sysfs(device, port_callback);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
 		       device->name);
+		ib_cache_cleanup_one(device);
 		goto out;
 	}
@@ -399,6 +409,7 @@ void ib_unregister_device(struct ib_device *device)
 	mutex_unlock(&device_mutex);
 	ib_device_unregister_sysfs(device);
+	ib_cache_cleanup_one(device);
 	down_write(&lists_rwsem);
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -670,10 +681,79 @@ EXPORT_SYMBOL(ib_query_port);
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid)
 {
+	if (rdma_cap_roce_gid_table(device, port_num))
+		return ib_get_cached_gid(device, port_num, index, gid);
 	return device->query_gid(device, port_num, index, gid);
 }
 EXPORT_SYMBOL(ib_query_gid);
+/**
+ * ib_enum_roce_netdev - enumerate all RoCE ports
+ * @ib_dev : IB device we want to query
+ * @filter: Should we call the callback?
+ * @filter_cookie: Cookie passed to filter
+ * @cb: Callback to call for each found RoCE ports
+ * @cookie: Cookie passed back to the callback
+ *
+ * Enumerates all of the physical RoCE ports of ib_dev
+ * which are related to netdevice and calls callback() on each
+ * device for which filter() function returns non zero.
+ */
+void ib_enum_roce_netdev(struct ib_device *ib_dev,
+			 roce_netdev_filter filter,
+			 void *filter_cookie,
+			 roce_netdev_callback cb,
+			 void *cookie)
+{
+	u8 port;
+	for (port = rdma_start_port(ib_dev); port <= rdma_end_port(ib_dev);
+	     port++)
+		if (rdma_protocol_roce(ib_dev, port)) {
+			struct net_device *idev = NULL;
+			if (ib_dev->get_netdev)
+				idev = ib_dev->get_netdev(ib_dev, port);
+			if (idev &&
+			    idev->reg_state >= NETREG_UNREGISTERED) {
+				dev_put(idev);
+				idev = NULL;
+			}
+			if (filter(ib_dev, port, idev, filter_cookie))
+				cb(ib_dev, port, idev, cookie);
+			if (idev)
+				dev_put(idev);
+		}
+}
+/**
+ * ib_enum_all_roce_netdevs - enumerate all RoCE devices
+ * @filter: Should we call the callback?
+ * @filter_cookie: Cookie passed to filter
+ * @cb: Callback to call for each found RoCE ports
+ * @cookie: Cookie passed back to the callback
+ *
+ * Enumerates all RoCE devices' physical ports which are related
+ * to netdevices and calls callback() on each device for which
+ * filter() function returns non zero.
+ */
+void ib_enum_all_roce_netdevs(roce_netdev_filter filter,
+			      void *filter_cookie,
+			      roce_netdev_callback cb,
+			      void *cookie)
+{
+	struct ib_device *dev;
+	down_read(&lists_rwsem);
+	list_for_each_entry(dev, &device_list, core_list)
+		ib_enum_roce_netdev(dev, filter, filter_cookie, cb, cookie);
+	up_read(&lists_rwsem);
+}
 /**
  * ib_query_pkey - Get P_Key table entry
  * @device:Device to query
@@ -753,6 +833,13 @@ int ib_find_gid(struct ib_device *device, union ib_gid *gid,
 	int ret, port, i;
 	for (port = rdma_start_port(device); port <= rdma_end_port(device); ++port) {
+		if (rdma_cap_roce_gid_table(device, port)) {
+			if (!ib_cache_gid_find_by_port(device, gid, port,
+						       NULL, index)) {
+				*port_num = port;
+				return 0;
+			}
+		}
 		for (i = 0; i < device->port_immutable[port].gid_tbl_len; ++i) {
 			ret = ib_query_gid(device, port, i, &tmp_gid);
 			if (ret)
@@ -874,17 +961,10 @@ static int __init ib_core_init(void)
 		goto err_sysfs;
 	}
-	ret = ib_cache_setup();
-	if (ret) {
-		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
-		goto err_nl;
-	}
+	ib_cache_setup();
 	return 0;
-err_nl:
-	ibnl_cleanup();
 err_sysfs:
 	class_unregister(&ib_class);
......
This diff is collapsed.
@@ -65,6 +65,10 @@ union ib_gid {
 	} global;
 };
+struct ib_gid_attr {
+	struct net_device	*ndev;
+};
 enum rdma_node_type {
 	/* IB values map to NodeInfo:NodeType. */
 	RDMA_NODE_IB_CA		= 1,
@@ -285,7 +289,7 @@ enum ib_port_cap_flags {
 	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
 	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
 	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
-	IB_PORT_IP_BASED_GIDS			= 1 << 26
+	IB_PORT_IP_BASED_GIDS			= 1 << 26,
 };
 enum ib_port_width {
@@ -1487,7 +1491,7 @@ struct ib_cache {
 	rwlock_t                lock;
 	struct ib_event_handler event_handler;
 	struct ib_pkey_cache  **pkey_cache;
-	struct ib_gid_cache   **gid_cache;
+	struct ib_gid_table   **gid_cache;
 	u8                     *lmc_cache;
 };
@@ -1573,9 +1577,47 @@ struct ib_device {
 						 struct ib_port_attr *port_attr);
 	enum rdma_link_layer	   (*get_link_layer)(struct ib_device *device,
 						     u8 port_num);
+	/* When calling get_netdev, the HW vendor's driver should return the
+	 * net device of device @device at port @port_num or NULL if such
+	 * a net device doesn't exist. The vendor driver should call dev_hold
+	 * on this net device. The HW vendor's device driver must guarantee
+	 * that this function returns NULL before the net device reaches
+	 * NETDEV_UNREGISTER_FINAL state.
+	 */
+	struct net_device	  *(*get_netdev)(struct ib_device *device,
+						 u8 port_num);
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
+	/* When calling add_gid, the HW vendor's driver should
+	 * add the gid of device @device at gid index @index of
+	 * port @port_num to be @gid. Meta-info of that gid (for example,
+	 * the network device related to this gid is available
+	 * at @attr. @context allows the HW vendor driver to store extra
+	 * information together with a GID entry. The HW vendor may allocate
+	 * memory to contain this information and store it in @context when a
+	 * new GID entry is written to. Params are consistent until the next
+	 * call of add_gid or delete_gid. The function should return 0 on
+	 * success or error otherwise. The function could be called
+	 * concurrently for different ports. This function is only called
+	 * when roce_gid_table is used.
+	 */
+	int		           (*add_gid)(struct ib_device *device,
+					      u8 port_num,
+					      unsigned int index,
+					      const union ib_gid *gid,
+					      const struct ib_gid_attr *attr,
+					      void **context);
+	/* When calling del_gid, the HW vendor's driver should delete the
+	 * gid of device @device at gid index @index of port @port_num.
+	 * Upon the deletion of a GID entry, the HW vendor must free any
+	 * allocated memory. The caller will clear @context afterwards.
+	 * This function is only called when roce_gid_table is used.
+	 */
+	int		           (*del_gid)(struct ib_device *device,
+					      u8 port_num,
+					      unsigned int index,
+					      void **context);
 	int		           (*query_pkey)(struct ib_device *device,
 						 u8 port_num, u16 index, u16 *pkey);
 	int		           (*modify_device)(struct ib_device *device,
@@ -2108,6 +2150,26 @@ static inline size_t rdma_max_mad_size(const struct ib_device *device, u8 port_n
 	return device->port_immutable[port_num].max_mad_size;
 }
+/**
+ * rdma_cap_roce_gid_table - Check if the port of device uses roce_gid_table
+ * @device: Device to check
+ * @port_num: Port number to check
+ *
+ * RoCE GID table mechanism manages the various GIDs for a device.
+ *
+ * NOTE: if allocating the port's GID table has failed, this call will still
+ * return true, but any RoCE GID table API will fail.
+ *
+ * Return: true if the port uses RoCE GID table mechanism in order to manage
+ * its GIDs.
+ */
+static inline bool rdma_cap_roce_gid_table(const struct ib_device *device,
+					   u8 port_num)
+{
+	return rdma_protocol_roce(device, port_num) &&
+		device->add_gid && device->del_gid;
+}
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid);
......