Commits · 03aa39558e74649b8ad19b2a3988a22bef23a517 · nexedi / linux

19 Feb, 2020 7 commits

Merge branch 'bpf_read_branch_records' · 03aa3955

Alexei Starovoitov authored Feb 19, 2020

Daniel Xu says:

====================
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Changes in v8:
- Use globals instead of perf buffer
- Call test_perf_branches__detach() before destroying skeleton
- Fix typo in docs

Changes in v7:
- Const-ify and static-ify local var
- Documentation formatting

Changes in v6:
- Move #ifdef a little to avoid unused variable warnings on !x86
- Test negative condition in selftest (-EINVAL on improperly configured
  perf event)
- Skip positive condition selftest on setups that don't support branch
  records

Changes in v5:
- Rename bpf_perf_prog_read_branches() -> bpf_read_branch_records()
- Rename BPF_F_GET_BR_SIZE -> BPF_F_GET_BRANCH_RECORDS_SIZE
- Squash tools/ bpf.h sync into selftest commit

Changes in v4:
- Add BPF_F_GET_BR_SIZE flag
- Return -ENOENT on unsupported architectures
- Only accept initialized memory in helper
- Check buffer size is multiple of sizeof(struct perf_branch_entry)
- Use bpf skeleton in selftest
- Add commit messages
- Spelling and formatting

Changes in v3:
- Document filling unused buffer with zero
- Formatting fixes
- Rebase

Changes in v2:
- Change to a bpf helper instead of context access
- Avoid mentioning Intel specific things
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add bpf_read_branch_records() selftest · 67306f84

Daniel Xu authored Feb 17, 2020

Add a selftest to test:

* default bpf_read_branch_records() behavior
* BPF_F_GET_BRANCH_RECORDS_SIZE flag behavior
* error path on non branch record perf events
* using helper to write to stack
* using helper to write to global

On host with hardware counter support:

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED

On host without hardware counter support (VM):

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 1 SKIPPED, 0 FAILED

Also sync tools/include/uapi/linux/bpf.h.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-3-dxu@dxuuu.xyz

bpf: Add bpf_read_branch_records() helper · fff7b643

Daniel Xu authored Feb 17, 2020

Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
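
The return-value rules listed in the series changelog (size query flag,
-EINVAL on a perf event without branch records, buffer size a multiple of
the entry size) can be sketched as a small userspace model. This is only an
illustration under stated assumptions: struct branch_entry, struct
branch_stack, GET_BRANCH_RECORDS_SIZE and read_branch_records() are
hypothetical stand-ins, not the kernel's types or the real helper, and the
-ENOENT path for unsupported architectures is left out.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; not the real layouts. */
struct branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t flags;
};

struct branch_stack {
	uint64_t nr;                    /* number of valid entries */
	struct branch_entry entries[8];
};

/* Models the role of BPF_F_GET_BRANCH_RECORDS_SIZE (value is an assumption). */
#define GET_BRANCH_RECORDS_SIZE (1ULL << 0)

/*
 * Model of the helper's contract: returns the number of bytes copied, the
 * number of bytes required when the size flag is set, or a negative errno.
 */
static long read_branch_records(const struct branch_stack *stack,
				void *buf, uint32_t size, uint64_t flags)
{
	const uint32_t entry_size = sizeof(struct branch_entry);
	uint32_t avail, to_copy;

	if (flags & ~GET_BRANCH_RECORDS_SIZE)
		return -EINVAL;         /* unknown flag bit */
	if (!stack)
		return -EINVAL;         /* event not configured for branch records */

	avail = (uint32_t)(stack->nr * entry_size);
	if (flags & GET_BRANCH_RECORDS_SIZE)
		return avail;           /* caller only asked for the size */
	if (!buf || size % entry_size != 0)
		return -EINVAL;         /* buffer must hold whole entries */

	to_copy = avail < size ? avail : size;
	memcpy(buf, stack->entries, to_copy);
	return to_copy;
}
```

A caller would typically query the size first with the flag set and a NULL
buffer, then pass a buffer whose size is a whole number of entries.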

Merge branch 'bpf-skmsg-simplify-restore' · 2f14b2d9

Daniel Borkmann authored Feb 19, 2020

Jakub Sitnicki says:

====================
This series has been split out from "Extend SOCKMAP to store listening
sockets" [0]. I think it stands on its own, and makes the latter series
smaller, which will make the review easier, hopefully.

The essence is that we don't need to do a complicated dance in
sk_psock_restore_proto, if we agree that the contract with tcp_update_ulp
is to restore callbacks even when the socket doesn't use ULP. This is what
tcp_update_ulp currently does, and we just make use of it.

Series is accompanied by a test for a particularly tricky case of restoring
callbacks when we have both sockmap and tls callbacks configured in
sk->sk_prot.

[0] https://lore.kernel.org/bpf/20200127131057.150941-1-jakub@cloudflare.com/
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Test unhashing kTLS socket after removing from map · d1ba1204

Jakub Sitnicki authored Feb 17, 2020

When a TCP socket gets inserted into a sockmap, its sk_prot callbacks get
replaced with tcp_bpf callbacks built from regular tcp callbacks. If TLS
gets enabled on the same socket, sk_prot callbacks get replaced once again,
this time with kTLS callbacks built from tcp_bpf callbacks.

Now, we allow removing a socket that has kTLS enabled from a sockmap. After
removal, the socket remains configured with kTLS. This is where things
get tricky.

Since the socket has a set of sk_prot callbacks that are a mix of kTLS and
tcp_bpf callbacks, we need to restore just the tcp_bpf callbacks to the
original ones. At the moment, it comes down to the unhash operation.

We had a regression recently because tcp_bpf callbacks were not cleared in
this particular scenario of removing a kTLS socket from a sockmap. It got
fixed in commit 4da6a196 ("bpf: Sockmap/tls, during free we may call
tcp_bpf_unhash() in loop").

Add a test that triggers the regression so that we don't reintroduce it in
the future.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-4-jakub@cloudflare.com

bpf, sk_msg: Don't clear saved sock proto on restore · a178b458

Jakub Sitnicki authored Feb 17, 2020

There is no need to clear psock->sk_proto when restoring socket protocol
callbacks in sk->sk_prot. The psock is about to get detached from the sock
and eventually destroyed. At worst we will restore the protocol callbacks
and the write callback twice.

This makes reasoning about psock state easier. Once psock is initialized,
we can count on psock->sk_proto always being set.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-3-jakub@cloudflare.com

bpf, sk_msg: Let ULP restore sk_proto and write_space callback · a4393861

Jakub Sitnicki authored Feb 17, 2020

We don't need a fallback for when the socket is not using ULP.
tcp_update_ulp handles this case exactly the same as we do in
sk_psock_restore_proto. Get rid of the duplicated code.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-2-jakub@cloudflare.com

18 Feb, 2020 7 commits

bpf: Allow bpf_perf_event_read_value in all BPF programs · b80b033b

Song Liu authored Feb 14, 2020

bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com

net: ena: remove set but not used variable 'hash_key' · b182a667

YueHaibing authored Feb 18, 2020

drivers/net/ethernet/amazon/ena/ena_com.c: In function ena_com_hash_key_allocate:
drivers/net/ethernet/amazon/ena/ena_com.c:1070:50:
 warning: variable hash_key set but not used [-Wunused-but-set-variable]

Commit 6a4f7dc8 ("net: ena: rss: do not allocate key when not supported")
left this variable set but unused, so remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netlink: Replace zero-length array with flexible-array member · 2b738124

Gustavo A. R. Silva authored Feb 17, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
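
As a minimal sketch of how such a type is allocated, the example below
reuses the struct foo shape from the message above. The body of struct boo
and the foo_alloc() helper are assumptions made up for illustration. The
multiplication is not overflow-checked here; kernel code would normally
use the struct_size() helper for that.

```c
#include <stdlib.h>
#include <string.h>

struct boo {
	int value;
};

/* Same shape as the example above: the flexible array must come last. */
struct foo {
	int stuff;
	struct boo array[];
};

/*
 * Allocate a zeroed struct foo with room for n trailing elements.
 * sizeof(*f) covers only the fixed part; the array adds no size of its own.
 */
static struct foo *foo_alloc(size_t n)
{
	struct foo *f = calloc(1, sizeof(*f) + n * sizeof(f->array[0]));

	if (f)
		f->stuff = (int)n;      /* remember the element count */
	return f;
}
```

Because the allocation size is computed the same way as with a zero-length
array, converting the declaration does not change how much memory is
requested.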

net: switchdev: Replace zero-length array with flexible-array member · fbfc8502

Gustavo A. R. Silva authored Feb 17, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, sockmap: Replace zero-length array with flexible-array member · 45a4296b

Gustavo A. R. Silva authored Feb 17, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

NFC: digital: Replace zero-length array with flexible-array member · 9814428a

Gustavo A. R. Silva authored Feb 17, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: cdc-phonet: Replace zero-length array with flexible-array member · dc3cc347

Gustavo A. R. Silva authored Feb 17, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Feb, 2020 26 commits

net: phy: allow bcm84881 to be a module · 725d23b5

Russell King authored Feb 17, 2020

Now that the phylib module loading issue has been resolved, we can
allow this PHY driver to be built as a module.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-smc-next' · 4c082221

David S. Miller authored Feb 17, 2020

Ursula Braun says:

====================
net/smc: patches 2020-02-17

Here are patches for SMC that improve the handling of termination tasks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: reduce port_event scheduling · 5613f20c

Ursula Braun authored Feb 17, 2020

IB event handlers schedule the port event worker for further
processing of port state changes. This patch reduces the number of
schedules to avoid duplicate processing of the same port change.
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: simplify normal link termination · 5f78fe96

Karsten Graul authored Feb 17, 2020

smc_lgr_terminate() and smc_lgr_terminate_sched() both result in soft
link termination; smc_lgr_terminate_sched() schedules a worker for
this task. Take out complexity by always using the termination worker
and getting rid of smc_lgr_terminate() completely.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: remove unused parameter of smc_lgr_terminate() · ba952060

Karsten Graul authored Feb 17, 2020

The soft parameter of smc_lgr_terminate() is not used and obsolete.
Remove it.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: do not delete lgr from list twice · 3739707c

Karsten Graul authored Feb 17, 2020

When 2 callers call smc_lgr_terminate() at the same time
for the same lgr, one gets the lgr_lock and deletes the lgr from the
list and releases the lock. Then the second caller gets the lock and
tries to delete it again.
In smc_lgr_terminate(), add a check whether the link group lgr has already
been deleted from the link group list, and do not try to delete it a
second time.
Also add a check whether the lgr is marked as freeing, which means that a
termination is already pending.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
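
The guard described above can be sketched in userspace C as follows. This
is an illustration only: struct list_node, struct link_group and
link_group_terminate() are hypothetical stand-ins for the kernel's
list_head, the SMC link group and smc_lgr_terminate(), and the locking is
reduced to a comment.

```c
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n)
{
	n->prev = n->next = n;
}

static bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_add_node(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink and re-initialise, so a later list_is_empty(n) returns true. */
static void list_del_init_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical link group: only the fields needed for the guard. */
struct link_group {
	struct list_node list;
	bool freeing;
};

/*
 * Returns true if this caller removed the group from the list. In the
 * kernel the checks and the removal would run with the list lock held.
 */
static bool link_group_terminate(struct link_group *lgr)
{
	if (list_is_empty(&lgr->list) || lgr->freeing)
		return false;   /* already deleted, or termination pending */
	list_del_init_node(&lgr->list);
	return true;
}
```

The second of two racing callers sees an empty list entry once it gets the
lock and returns without touching the list again.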

net/smc: use termination worker under send_lock · 354ea2ba

Karsten Graul authored Feb 17, 2020

smc_tx_rdma_write() is called under the send_lock and should not call
smc_lgr_terminate() directly. Call smc_lgr_terminate_sched() instead
which schedules a worker.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: improve smc_lgr_cleanup() · 55dd5758

Karsten Graul authored Feb 17, 2020

smc_lgr_cleanup() is called during termination processing, so there is no
need to send a DELETE_LINK at that time. A DELETE_LINK should have been
sent before the termination is initiated, if needed.
And remove the extra call to wake_up(&lnk->wr_reg_wait) because
smc_llc_link_inactive() already calls the related helper function
smc_wr_wakeup_reg_wait().
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Reduce-dependency-between-bridge-and-router-code' · 790a9a7c

David S. Miller authored Feb 17, 2020

Ido Schimmel says:

====================
mlxsw: Reduce dependency between bridge and router code

This patch set reduces the dependency between the bridge and the router
code in preparation for RTNL removal from the route insertion path in
mlxsw.

The motivation and solution are explained in detail in patch #3. The
main idea is that we need to stop special-casing the VXLAN devices with
regards to the reference counting of the FIDs. Otherwise, we can bump
into the situation described in patch #3, where the routing code calls
into the bridge code which calls back into the routing code. After
adding a mutex to protect router data structures to remove RTNL
dependency, this can result in an AA deadlock.

Patches #1 and #2 are preparations. They convert the FIDs to use
'refcount_t' for reference counting in order to catch over/under flows
and add extack to the bridge creation function.

Patches #3-#5 reduce the dependency between the bridge and the router
code. First, by having the VXLAN device take a reference on the FID in
patch #3 and then by removing unnecessary code following the change in
patch #3.

Patches #6-#10 adjust existing selftests and add new ones to exercise
the new code paths.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: vxlan: Add test for error path · 495c3da6

Ido Schimmel authored Feb 17, 2020

Test that when two VXLAN tunnels with conflicting configurations (i.e.,
different TTL) are enslaved to the same VLAN-aware bridge, then the
enslavement of a port to the bridge is denied.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: vxlan: Adjust test to recent changes · 58ba0238

Ido Schimmel authored Feb 17, 2020

After recent changes, the VXLAN tunnel will be offloaded regardless of
whether any local ports are members of the FID. Adjust the test to make
sure the tunnel is offloaded in this case.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: extack: Test creation of multiple VLAN-aware bridges · 6c4e61ff

Ido Schimmel authored Feb 17, 2020

The driver supports a single VLAN-aware bridge. Test that the
enslavement of a port to the second VLAN-aware bridge fails with an
extack.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: extack: Test bridge creation with VXLAN · bdc58bea

Ido Schimmel authored Feb 17, 2020

Test that creation of a bridge (both VLAN-aware and VLAN-unaware) fails
with an extack when a VXLAN device with an unsupported configuration is
already enslaved to it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Remove deprecated test · 745a7ea7

Ido Schimmel authored Feb 17, 2020

The addition of a VLAN on a bridge slave prompts the driver to have the
local port in question join the FID corresponding to this VLAN.

Before recent changes, the operation of joining the FID would also mean
that the driver would enable VXLAN tunneling if a VXLAN device was also
a member of the VLAN. In case the configuration of the VXLAN tunnel was
not supported, an extack error would be returned.

Since the operation of joining the FID no longer means that VXLAN
tunneling is potentially enabled, the test is no longer relevant. Remove
it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Reduce dependency between bridge and router code · da1f9f8c

Ido Schimmel authored Feb 17, 2020

Commit f40be47a ("mlxsw: spectrum_router: Do not force specific
configuration order") added a call from the routing code to the bridge
code in order to handle the case where VNI should be set on a FID
following the joining of the router port to the FID.

This is no longer required, as previous patches made VXLAN devices
explicitly take a reference on the FID and set VNI on it.

Therefore, remove the unnecessary call and simply have the RIF take a
reference on the FID without checking if VNI should also be set on it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Remove VXLAN checks during FID membership · 578e5512

Ido Schimmel authored Feb 17, 2020

As explained in previous patch, VXLAN devices now take a reference on
the FID and not only local ports. Therefore, there is no need for local
ports to check if they need to set a VNI on the FID when they join the
FID.

Remove these unnecessary checks.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Have VXLAN device take reference on FID · 71afb45a

Ido Schimmel authored Feb 17, 2020

Up until now only local ports and the router port (which is also a local
port) took a reference on the corresponding FID (Filtering Identifier)
when joining a bridge. For example:

        192.0.2.1/24
            br0
             |
      +------+------+
      |             |
     swp1        vxlan0

In this case the reference count of the FID will be '2'. Since the VXLAN
device does not take a reference on the FID, whenever a local port joins
the bridge it needs to check if a VXLAN device is already enslaved. If
the VXLAN device should be mapped to the FID in question, then the VXLAN
device's VNI is set on the FID.

Besides the fact that this scheme special-cases the VXLAN device, it also
creates an unnecessary dependency between the routing and bridge code:

1. [R] IP address is added on 'br0', which prompts the creation of a RIF
   and a backing FID
2. [B] VNI is enabled on backing FID
3. [R] Host route corresponding to VXLAN device's source address is
   promoted to perform NVE decapsulation

[R] - Routing code
[B] - Bridge code

This back and forth dependency will become problematic when a lock is
added in the routing code instead of relying on RTNL, as it will result
in an AA deadlock.

Instead, have the VXLAN device take a reference on the FID just like all
the other netdev members of the bridge. In order to correctly handle the
case where VXLAN devices are already enslaved to the bridge when it is
offloaded, walk the bridge's slaves and replay the configuration.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_switchdev: Propagate extack to bridge creation function · 23a1a0b3

Ido Schimmel authored Feb 17, 2020

Propagate extack to bridge creation function so that error messages
could be passed to user space via netlink instead of printing them to
kernel log.

A subsequent patch will pass the new extack argument to more functions.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_fid: Use 'refcount_t' for FID reference counting · b96f5469

Ido Schimmel authored Feb 17, 2020

'refcount_t' is very useful for catching over/under flows. Convert the
FID (Filtering Identifier) objects to use it instead of 'unsigned int'
for their reference count.

A subsequent patch in the series will change the way VXLAN devices hold
/ release the FID reference, which is why the conversion is made now.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
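
To illustrate why a checked counter is preferable to a bare 'unsigned int',
here is a small userspace model. It is a sketch only: struct checked_ref
and its functions are hypothetical and are not the kernel's refcount_t
implementation. The idea modelled is that misuse (taking a reference on a
zero count, overflowing, or dropping below zero) is detected and the
counter is pinned instead of silently wrapping.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical checked reference counter; not the kernel's refcount_t. */
struct checked_ref {
	unsigned int count;
	bool saturated;         /* set once misuse has been detected */
};

static void ref_init(struct checked_ref *r, unsigned int n)
{
	r->count = n;
	r->saturated = false;
}

/* Take a reference; a zero count or an overflow is flagged, not wrapped. */
static void ref_inc(struct checked_ref *r)
{
	if (r->saturated)
		return;
	if (r->count == 0 || r->count == UINT_MAX) {
		r->saturated = true;
		return;
	}
	r->count++;
}

/* Drop a reference; returns true only when the last reference is gone. */
static bool ref_dec_and_test(struct checked_ref *r)
{
	if (r->saturated)
		return false;
	if (r->count == 0) {
		r->saturated = true;    /* underflow: more puts than gets */
		return false;
	}
	return --r->count == 0;
}
```

With a plain 'unsigned int', the extra put in the underflow case would wrap
the count to UINT_MAX and go unnoticed.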

net: bridge: teach ndo_dflt_bridge_getlink() more brport flags · 583cb0b4

Julian Wiedmann authored Feb 17, 2020

This enables ndo_dflt_bridge_getlink() to report a bridge port's
offload settings for multicast and broadcast flooding.

CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sfc-couple-more-ARFS-tidy-ups' · 5f1475b1

David S. Miller authored Feb 17, 2020

Edward Cree says:

====================
couple more ARFS tidy-ups

Tie up some loose ends from the recent ARFS work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: move some ARFS code out of headers · 025c5a0b

Edward Cree authored Feb 17, 2020

efx_filter_rfs_expire() is a work-function, so it being inline makes no
sense.  It's only ever used in efx_channels.c, so move it there.
While we're at it, clean out some related unused cruft.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: only schedule asynchronous filter work if needed · b7683155

Edward Cree authored Feb 17, 2020

Prevent excessive CPU time spent running a workitem with nothing to do.

We avoid any races by keeping the same check in efx_filter_rfs_expire().
Suggested-by: Martin Habets <mhabets@solarflare.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vlan: suppress "failed to kill vid" warnings · bd706ff8

Julian Wiedmann authored Feb 17, 2020

When a real dev unregisters, vlan_device_event() also unregisters all
of its vlan interfaces. For each VID this ends up in __vlan_vid_del(),
which attempts to remove the VID from the real dev's VLAN filter.

But the unregistering real dev might no longer be able to issue the
required IOs, and return an error. Subsequently we raise a noisy warning
msg that is not appropriate for this situation: the real dev is being
torn down anyway, there shouldn't be any worry about cleanly releasing
all of its HW-internal resources.

So to avoid scaring innocent users, suppress this warning when the
failed deletion happens on an unregistering device.
While at it also convert the raw pr_warn() to a more fitting
netdev_warn().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Get rid of custom STMMAC_DEVICE() macro · 3e07df43

Andy Shevchenko authored Feb 17, 2020

Since the PCI core provides a generic PCI_DEVICE_DATA() macro,
replace STMMAC_DEVICE() with the former.

No functional change intended.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Remove-rtnl-lock-dependency-from-flow_action-infra' · b5d30812

David S. Miller authored Feb 17, 2020

Vlad Buslov says:

====================
Remove rtnl lock dependency from flow_action infra

Currently, TC flow_action infrastructure code obtains rtnl lock before
accessing action state in tc_setup_flow_action() function and releases
it afterwards. This behavior is not supposed to impact TC filter
insertion rate because filling flow_action representation is only a
small part of creating new filter and expensive operations (hardware
offload callbacks, classifiers, cls API code that creates chains and
classifiers instances) already support unlocked execution. However,
typical vswitch implementation might need to also dump TC filters
concurrently, for example to age out unused flows or update flow
counters. TC dump is fully serialized and holds rtnl lock during its
whole execution in kernel space. As such, it can significantly impact
concurrent tasks that try to intermittently obtain rtnl lock when
filling intermediate representation for new filter offload (performance
evaluation at the end of this mail).

Refactor flow_action cls API infrastructure and its dependencies to not
rely on rtnl lock for synchronization. Patch set overview:

- Refactor tc_setup_flow_action() to obtain action tcf_lock when
  accessing action state. Fix its dependencies to not obtain tcf_lock
  themselves and assume that caller already holds it (needs to be done
  in same patch to prevent deadlock) and not to call sleeping functions
  (needs to be done in same patch to prevent "sleeping while atomic"
  dmesg warnings).

- Refactor action helper functions to require tcf_lock instead of rtnl.
  Internally, all of the actions already use tcf_lock for
  synchronization to accommodate unlocked classifier API, so this change
  relies on already existing functionality.

- Remove rtnl lock and "rtnl_held" argument from tc_setup_flow_action()
  function.

To test the change, multiple concurrent TC instances are invoked with
following command:

time ls add* | xargs -n 1 -P 100 sudo tc -b

Ten batch files with following typical rules (100k each) are used:

filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
	src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip 192.168.111.1
	dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port 1 action
	tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port 4789
	no_percpu action mirred egress redirect dev vxlan1 no_percpu

TC dump of same device is called in infinite loop from five concurrent
instances:

while true; do tc -s filter show dev $NIC ingress >/dev/null; done

Results obtained on current net-next commit 9f68e365 ("Merge tag
'drm-next-2020-01-30' of git://anongit.freedesktop.org/drm/drm"):

               | net-next | this change
---------------+----------+-------------
 TC add        | 6.3s     | 6.3s
 TC add + dump | 29.3s    | 6.8s

Test results confirm significant impact of concurrent TC dump. The
impact is almost fully mitigated by proposed change (differences can be
attributed to contention for chain and tp locks between add and dump TC
instances).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>