Commits · 43da14110cb4d20de0b4b097da88addefeab5f13 · Kirill Smelkov / linux

21 Nov, 2019 19 commits

net: Fix Kconfig indentation, continued · 43da1411

Krzysztof Kozlowski authored Nov 21, 2019

Adjust indentation from spaces to tab (+optional two spaces) as in
coding style.  This fixes various indentation mixups (seven spaces,
tab+one space, etc).
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

43da1411

drivers: net: Fix Kconfig indentation, continued · 5421cf84

Krzysztof Kozlowski authored Nov 21, 2019

Adjust indentation from spaces to tab (+optional two spaces) as in
coding style.  This fixes various indentation mixups (seven spaces,
tab+one space, etc).
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5421cf84

lwtunnel: check erspan options before allocating tun_info · 1841b982

Xin Long authored Nov 21, 2019

As Jakub suggested on another patch, it's better to do the check
on erspan options before allocating memory.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1841b982

lwtunnel: be STRICT to validate the new LWTUNNEL_IP(6)_OPTS · 7b6a70f7

Xin Long authored Nov 21, 2019

LWTUNNEL_IP(6)_OPTS are the new items in ip(6)_tun_policy, which
are parsed by nla_parse_nested_deprecated(). We should check it
strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS.

This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy.

Fixes: 4ece4778 ("lwtunnel: add options setting and dumping for geneve")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7b6a70f7

net: remove the unnecessary strict_start_type in some policies · f3bed7f8

Xin Long authored Nov 21, 2019

ct_policy and mpls_policy are parsed with nla_parse_nested(), which
does NL_VALIDATE_STRICT validation, strict_start_type is not needed
to set as it is actually trying to make some attributes parsed with
NL_VALIDATE_STRICT.

This patch is to remove it, and do the same on rtm_nh_policy which
is parsed by nlmsg_parse().
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3bed7f8

Merge branch 'net-sched-support-vxlan-and-erspan-options' · ff998a80

David S. Miller authored Nov 21, 2019

Xin Long says:

====================
net: sched: support vxlan and erspan options

This patchset is to add vxlan and erspan options support in
cls_flower and act_tunnel_key. The form is pretty much like
geneve_opts in:

  https://patchwork.ozlabs.org/patch/935272/
  https://patchwork.ozlabs.org/patch/954564/

but only one option is allowed for vxlan and erspan.

v1->v2:
  - see each patch changelog.
====================
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ff998a80

net: sched: allow flower to match erspan options · 79b1011c

Xin Long authored Nov 21, 2019

This patch is to allow matching options in erspan.

The options can be described in the form:
VER:INDEX:DIR:HWID/VER:INDEX_MASK:DIR_MASK:HWID_MASK.
When ver is set to 1, index will be applied while dir
and hwid will be ignored, and when ver is set to 2,
dir and hwid will be used while index will be ignored.

Different from geneve, only one option can be set. And
also, geneve options, vxlan options or erspan options
can't be set at the same time.

  # ip link add name erspan1 type erspan external
  # tc qdisc add dev erspan1 ingress
  # tc filter add dev erspan1 protocol ip parent ffff: \
      flower \
        enc_src_ip 10.0.99.192 \
        enc_dst_ip 10.0.99.193 \
        enc_key_id 11 \
        erspan_opts 1:12:0:0/1:ffff:0:0 \
        ip_proto udp \
        action mirred egress redirect dev eth0

v1->v2:
  - improve some err msgs of extack.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

79b1011c

net: sched: allow flower to match vxlan options · d8f9dfae

Xin Long authored Nov 21, 2019

This patch is to allow matching gbp option in vxlan.

The options can be described in the form GBP/GBP_MASK,
where GBP is represented as a 32bit hexadecimal value.
Different from geneve, only one option can be set. And
also, geneve options and vxlan options can't be set at
the same time.

  # ip link add name vxlan0 type vxlan dstport 0 external
  # tc qdisc add dev vxlan0 ingress
  # tc filter add dev vxlan0 protocol ip parent ffff: \
      flower \
        enc_src_ip 10.0.99.192 \
        enc_dst_ip 10.0.99.193 \
        enc_key_id 11 \
        vxlan_opts 01020304/ffffffff \
        ip_proto udp \
        action mirred egress redirect dev eth0

v1->v2:
  - add .strict_start_type for enc_opts_policy as Jakub noticed.
  - use Duplicate instead of Wrong in err msg for extack as Jakub
    suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d8f9dfae

net: sched: add erspan option support to act_tunnel_key · e20d4ff2

Xin Long authored Nov 21, 2019

This patch is to allow setting erspan options using the
act_tunnel_key action. Different from geneve options,
only one option can be set. And also, geneve options,
vxlan options or erspan options can't be set at the
same time.

Options are expressed as ver:index:dir:hwid, when ver
is set to 1, index will be applied while dir and hwid
will be ignored, and when ver is set to 2, dir and
hwid will be used while index will be ignored.

  # ip link add name erspan1 type erspan external
  # tc qdisc add dev eth0 ingress
  # tc filter add dev eth0 protocol ip parent ffff: \
           flower indev eth0 \
              ip_proto udp \
              action tunnel_key \
                  set src_ip 10.0.99.192 \
                  dst_ip 10.0.99.193 \
                  dst_port 6081 \
                  id 11 \
  		erspan_opts 1:2:0:0 \
          action mirred egress redirect dev erspan1

v1->v2:
  - do the validation when dst is not yet allocated as Jakub suggested.
  - use Duplicate instead of Wrong in err msg for extack.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e20d4ff2

net: sched: add vxlan option support to act_tunnel_key · fca3f91c

Xin Long authored Nov 21, 2019

This patch is to allow setting vxlan options using the
act_tunnel_key action. Different from geneve options,
only one option can be set. And also, geneve options
and vxlan options can't be set at the same time.

gbp is the only param for vxlan options:

  # ip link add name vxlan0 type vxlan dstport 0 external
  # tc qdisc add dev eth0 ingress
  # tc filter add dev eth0 protocol ip parent ffff: \
           flower indev eth0 \
              ip_proto udp \
              action tunnel_key \
                  set src_ip 10.0.99.192 \
                  dst_ip 10.0.99.193 \
                  dst_port 6081 \
                  id 11 \
  		  vxlan_opts 01020304 \
          action mirred egress redirect dev vxlan0

v1->v2:
  - add .strict_start_type for enc_opts_policy as Jakub noticed.
  - use Duplicate instead of Wrong in err msg for extack as Jakub
    suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fca3f91c

octeontx2-af: Fix uninitialized variable in debugfs · 0617aa98

Dan Carpenter authored Nov 21, 2019

If rvu_get_blkaddr() fails, then this rvu_cgx_nix_cuml_stats() returns
zero and we write some uninitialized data into the debugfs output.

On the error paths, the use of the uninitialized "*stat" is harmless,
but it will lead to a Smatch warning (static analysis) and a UBSan
warning (runtime analysis) so we should prevent that as well.

Fixes: f967488d ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0617aa98

vsock: avoid to assign transport if its initialization fails · 039fccca

Stefano Garzarella authored Nov 21, 2019

If transport->init() fails, we can't assign the transport to the
socket, because it's not initialized correctly, and any future
calls to the transport callbacks would have an unexpected behavior.

Fixes: c0cfa2d8 ("vsock: add multi-transports support")
Reported-and-tested-by: syzbot+e2e5c07bf353b2f79daa@syzkaller.appspotmail.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

039fccca

net: sfp: soft status and control support · f3c9a666

Russell King authored Nov 20, 2019

Add support for the soft status and control register, which allows
TX_FAULT and RX_LOS to be monitored and TX_DISABLE to be set.  We
make use of this when the board does not support GPIOs for these
signals.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3c9a666

Merge branch 'sfp-quirks' · 9ce33351

David S. Miller authored Nov 20, 2019

Russell King says:

====================
Add rudimentary SFP module quirk support

The SFP module EEPROM describes the capabilities of the module, but
doesn't describe the host interface.  We have a certain amount of
guess-work to work out how to configure the host - which works most
of the time.

However, there are some (such as GPON) modules which are able to
support different host interfaces, such as 1000BASE-X and 2500BASE-X.
The module will switch between each mode until it achieves link with
the host.

There is no defined way to describe this in the SFP EEPROM, so we can
only recognise the module and handle it appropriately.  This series
adds the necessary recognition of the modules using a quirk system,
and tweaks the support mask to allow them to link with the host at
2500BASE-X, thereby allowing the user to achieve full line rate.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9ce33351

net: sfp: add some quirks for GPON modules · b0eae33b

Russell King authored Nov 20, 2019

Marc Micalizzi reports that Huawei MA5671A and Alcatel/Lucent G-010S-P
modules are capable of 2500base-X, but incorrectly report their
capabilities in the EEPROM.  It seems rather common that GPON modules
mis-report.

Let's fix these modules by adding some quirks.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0eae33b

net: sfp: add support for module quirks · b34bb2cb

Russell King authored Nov 20, 2019

Add support for applying module quirks to the list of supported
ethtool link modes.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b34bb2cb

tcp: warn if offset reach the maxlen limit when using snprintf · 9bb59a21

Hangbin Liu authored Nov 20, 2019

snprintf returns the number of chars that would be written, not number
of chars that were actually written. As such, 'offs' may get larger than
'tbl.maxlen', causing the 'tbl.maxlen - offs' being < 0, and since the
parameter is size_t, it would overflow.

Since using scnprintf may hide the limit error, while the buffer is still
enough now, let's just add a WARN_ON_ONCE in case it reach the limit
in future.

v2: Use WARN_ON_ONCE as Jiri and Eric suggested.
Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bb59a21

ip_gre: Make none-tun-dst gre tunnel store tunnel info as metadat_dst in recv · c0d59da7

wenxu authored Nov 20, 2019

Currently collect_md gre tunnel will store the tunnel info(metadata_dst)
to skb_dst.
And now the non-tun-dst gre tunnel already can add tunnel header through
lwtunnel.

When received a arp_request on the non-tun-dst gre tunnel. The packet of
arp response will send through the non-tun-dst tunnel without tunnel info
which will lead the arp response packet to be dropped.

If the non-tun-dst gre tunnel also store the tunnel info as metadata_dst,
The arp response packet will set the releted tunnel info in the
iptunnel_metadata_reply.

The following is the test script:

ip netns add cl
ip l add dev vethc type veth peer name eth0 netns cl

ifconfig vethc 172.168.0.7/24 up
ip l add dev tun1000 type gretap key 1000

ip link add user1000 type vrf table 1
ip l set user1000 up
ip l set dev tun1000 master user1000
ifconfig tun1000 10.0.1.1/24 up

ip netns exec cl ifconfig eth0 172.168.0.17/24 up
ip netns exec cl ip l add dev tun type gretap local 172.168.0.17 remote 172.168.0.7 key 1000
ip netns exec cl ifconfig tun 10.0.1.7/24 up
ip r r 10.0.1.7 encap ip id 1000 dst 172.168.0.17 key dev tun1000 table 1

With this patch
ip netns exec cl ping 10.0.1.1 can success
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0d59da7

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · ee5a489f

David S. Miller authored Nov 20, 2019

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-11-20

The following pull-request contains BPF updates for your *net-next* tree.

We've added 81 non-merge commits during the last 17 day(s) which contain
a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).

There are 3 trivial conflicts, resolve it by always taking the chunk from
196e8ca7:

<<<<<<< HEAD
=======
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
>>>>>>> 196e8ca7

<<<<<<< HEAD
void *bpf_map_area_alloc(u64 size, int numa_node)
=======
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
>>>>>>> 196e8ca7

<<<<<<< HEAD
        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
=======
        /* kmalloc()'ed memory can't be mmap()'ed */
        if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>>>>> 196e8ca7

The main changes are:

1) Addition of BPF trampoline which works as a bridge between kernel functions,
   BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
   BPF programs for tracing with practically zero overhead to call into BPF (as
   opposed to k[ret]probes) and ii) attachment of the former to networking related
   programs to see input/output of networking programs (covering xdpdump use case),
   from Alexei Starovoitov.

2) BPF array map mmap support and use in libbpf for global data maps; also a big
   batch of libbpf improvements, among others, support for reading bitfields in a
   relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.

3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
   the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.

4) Add BPF audit support and emit messages upon successful prog load and unload in
   order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.

5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
   (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.

6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
   call named bpf_get_link_xdp_info() for retrieving the full set of prog
   IDs attached to XDP, from Toke Høiland-Jørgensen.

7) Add BTF support for array of int, array of struct and multidimensional arrays
   and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.

8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.

9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
   xdping to be run as standalone, from Jiri Benc.

10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.

11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.

12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
    samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ee5a489f

20 Nov, 2019 21 commits

bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 size · 196e8ca7

Daniel Borkmann authored Nov 20, 2019

Given we recently extended the original bpf_map_area_alloc() helper in
commit fc970227 ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"),
we need to apply the same logic as in ff1c08e1 ("bpf: Change size
to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts,
extend it for bpf-next.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

196e8ca7

bpf: Emit audit messages upon successful prog load and unload · 91e6015b

Daniel Borkmann authored Nov 20, 2019

Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.

The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.

Raw example output:

  # auditctl -D
  # auditctl -a always,exit -F arch=x86_64 -S bpf
  # ausearch --start recent -m 1334
  [...]
  ----
  time->Wed Nov 20 12:45:51 2019
  type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier"
  type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD
  ----
  time->Wed Nov 20 12:45:51 2019
type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD
  ----
  [...]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org

91e6015b

Merge branch 'r8169-smaller-improvements-to-firmware-handling' · e2193c93

David S. Miller authored Nov 20, 2019

Heiner Kallweit says:

====================
r8169: smaller improvements to firmware handling

This series includes few smaller improvements to firmware handling.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e2193c93

r8169: add check for PHY_MDIO_CHG to rtl_nic_fw_data_ok · df0120f1

Heiner Kallweit authored Nov 20, 2019

Only values 0 and 1 are currently defined as parameters for
PHY_MDIO_CHG. Instead of silently ignoring unknown values and
misinterpreting the firmware code let's explicitly check.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df0120f1

r8169: use macro FIELD_SIZEOF in definition of FW_OPCODE_SIZE · cfccde80

Heiner Kallweit authored Nov 20, 2019

Using macro FIELD_SIZEOF makes this define easier understandable.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cfccde80

r8169: change mdelay to msleep in rtl_fw_write_firmware · e20c43db

Heiner Kallweit authored Nov 20, 2019

We're not in atomic context here, therefore switch to msleep.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e20c43db

net: ipconfig: Wait for deferred device probes · e2ffe3ff

Thomas Bogendoerfer authored Nov 20, 2019

If network device drives are using deferred probing, it was possible
that waiting for devices to show up in ipconfig was already over,
when the device eventually showed up. By calling wait_for_device_probe()
we now make sure deferred probing is done before checking for available
devices.
Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2ffe3ff

vsock/vmci: make vmci_vsock_cb_host_called static · 2be8ca97

Mao Wenan authored Nov 20, 2019

When using make C=2 drivers/misc/vmw_vmci/vmci_driver.o
to compile, below warning can be seen:
drivers/misc/vmw_vmci/vmci_driver.c:33:6: warning:
symbol 'vmci_vsock_cb_host_called' was not declared. Should it be static?

This patch make symbol vmci_vsock_cb_host_called static.

Fixes: b1bba80a ("vsock/vmci: register vmci_transport only when VMCI guest/host are active")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2be8ca97

Merge branch 'page_pool-DMA-sync' · e07e7541

David S. Miller authored Nov 20, 2019

Lorenzo Bianconi says:

====================
add DMA-sync-for-device capability to page_pool API

Introduce the possibility to sync DMA memory for device in the page_pool API.
This feature allows to sync proper DMA size and not always full buffer
(dma_sync_single_for_device can be very costly).
Please note DMA-sync-for-CPU is still device driver responsibility.
Relying on page_pool DMA sync mvneta driver improves XDP_DROP pps of
about 170Kpps:

- XDP_DROP DMA sync managed by mvneta driver:	~420Kpps
- XDP_DROP DMA sync managed by page_pool API:	~585Kpps

Do not change naming convention for the moment since the changes will hit other
drivers as well. I will address it in another series.

Changes since v4:
- do not allow the driver to set max_len to 0
- convert PP_FLAG_DMA_MAP/PP_FLAG_DMA_SYNC_DEV to BIT() macro

Changes since v3:
- move dma_sync_for_device before putting the page in ptr_ring in
  __page_pool_recycle_into_ring since ptr_ring can be consumed
  concurrently. Simplify the code moving dma_sync_for_device
  before running __page_pool_recycle_direct/__page_pool_recycle_into_ring

Changes since v2:
- rely on PP_FLAG_DMA_SYNC_DEV flag instead of dma_sync

Changes since v1:
- rename sync in dma_sync
- set dma_sync_size to 0xFFFFFFFF in page_pool_recycle_direct and
  page_pool_put_page routines
- Improve documentation
====================
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e07e7541

net: mvneta: get rid of huge dma sync in mvneta_rx_refill · 07e13edb

Lorenzo Bianconi authored Nov 20, 2019

Get rid of costly dma_sync_single_for_device in mvneta_rx_refill
since now the driver can let page_pool API to manage needed DMA
sync with a proper size.

- XDP_DROP DMA sync managed by mvneta driver:	~420Kpps
- XDP_DROP DMA sync managed by page_pool API:	~585Kpps
Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

07e13edb

net: page_pool: add the possibility to sync DMA memory for device · e68bc756

Lorenzo Bianconi authored Nov 20, 2019

Introduce the following parameters in order to add the possibility to sync
DMA memory for device before putting allocated pages in the page_pool
caches:
- PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
  the driver gets from page_pool will be DMA-synced-for-device according
  to the length provided by the device driver. Please note DMA-sync-for-CPU
  is still device driver responsibility
- offset: DMA address offset where the DMA engine starts copying rx data
- max_len: maximum DMA memory size page_pool is allowed to flush. This
  is currently used in __page_pool_alloc_pages_slow routine when pages
  are allocated from page allocator
These parameters are supposed to be set by device drivers.

This optimization reduces the length of the DMA-sync-for-device.
The optimization is valid because pages are initially
DMA-synced-for-device as defined via max_len. At RX time, the driver
will perform a DMA-sync-for-CPU on the memory for the packet length.
What is important is the memory occupied by packet payload, because
this is the area CPU is allowed to read and modify. As we don't track
cache-lines written into by the CPU, simply use the packet payload length
as dma_sync_size at page_pool recycle time. This also take into account
any tail-extend.
Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e68bc756

net: mvneta: rely on page_pool_recycle_direct in mvneta_run_xdp · f383b295

Lorenzo Bianconi authored Nov 20, 2019

Rely on page_pool_recycle_direct and not on xdp_return_buff in
mvneta_run_xdp. This is a preliminary patch to limit the dma sync len
to the one strictly necessary
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f383b295

net: sched: pie: enable timestamp based delay calculation · cec2975f

Gautam Ramakrishnan authored Nov 20, 2019

RFC 8033 suggests an alternative approach to calculate the queue
delay in PIE by using a timestamp on every enqueued packet. This
patch adds an implementation of that approach and sets it as the
default method to calculate queue delay. The previous method (based
on Little's law) to calculate queue delay is set as optional.
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cec2975f

isdn: Fix Kconfig indentation · f01b437d

Krzysztof Kozlowski authored Nov 20, 2019

Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f01b437d

nfc: Fix Kconfig indentation · 041ccdb6

Krzysztof Kozlowski authored Nov 20, 2019

Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

041ccdb6

Merge branch 'cxgb4-add-TC-MATCHALL-classifier-offload' · 07def463

David S. Miller authored Nov 20, 2019

Rahul Lakkireddy says:

====================
cxgb4: add TC-MATCHALL classifier offload

This series of patches add support to offload TC-MATCHALL classifier
to hardware to classify all outgoing and incoming traffic on the
underlying port. Only 1 egress and 1 ingress rule each can be
offloaded on the underlying port.

Patch 1 adds support for TC-MATCHALL classifier offload on the egress
side. TC-POLICE is the only action that can be offloaded on the egress
side and is used to rate limit all outgoing traffic to specified max
rate.

Patch 2 adds logic to reject the current rule offload if its priority
conflicts with existing rules in the TCAM.

Patch 3 adds support for TC-MATCHALL classifier offload on the ingress
side. The same set of actions supported by existing TC-FLOWER
classifier offload can be applied on all the incoming traffic.

v5:
- Fixed commit message and comment to include comparison for equal
  priority in patch 2.

v4:
- Removed check in patch 1 to reject police offload if prio is not 1.
- Moved TC_SETUP_BLOCK code to separate function in patch 1.
- Added logic to ensure the prio passed by TC doesn't conflict with
  other rules in TCAM in patch 2.
- Higher index has lower priority than lower index in TCAM. So, rework
  cxgb4_get_free_ftid() to search free index from end of TCAM in
  descending order in patch 2.
- Added check to ensure the matchall rule's prio doesn't conflict with
  other rules in TCAM in patch 3.
- Added logic to fill default mask for VIID, if none has been
  provided, to prevent conflict with duplicate VIID rules in patch 3.
- Used existing variables in private structure to fill VIID info,
  instead of extracting the info manually in patch 3.

v3:
- Added check in patch 1 to reject police offload if prio is not 1.
- Assign block_shared variable only for TC_SETUP_BLOCK in patch 1.

v2:
- Added check to reject flow block sharing for policers in patch 1.
- Removed logic to fetch free index from end of TCAM in patch 2.
  Must maintain the same ordering as in kernel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

07def463

cxgb4: add TC-MATCHALL classifier ingress offload · 21c4c60b

Rahul Lakkireddy authored Nov 20, 2019

Add TC-MATCHALL classifier ingress offload support. The same actions
supported by existing TC-FLOWER offload can be applied to all incoming
traffic on the underlying interface.

Ensure the rule priority doesn't conflict with existing rules in the
TCAM. Only 1 ingress matchall rule can be active at a time on the
underlying interface.

v5:
- No change.

v4:
- Added check to ensure the matchall rule's prio doesn't conflict with
  other rules in TCAM.
- Added logic to fill default mask for VIID, if none has been
  provided, to prevent conflict with duplicate VIID rules.
- Used existing variables in private structure to fill VIID info,
  instead of extracting the info manually.

v3:
- No change.

v2:
- Removed logic to fetch free index from end of TCAM. Must maintain
  same ordering as in kernel.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21c4c60b

cxgb4: check rule prio conflicts before offload · 41ec03e5

Rahul Lakkireddy authored Nov 20, 2019

Only offload rule if it satisfies both of the following conditions:
1. The immediate previous rule has priority <= current rule's priority.
2. The immediate next rule has priority >= current rule's priority.

Also rework free entry fetch logic to search from end of TCAM, instead
of beginning, because higher indices have lower priority than lower
indices. This is similar to how TC auto generates priority values.

v5:
- Fixed commit message and comment to include comparison for equal
  priority.

v4:
- Patch added in this version.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

41ec03e5

cxgb4: add TC-MATCHALL classifier egress offload · 4ec4762d

Rahul Lakkireddy authored Nov 20, 2019

Add TC-MATCHALL classifier offload with TC-POLICE action applied for
all outgoing traffic on the underlying interface. Split flow block
offload to support both egress and ingress classification.

For example, to rate limit all outgoing traffic to 1 Gbps:

$ tc qdisc add dev enp2s0f4 clsact
$ tc filter add dev enp2s0f4 egress matchall skip_sw \
	action police rate 1Gbit burst 8Kbit

Note that skip_sw is important. Otherwise, both stack and hardware
will end up doing policing. Policing can't be shared across flow
blocks. Only 1 egress matchall rule can be active at a time on the
underlying interface.

v5:
- No change.

v4:
- Removed check to reject police offload if prio is not 1.
- Moved TC_SETUP_BLOCK code to separate function.

v3:
- Added check to reject police offload if prio is not 1.
- Assign block_shared variable only for TC_SETUP_BLOCK.

v2:
- Added check to reject flow block sharing for policers.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ec4762d

Merge branch 'page_pool-API-for-numa-node-change-handling' · 77c05d2f

David S. Miller authored Nov 20, 2019

Saeed Mahameed says:

====================
page_pool: API for numa node change handling

This series extends page pool API to allow page pool consumers to update
page pool numa node on the fly. This is required since on some systems,
rx rings irqs can migrate between numa nodes, due to irq balancer or user
defined scripts, current page pool has no way to know of such migration
and will keep allocating and holding on to pages from a wrong numa node,
which is bad for the consumer performance.

1) Add API to update numa node id of the page pool
Consumers will call this API to update the page pool numa node id.

2) Don't recycle non-reusable pages:
Page pool will check upon page return whether a page is suitable for
recycling or not.
 2.1) when it belongs to a different num node.
 2.2) when it was allocated under memory pressure.

3) mlx5 will use the new API to update page pool numa id on demand.

The series is a joint work between me and Jonathan, we tested it and it
proved itself worthy to avoid page allocator bottlenecks and improve
packet rate and cpu utilization significantly for the described
scenarios above.

Performance testing:
XDP drop/tx rate and TCP single/multi stream, on mlx5 driver
while migrating rx ring irq from close to far numa:

mlx5 internal page cache was locally disabled to get pure page pool
results.

CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)

XDP Drop/TX single core:
NUMA  | XDP  | Before    | After
---------------------------------------
Close | Drop | 11   Mpps | 10.9 Mpps
Far   | Drop | 4.4  Mpps | 5.8  Mpps

Close | TX   | 6.5 Mpps  | 6.5 Mpps
Far   | TX   | 3.5 Mpps  | 4   Mpps

Improvement is about 30% drop packet rate, 15% tx packet rate for numa
far test.
No degradation for numa close tests.

TCP single/multi cpu/stream:
NUMA  | #cpu | Before  | After
--------------------------------------
Close | 1    | 18 Gbps | 18 Gbps
Far   | 1    | 15 Gbps | 18 Gbps
Close | 12   | 80 Gbps | 80 Gbps
Far   | 12   | 68 Gbps | 80 Gbps

In all test cases we see improvement for the far numa case, and no
impact on the close numa case.

==================

Performance analysis and conclusions by Jesper [1]:
Impact on XDP drop x86_64 is inconclusive and shows only 0.3459ns
slow-down, as this is below measurement accuracy of system.

v2->v3:
 - Rebase on top of latest net-next and Jesper's page pool object
   release patchset [2]
 - No code changes
 - Performance analysis by Jesper added to the cover letter.

v1->v2:
  - Drop last patch, as requested by Ilias and Jesper.
  - Fix documentation's performance numbers order.

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool04_inflight_changes.org#performance-notes
[2] https://patchwork.ozlabs.org/cover/1192098/
====================
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

77c05d2f

net/mlx5e: Rx, Update page pool numa node when changed · 6849c6d8

Saeed Mahameed authored Nov 20, 2019

Once every napi poll cycle, check if numa node is different than
the page pool's numa id, and update it using page_pool_update_nid().

Alternatively, we could have registered an irq affinity change handler,
but page_pool_update_nid() must be called from napi context anyways, so
the handler won't actually help.

Performance testing:
XDP drop/tx rate and TCP single/multi stream, on mlx5 driver
while migrating rx ring irq from close to far numa:

mlx5 internal page cache was locally disabled to get pure page pool
results.

CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)

XDP Drop/TX single core:
NUMA  | XDP  | Before    | After
---------------------------------------
Close | Drop | 11   Mpps | 10.9 Mpps
Far   | Drop | 4.4  Mpps | 5.8  Mpps

Close | TX   | 6.5 Mpps  | 6.5 Mpps
Far   | TX   | 3.5 Mpps  | 4  Mpps

Improvement is about 30% drop packet rate, 15% tx packet rate for numa
far test.
No degradation for numa close tests.

TCP single/multi cpu/stream:
NUMA  | #cpu | Before  | After
--------------------------------------
Close | 1    | 18 Gbps | 18 Gbps
Far   | 1    | 15 Gbps | 18 Gbps
Close | 12   | 80 Gbps | 80 Gbps
Far   | 12   | 68 Gbps | 80 Gbps

In all test cases we see improvement for the far numa case, and no
impact on the close numa case.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6849c6d8