1. 15 May, 2020 40 commits
    • Merge branch 'mptcp-fix-MP_JOIN-failure-handling' · 93d43e58
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      mptcp: fix MP_JOIN failure handling
      
      Currently if we hit an MP_JOIN failure on the third ack, the child socket is
      closed with reset, but the request socket is not deleted, causing weird
      behaviors.
      
      The main problem is that MPTCP's MP_JOIN code needs to plug its own
      'valid 3rd ack' checks and the current TCP callbacks do not allow that.

      This series tries to address the above shortcoming by introducing a new
      MPTCP-specific bit in a 'struct tcp_request_sock' hole, and leveraging that
      to allow tcp_check_req() to release the request socket when needed.

      The above also allows cleaning up the current MPTCP hooking in
      tcp_check_req() a bit.

      An alternative solution, possibly cleaner but more invasive, would be
      changing the 'bool *own_req' syn_recv_sock() argument into 'int *req_status'
      and letting MPTCP set it to 'REQ_DROP'.
      
      v1 -> v2:
       - be more conservative about drop_req initialization
      
      RFC -> v1:
       - move the drop_req bit inside tcp_request_sock (Eric)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: cope better with MP_JOIN failure · 729cd643
      Paolo Abeni authored
      Currently, on MP_JOIN failure we reset the child
      socket, but leave the request socket untouched.
      
      tcp_check_req will deal with it according to the
      'tcp_abort_on_overflow' sysctl value - by default the
      req socket will stay alive.
      
      The above leads to inconsistent behavior on MP JOIN
      failure, and bad listener overflow accounting.
      
      This patch addresses the issue by leveraging the infrastructure
      just introduced to ask the TCP stack to drop the req on
      failure.

      The child socket is no longer freed by subflow_syn_recv_sock();
      instead it's moved to a dead state and will be disposed of by the
      next sock_put done by the TCP stack, so that listener overflow
      accounting is not affected by MP JOIN failure.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet_connection_sock: factor out destroy helper. · 2f8a397d
      Paolo Abeni authored
      Move the steps to prepare an inet_connection_sock for
      forced disposal inside a separate helper. No functional
      changes intended, this will just simplify the next patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add new sock flag to deal with join subflows · 90bf4513
      Paolo Abeni authored
      MP_JOIN subflows must not land in the accept queue.
      Currently tcp_check_req() calls an MPTCP-specific helper
      to detect such a scenario.

      That helper leverages the subflow context to check for
      MP_JOIN subflows. We also need to deal with MP JOIN
      failures, even when the subflow context is not available
      due to allocation failure.
      
      A possible solution would be changing the syn_recv_sock()
      signature to allow returning a more descriptive action/
      error code and deal with that in tcp_check_req().
      
      Since the above need is MPTCP-specific, this patch instead
      uses a TCP request socket hole to add an MPTCP-specific flag.
      This flag is used by the MPTCP syn_recv_sock() to tell
      tcp_check_req() how to deal with the request socket.

      This change is a no-op for !MPTCP builds, and makes the
      MPTCP code simpler. It also allows the next patch to deal
      correctly with MP JOIN failure.
      
      v1 -> v2:
       - be more conservative on drop_req initialization (Mat)
      
      RFC -> v1:
       - move the drop_req bit inside tcp_request_sock (Eric)
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Reviewed-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
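The flag handshake described in the commit above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the struct and function names are hypothetical stand-ins, not the kernel's definitions, and the real tcp_check_req() does considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a TCP request socket. */
struct req_model {
    bool drop_req; /* models the MPTCP-specific bit placed in the struct hole */
    bool queued;   /* true while the request is still tracked by the listener */
};

/* Models the MPTCP syn_recv_sock() hook: it only records the decision. */
static void model_syn_recv_sock(struct req_model *req, bool is_mp_join)
{
    assert(req != NULL);
    req->drop_req = is_mp_join;
}

/*
 * Models the caller: when the flag is set, the request socket is released
 * and the child is kept out of the accept queue.
 * Returns true when the child would be added to the accept queue.
 */
static bool model_check_req(struct req_model *req, bool is_mp_join)
{
    model_syn_recv_sock(req, is_mp_join);
    if (req->drop_req) {
        req->queued = false; /* request socket dropped */
        return false;
    }
    return true;
}
```

Calling model_check_req() for an MP_JOIN subflow clears the queued state and returns false, while a plain subflow is left untouched and returns true.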
    • net: phy: tja11xx: execute cable test on link up · ca1c933b
      Oleksij Rempel authored
      A typical 100Base-T1 link should always be connected. If the link is in
      a short or open state, it is a failure. In most cases, we won't be able
      to handle this issue automatically, but we need to log it or notify the
      user (if possible).

      With this patch, the cable is tested on each "ip l s dev .. up" attempt
      and an ethnl notification is sent to user space.
      
      This patch was tested with TJA1102 PHY and "ethtool --monitor" command.
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: broadcom: add support for BCM54811 PHY · b0ed0bbf
      Kevin Lo authored
      The BCM54811 PHY shares many similarities with the already supported BCM54810
      PHY but additionally requires some semi-unique configuration.
      Signed-off-by: Kevin Lo <kevlo@kevlo.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'cxgb4-improve-and-tune-TC-MQPRIO-offload' · d42d118c
      David S. Miller authored
      Rahul Lakkireddy says:
      
      ====================
      cxgb4: improve and tune TC-MQPRIO offload
      
      Patch 1 improves the Tx path's credit request and recovery mechanism
      when running under heavy load.
      
      Patch 2 adds ability to tune the burst buffer sizes of all traffic
      classes to improve performance for <= 1500 MTU, under heavy load.
      
      Patch 3 adds support to track EOTIDs and dump software queue
      contexts used by TC-MQPRIO offload.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: add EOTID tracking and software context dump · 5148e595
      Rahul Lakkireddy authored
      Rework and add support for dumping EOTID software context used by
      TC-MQPRIO. Also track number of EOTIDs in use.
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: tune burst buffer size for TC-MQPRIO offload · 4bccfc03
      Rahul Lakkireddy authored
      For each traffic class, firmware handles up to 4 * MTU amount of data
      per burst cycle. Under heavy load, this small buffer size is a
      bottleneck when buffering large TSO packets in <= 1500 MTU case.
      Increase the burst buffer size to 8 * MTU when supported.
      
      Also, keep the driver's traffic class configuration API similar to
      the firmware API counterpart.
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
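To make the burst sizing above concrete, here is a tiny arithmetic sketch. The function name and the boolean capability parameter are hypothetical; only the 4 * MTU and 8 * MTU multipliers come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Burst buffer size in bytes for one traffic class: 8 * MTU when the
 * larger size is supported, otherwise 4 * MTU.
 */
static unsigned int burst_buffer_bytes(unsigned int mtu, bool large_burst_supported)
{
    assert(mtu > 0);
    return (large_burst_supported ? 8u : 4u) * mtu;
}
```

With a 1500-byte MTU this gives 6000 bytes per burst cycle in the old scheme and 12000 bytes in the new one.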
    • cxgb4: improve credits recovery in TC-MQPRIO Tx path · 4f1d9726
      Rahul Lakkireddy authored
      Request a credit update each time half of the credits have been consumed,
      including the current request. Also, avoid retrying to post packets when
      there are no credits left. The credit update reply via interrupt will
      eventually restore the credits and invoke the Tx path again.
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
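The credit policy in the commit above might be modelled roughly as follows. This is a sketch, not the driver's code: the struct, field and enum names are hypothetical, and the exact threshold comparison is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a software Tx queue's credit state. */
struct txq_model {
    unsigned int max_credits;  /* total credits of the queue */
    unsigned int credits;      /* credits currently available */
    unsigned int since_update; /* credits consumed since the last update request */
};

enum post_result { POST_NO_CREDITS, POST_SENT, POST_SENT_REQ_UPDATE };

static enum post_result model_post(struct txq_model *q, unsigned int need)
{
    assert(q != NULL && need > 0);

    /* No credits left: do not retry, wait for the credit update reply. */
    if (need > q->credits)
        return POST_NO_CREDITS;

    q->credits -= need;
    q->since_update += need; /* the current request is counted too */

    /* Ask for a credit update once half of the credits have been used. */
    if (q->since_update >= q->max_credits / 2) {
        q->since_update = 0;
        return POST_SENT_REQ_UPDATE;
    }
    return POST_SENT;
}
```

In this model a queue with 64 credits sends a first 16-credit request normally, attaches an update request to the second one, and refuses a request larger than the remaining credits without retrying.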
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 3430223d
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2020-05-15
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 37 non-merge commits during the last 1 day(s) which contain
      a total of 67 files changed, 741 insertions(+), 252 deletions(-).
      
      The main changes are:
      
      1) bpf_xdp_adjust_tail() can now grow the tail as well, from Jesper.
      
      2) bpftool can probe CONFIG_HZ, from Daniel.
      
      3) CAP_BPF is introduced to isolate user processes that use BPF infra and
         to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
         in certain cases, from Alexei.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mt7530: fix VLAN setup · 0141792f
      DENG Qingfang authored
      Allow DSA to add VLAN entries even if VLAN filtering is disabled, so
      enabling it will not block the traffic of existing ports in the bridge.
      Signed-off-by: DENG Qingfang <dqfext@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Implement-classifier-action-terse-dump-mode' · cd2809cc
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      Implement classifier-action terse dump mode
      
      The output rate of the current upstream kernel TC filter dump
      implementation is relatively low (~100k rules/sec depending on
      configuration). This constraint impacts the performance of software switch
      implementations that rely on TC for their datapath and periodically call TC
      filter dump to update rule stats. Moreover, TC filter dump outputs a lot
      of static data that doesn't change during the filter lifecycle (filter
      key, specific action details, etc.), which constitutes a significant
      portion of the payload of the resulting netlink packets and increases the
      number of syscalls necessary to dump all filters on a particular Qdisc. In
      order to significantly improve the filter dump rate, this patch set
      implements a new mode of TC filter dump operation named "terse dump" mode.
      In this mode only parameters necessary to identify the filter (handle,
      action cookie, etc.) and data that can change during the filter lifecycle
      (filter flags, action stats, etc.) are preserved in the dump output, while
      everything else is omitted.
      
      The userspace API is implemented using a new TCA_DUMP_FLAGS TLV whose only
      available flag value is TCA_DUMP_FLAGS_TERSE. Internally, the new API
      requires individual classifier support (new tcf_proto_ops->terse_dump()
      callback). Support for action terse dump is implemented in the act API and
      doesn't require changing individual action implementations.
      
      The following table provides performance comparison between regular
      filter dump and new terse dump mode for two classifier-action profiles:
      one minimal config with L2 flower classifier and single gact action and
      another heavier config with L2+5tuple flower classifier with
      tunnel_key+mirred actions.
      
       Classifier-action type      |        dump |  terse dump | X improvement
                                   | (k rules/s) | (k rules/s) |
      -----------------------------+-------------+-------------+---------------
       L2 with gact                |       141.8 |       293.2 |          2.07
       L2+5tuple tunnel_key+mirred |        76.4 |       198.8 |          2.60
      
      Benchmark details: to measure the rate, the tc filter dump and terse dump
      commands are invoked on an ingress Qdisc that has one million filters
      configured, using the following commands.
      
      > time sudo tc -s filter show dev ens1f0 ingress >/dev/null
      
      > time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null
      
      Value in results table is calculated by dividing 1000000 total rules by
      "real" time reported by time command.
      
      Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory
      ====================
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
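The benchmark arithmetic in the cover letter above (total rules divided by the "real" time, then the ratio of the two rates) can be reproduced with a short helper. The function names are hypothetical, chosen for this sketch only.

```c
#include <assert.h>

/* Dump rate: total number of rules divided by the measured wall-clock time. */
static double dump_rate(double total_rules, double real_seconds)
{
    assert(real_seconds > 0.0);
    return total_rules / real_seconds;
}

/* "X improvement" column: terse dump rate relative to the regular dump rate. */
static double dump_improvement(double regular_rate, double terse_rate)
{
    assert(regular_rate > 0.0);
    return terse_rate / regular_rate;
}
```

Feeding the two table rows into dump_improvement() gives ratios that round to the 2.07 and 2.60 shown in the last column.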
    • selftests: implement flower classifier terse dump tests · e7534fd4
      Vlad Buslov authored
      Implement two basic tests to verify terse dump functionality of flower
      classifier:
      
      - Test that verifies that terse dump works.
      
      - Test that verifies that terse dump doesn't print filter key.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_flower: implement terse dump support · 0348451d
      Vlad Buslov authored
      Implement tcf_proto_ops->terse_dump() callback for flower classifier. Only
      dump handle, flags and action data in terse mode.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: implement terse dump support in act · ca44b738
      Vlad Buslov authored
      Extend tcf_action_dump() with boolean argument 'terse' that is used to
      request terse-mode action dump. In terse mode only essential data needed to
      identify particular action (action kind, cookie, etc.) and its stats is put
      to resulting skb and everything else is omitted. Implement
      tcf_exts_terse_dump() helper in cls API that is intended to be used to
      request terse dump of all exts (actions) attached to the filter.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: introduce terse dump flag · f8ab1807
      Vlad Buslov authored
      Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
      filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
      is intended to be used to improve performance of TC filter dump when
      userland only needs to obtain stats and not the whole classifier/action
      data. Extend struct tcf_proto_ops with new terse_dump() callback that must
      be defined by supporting classifier implementations.
      
      Support of the options in specific classifiers and actions is
      implemented in following patches in the series.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: recursively find netdev by device node · 2e186a2c
      Tobias Waldekranz authored
      The assumption that a device node is associated either with the
      netdev's device, or the parent of that device, does not hold for all
      drivers. E.g. Freescale's DPAA has two layers of platform devices
      above the netdev. Instead, recursively walk up the tree from the
      netdev, allowing any parent to match against the sought after node.
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
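The lookup change described above amounts to walking the parent chain instead of checking only one or two levels. A minimal userspace sketch, assuming hypothetical stand-in types rather than the kernel's struct device and struct device_node, and using a loop in place of recursion:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a device tree node and a device. */
struct node_model { int id; };

struct device_model {
    struct device_model *parent;     /* NULL at the top of the hierarchy */
    const struct node_model *of_node; /* associated device node, may be NULL */
};

/*
 * Walk up from the given device and report whether it or any ancestor
 * is associated with the sought-after node.
 */
static bool dev_or_ancestor_matches(const struct device_model *dev,
                                    const struct node_model *np)
{
    assert(np != NULL);
    for (; dev != NULL; dev = dev->parent)
        if (dev->of_node == np)
            return true;
    return false;
}
```

With two intermediate layers between the netdev and the device that owns the node, as in the DPAA case mentioned in the commit, the walk still finds the match.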
    • Merge branch 'bpf-cap' · ed24a7a8
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      v6->v7:
      - permit SK_REUSEPORT program type under CAP_BPF as suggested by Marek Majkowski.
        It's equivalent to SOCKET_FILTER which is unpriv.
      
      v5->v6:
      - split allow_ptr_leaks into four flags.
      - retain bpf_jit_limit under cap_sys_admin.
      - fixed few other issues spotted by Daniel.
      
      v4->v5:
      
      Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of
      CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN.
      
      The user process has to have
      - CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs.
        Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN.
        That could be relaxed in the future.
      - CAP_BPF and CAP_PERFMON to load tracing programs.
      - CAP_BPF and CAP_NET_ADMIN to load networking programs.
      (or CAP_SYS_ADMIN for backward compatibility).
      
      CAP_BPF solves three main goals:
      1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
         More on this below. This is the major difference vs v4 set back from Sep 2019.
      2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
         prevents pointer leaks and arbitrary kernel memory access.
      3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
         and making BPF infra more secure. Currently fuzzers run in unpriv.
         They will be able to run with CAP_BPF.
      
      The patchset is a long overdue follow-up from the last Plumbers conference.
      Compared to what was discussed at LPC, the CAP* checks at attach time are gone.
      For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was
      no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was
      required at load time and CAP_NET_ADMIN at attach time, but there are several
      ways to bypass CAP_NET_ADMIN:
      - if networking prog is using tail_call writing FD into prog_array will
        effectively attach it, but bpf_map_update_elem is an unprivileged operation.
      - freplace prog with CAP_SYS_ADMIN can replace networking prog
      
      Consolidating all CAP checks at load time makes security model similar to
      open() syscall. Once the user got an FD it can do everything with it.
      read/write/poll don't check permissions. The same way when bpf_prog_load
      command returns an FD the user can do everything (including attaching,
      detaching, and bpf_test_run).
      
      The important design decision is to allow ID->FD transition for
      CAP_SYS_ADMIN only. This means that user processes can run
      with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each
      other unless they pass FDs via scm_rights or via pinning in bpffs.
      ID->FD is a mechanism for human override and introspection.
      An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that
      only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user
      space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps,
      links) that are owned by such processes from each other.
      
      Another significant change from LPC is that the verifier checks are split into
      four flags. The allow_ptr_leaks flag allows pointer manipulations. The
      bpf_capable flag enables all modern verifier features like bpf-to-bpf calls,
      BTF, bounded loops, dead code elimination, etc. All the goodness. The
      bypass_spec_v1 flag enables indirect stack access from bpf programs and
      disables speculative analysis and bpf array mitigations. The bypass_spec_v4
      flag disables store sanitation. That allows networking progs with CAP_BPF +
      CAP_NET_ADMIN to enjoy modern verifier features while being more secure.
      
      Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON,
      since subtracting pointers (like skb->data_end - skb->data) is a pointer leak,
      but the verifier may get smarter in the future.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Use CAP_BPF and CAP_PERFMON in tests · 81626001
      Alexei Starovoitov authored
      Make all test_verifier tests exercise CAP_BPF and CAP_PERFMON.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200513230355.7858-4-alexei.starovoitov@gmail.com
    • bpf: Implement CAP_BPF · 2c78ee89
      Alexei Starovoitov authored
      Implement permissions as stated in uapi/linux/capability.h
      In order to do that the verifier allow_ptr_leaks flag is split
      into four flags and they are set as:
        env->allow_ptr_leaks = bpf_allow_ptr_leaks();
        env->bypass_spec_v1 = bpf_bypass_spec_v1();
        env->bypass_spec_v4 = bpf_bypass_spec_v4();
        env->bpf_capable = bpf_capable();
      
      The first three are currently equivalent to perfmon_capable(), since leaking kernel
      pointers and reading kernel memory via side channel attacks is roughly
      equivalent to reading kernel memory with cap_perfmon.
      
      'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
      other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
      subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
      verifier, run time mitigations in bpf array, and enables indirect variable
      access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
      by the verifier.
      
      That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
      will have speculative checks done by the verifier and other spectre mitigation
      applied. Such networking BPF program will not be able to leak kernel pointers
      and will not be able to access arbitrary kernel memory.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
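How the four verifier flags follow from a task's capabilities, as described in the commit above, can be sketched in userspace. The capability bits, struct and function names below are hypothetical (they are not the kernel's numeric capability values), and CAP_SYS_ADMIN is treated as implying the others for backward compatibility.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits for this sketch (not the kernel's values). */
enum {
    M_CAP_BPF       = 1u << 0,
    M_CAP_PERFMON   = 1u << 1,
    M_CAP_NET_ADMIN = 1u << 2,
    M_CAP_SYS_ADMIN = 1u << 3,
    M_CAP_ALL       = (1u << 4) - 1
};

struct verifier_flags_model {
    bool allow_ptr_leaks;
    bool bypass_spec_v1;
    bool bypass_spec_v4;
    bool bpf_capable;
};

/* A capability counts as held if it or CAP_SYS_ADMIN is present. */
static bool model_capable(unsigned int caps, unsigned int cap)
{
    assert((caps & ~(unsigned int)M_CAP_ALL) == 0);
    return (caps & (cap | M_CAP_SYS_ADMIN)) != 0;
}

static struct verifier_flags_model model_verifier_flags(unsigned int caps)
{
    /* The first three flags currently track the perfmon check. */
    bool perfmon = model_capable(caps, M_CAP_PERFMON);
    struct verifier_flags_model f = {
        .allow_ptr_leaks = perfmon,
        .bypass_spec_v1  = perfmon,
        .bypass_spec_v4  = perfmon,
        .bpf_capable     = model_capable(caps, M_CAP_BPF),
    };
    return f;
}
```

In this model a loader holding only CAP_BPF and CAP_NET_ADMIN gets bpf_capable set while the pointer-leak and speculation-bypass flags stay off, which matches the last paragraph of the commit message.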
    • bpf, capability: Introduce CAP_BPF · a17b53c4
      Alexei Starovoitov authored
      Split BPF operations that are allowed under CAP_SYS_ADMIN into
      combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
      For backward compatibility include them in CAP_SYS_ADMIN as well.
      
      The end result provides simple safety model for applications that use BPF:
      - to load tracing program types
        BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
        use CAP_BPF and CAP_PERFMON
      - to load networking program types
        BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
        use CAP_BPF and CAP_NET_ADMIN
      
      There are a few exceptions to this rule:
      - bpf_trace_printk() is allowed in networking programs, but it's using
        tracing mechanism, hence this helper needs additional CAP_PERFMON
        if networking program is using this helper.
      - BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
        to discourage production use.
      - BPF HW offload is allowed under CAP_SYS_ADMIN.
      - bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.
      
      CAPs are not checked at attach/detach time with two exceptions:
      - loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
        hence CAP_NET_ADMIN is required at attach time.
      - flow_dissector detach doesn't check prog FD at detach,
        hence CAP_NET_ADMIN is required at detach time.
      
      CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
      command and convert them to file descriptor via GET_FD_BY_ID command.
      This restriction guarantees that multiple tasks with CAP_BPF are not able to
      affect each other. That leads to clean isolation of tasks. For example:
      task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
      task B with the same capabilities cannot detach that firewall unless
      task A explicitly passed link FD to task B via scm_rights or bpffs.
      CAP_SYS_ADMIN can still detach/unload everything.
      
      Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
      accidentally mess with each other's programs and maps.
      Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.
      
      CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
      Such networking progs cannot access arbitrary kernel memory or leak pointers.
      
      bpftool, bpftrace, bcc tools binaries should NOT be installed with
      CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
      But users with these two permissions will be able to use these tracing tools.
      
      CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
      CAP_NET_ADMIN can stop network traffic via iproute2.
      CAP_BPF is the safest from security point of view and harmless on its own.
      
      Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
      and if that map is used by firewall-like bpf prog.
      CAP_BPF allows many bpf prog_load commands in parallel. The verifier
      may consume large amount of memory and significantly slow down the system.
      
      Existing unprivileged BPF operations are not affected.
      In particular unprivileged users are allowed to load socket_filter and cg_skb
      program types and to create array, hash, prog_array, map-in-map map types.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com
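The load-time rules listed in the commit above can be condensed into a small decision function. This is a simplified sketch: the capability bits and names are hypothetical, only the tracing and networking classes are modelled, and the listed exceptions and the unprivileged program types are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits for this sketch (not the kernel's values). */
enum {
    M_CAP_BPF       = 1u << 0,
    M_CAP_PERFMON   = 1u << 1,
    M_CAP_NET_ADMIN = 1u << 2,
    M_CAP_SYS_ADMIN = 1u << 3
};

enum prog_class_model { PROG_TRACING, PROG_NETWORKING };

/* May a task with these capabilities load a program of the given class? */
static bool model_may_load(unsigned int caps, enum prog_class_model cls)
{
    assert(cls == PROG_TRACING || cls == PROG_NETWORKING);

    if (caps & M_CAP_SYS_ADMIN) /* kept for backward compatibility */
        return true;
    if (!(caps & M_CAP_BPF))
        return false;
    return cls == PROG_TRACING ? (caps & M_CAP_PERFMON) != 0
                               : (caps & M_CAP_NET_ADMIN) != 0;
}

/* The ID->FD transition is reserved for CAP_SYS_ADMIN. */
static bool model_may_get_fd_by_id(unsigned int caps)
{
    return (caps & M_CAP_SYS_ADMIN) != 0;
}
```

Two tasks that both hold CAP_BPF and CAP_NET_ADMIN can each load networking programs in this model, but neither passes model_may_get_fd_by_id(), which is the isolation property the commit message describes.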
    • bpf, bpftool: Allow probing for CONFIG_HZ from kernel config · 0ee52c0f
      Daniel Borkmann authored
      In Cilium we've recently switched to bpf_jiffies64() for parts of our
      tc and XDP datapath, since bpf_ktime_get_ns() is more expensive and
      high precision is not needed for our timeouts anyway. Our agent has
      a probe manager which picks up the JSON of bpftool's feature probe,
      and we also use the macro output in our C programs, e.g. to have
      workarounds when helpers are not available on older kernels.
      
      Extend the kernel config info dump to also include the kernel's
      CONFIG_HZ, and rework probe_kernel_image_config() to allow a macro
      dump, such that CONFIG_HZ can be propagated to BPF C code as a
      simple define if available via config. The latter allows
      _compile-time_ resolution of jiffies <-> sec conversion in our code,
      since all values are propagated as known constants.
      
      Given we cannot generally assume availability of kconfig everywhere,
      we also have a kernel hz probe [0] as a fallback. Potentially, bpftool
      could have an integrated probe fallback as well, although to derive it,
      we might need to place it under 'bpftool feature probe full' or similar
      given it would slow down the probing process overall. Yet 'full' doesn't
      fit either for us since we don't want to pollute the kernel log with
      warning messages from bpf_probe_write_user() and bpf_trace_printk() on
      agent startup; I've left it out for the time being.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200513075849.20868-1-daniel@iogearbox.net
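The compile-time jiffies <-> sec conversion mentioned above could look roughly like this once CONFIG_HZ is available as a plain define. The helper names are hypothetical, and the fallback value of 250 is only an assumption for illustration; the real value would come from the probe's macro output.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fallback for this sketch; normally supplied as a generated define. */
#ifndef CONFIG_HZ
#define CONFIG_HZ 250
#endif

static_assert(CONFIG_HZ > 0, "CONFIG_HZ must be positive");

/* With CONFIG_HZ a known constant, both conversions fold at compile time. */
static inline uint64_t timeout_secs_to_jiffies(uint64_t secs)
{
    return secs * CONFIG_HZ;
}

static inline uint64_t jiffies_to_timeout_secs(uint64_t jiffies)
{
    return jiffies / CONFIG_HZ;
}
```

A timeout expressed in seconds can then be compared directly against a jiffies counter without any runtime lookup of the tick rate.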
    • Merge branch 'xdp-grow-tail' · 5cc5924d
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      V4:
      - Fixup checkpatch.pl issues
      - Collected more ACKs
      
      V3:
      - Fix issue on virtio_net patch spotted by Jason Wang
      - Adjust name for variable in mlx5 patch
      - Collected more ACKs
      
      V2:
      - Fix bug in mlx5 for XDP_PASS case
      - Collected nitpicks and ACKs from mailing list
      
      V1:
      - Fix bug in dpaa2
      
      XDP has evolved to support several frame sizes, but xdp_buff was not
      updated with this information. This has had the side effect that the
      XDP frame data hard end is unknown, which has limited the BPF helper
      bpf_xdp_adjust_tail to only shrinking the packet. This patchset addresses
      this and adds packet tail extend/grow.
      
      The purpose of the patchset is ALSO to reserve a memory area that can be
      used for storing extra information, specifically for extending XDP with
      multi-buffer support. One proposal is to use same layout as
      skb_shared_info, which is why this area is currently 320 bytes.
      
      When converting xdp_frame to SKB (veth and cpumap), the full tailroom
      area can now be used and SKB truesize is now correct. For most
      drivers this results in a much larger tailroom in SKB "head" data
      area. The network stack can now take advantage of this when doing SKB
      coalescing. Thus, a good driver test is to use xdp_redirect_cpu from
      samples/bpf/ and do some TCP stream testing.
      
      Use-cases for tail grow/extend:
      (1) IPsec / XFRM needs a tail extend[1][2].
      (2) DNS-cache responses in XDP.
      (3) HAProxy ALOHA would need it to convert to XDP.
      (4) Add tail info e.g. timestamp and collect via tcpdump
      
      [1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
      [2] http://vger.kernel.org/netconf2019.html
      
      Examples of how to access the tail area of an XDP packet are shown in the
      XDP-tutorial example[3].
      
      [3] https://github.com/xdp-project/xdp-tutorial/blob/master/experiment01-tailgrow/
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
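The bounds logic that a known frame size makes possible can be illustrated with a userspace model. This is a sketch under stated assumptions: the struct and function are hypothetical stand-ins for xdp_buff and the helper, the 320-byte reserve is the figure quoted in the cover letter, and the lower bound on shrinking is simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_TAILROOM_RESERVE 320 /* reserved area size quoted in the cover letter */

/* Hypothetical, simplified model of an XDP buffer. */
struct xdp_buff_model {
    unsigned char *hard_start; /* start of the frame memory */
    size_t frame_sz;           /* total frame size, now known to the helper */
    size_t data_off;           /* offset of the first packet byte */
    size_t data_end_off;       /* offset one past the last packet byte */
};

/* Returns 0 on success, -1 if the new tail would be out of bounds. */
static int model_adjust_tail(struct xdp_buff_model *x, long delta)
{
    assert(x != NULL && x->frame_sz >= MODEL_TAILROOM_RESERVE);

    size_t hard_end = x->frame_sz - MODEL_TAILROOM_RESERVE;
    long new_end = (long)x->data_end_off + delta;

    if (new_end > (long)hard_end)     /* growing into the reserved area */
        return -1;
    if (new_end <= (long)x->data_off) /* simplification: keep at least one byte */
        return -1;

    /* Clear the newly exposed bytes so stale memory is not leaked. */
    if (delta > 0)
        memset(x->hard_start + x->data_end_off, 0, (size_t)delta);

    x->data_end_off = (size_t)new_end;
    return 0;
}
```

Growing by a few bytes succeeds and zeroes the new tail, while a request that would reach into the reserved tailroom is rejected.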
    • selftests/bpf: Xdp_adjust_tail add grow tail tests · 7ae2e00e
      Jesper Dangaard Brouer authored
      Extend BPF selftest xdp_adjust_tail with grow tail tests, which are added
      as subtests. The first grow test stays in the same form as the original
      shrink test. The second grow test uses the newer bpf_prog_test_run_xattr()
      calls, and does extra checking of data contents.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158945350567.97035.9632611946765811876.stgit@firesoul
    • selftests/bpf: Adjust BPF selftest for xdp_adjust_tail · 68545fb6
      Jesper Dangaard Brouer authored
      The current selftest for BPF-helper xdp_adjust_tail only shrinks the tail.
      Make it clearer that this is a shrink test case.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158945350058.97035.17280775016196207372.stgit@firesoul
    • bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp(). · bc56c919
      Jesper Dangaard Brouer authored
      Update the memory requirements when adding xdp.frame_sz in the BPF test_run
      function bpf_prog_test_run_xdp(), which e.g. is used by XDP selftests.
      
      Specifically, add the expected reserved tailroom, but also allocate a
      larger memory area to reflect that XDP frames usually come in this
      format. Limit the provided packet data size to 4096 minus (headroom +
      tailroom), as this also reflects a common 3520 bytes MTU limit with XDP.
      
      Note that bpf_test_init already uses a memory allocation method that clears
      memory.  Thus, this already guards against leaking uninit kernel memory.
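      
      The size limit above is plain arithmetic. A minimal sketch, assuming a
      4096-byte frame, the kernel's 256-byte XDP_PACKET_HEADROOM and the
      320-byte reserved tailroom quoted in the cover letter; the MODEL_* names
      and the function are illustrative and not taken from the patch:

```c
#include <assert.h>

#define MODEL_FRAME_SZ  4096u  /* one 4K page per XDP frame (assumption) */
#define MODEL_HEADROOM   256u  /* XDP_PACKET_HEADROOM */
#define MODEL_TAILROOM   320u  /* reserved tail area size from the cover letter */

/* Largest packet data size that fits between headroom and reserved tailroom. */
static inline unsigned int model_max_data_size(unsigned int frame_sz,
                                               unsigned int headroom,
                                               unsigned int tailroom)
{
    assert(frame_sz >= headroom + tailroom);
    return frame_sz - headroom - tailroom;
}
```

      With these inputs the function yields 4096 - 256 - 320 = 3520 bytes,
      matching the limit named in the message.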
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158945349549.97035.15316291762482444006.stgit@firesoul
    • xdp: Clear grow memory in bpf_xdp_adjust_tail() · ddb47d51
      Jesper Dangaard Brouer authored
      Clear the tail memory when a grow happens, because it is too easy
      to write an XDP_PASS program that extends the tail, which exposes
      this memory to users that can run tcpdump.
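      
      To illustrate why the clearing matters, here is a small user-space
      model, not the kernel code: struct pkt_model and pkt_grow_tail_zeroed()
      are hypothetical names, and the caller is assumed to have already
      verified that the buffer has room for the grow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a packet buffer: only the fields needed here. */
struct pkt_model {
    unsigned char *data;      /* start of packet bytes */
    unsigned char *data_end;  /* one past the last packet byte */
};

/* Extend the packet by 'grow' bytes and zero the newly exposed region,
 * so stale bytes left in the buffer never become visible as packet data.
 * Assumes the caller checked that 'grow' bytes are available. */
static inline void pkt_grow_tail_zeroed(struct pkt_model *p, size_t grow)
{
    memset(p->data_end, 0, grow);
    p->data_end += grow;
}
```

      Without the memset(), whatever a previous user of the buffer left behind
      the old tail would be delivered as part of the packet.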
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/158945349039.97035.5262100484553494.stgit@firesoul
    • xdp: Allow bpf_xdp_adjust_tail() to grow packet size · c8741e2b
      Jesper Dangaard Brouer authored
      Finally, after all drivers have a frame size, allow BPF-helper
      bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.
      
      Remember that helper/macro xdp_data_hard_end has reserved some
      tailroom.  Thus, this helper makes sure that the BPF-prog doesn't have
      access to this tailroom area.
      
      V2: Remove one chicken check and use WARN_ONCE for other
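      
      The bounds rule described above can be sketched in user space as
      follows. This is a simplified model under stated assumptions: all
      positions are byte offsets from data_hard_start, struct xdp_model and
      both functions are hypothetical names, and the lower bound of one
      Ethernet header is an assumption about the minimum packet length.

```c
#include <assert.h>

#define MODEL_ETH_HLEN 14  /* Ethernet header length in bytes */

/* All positions are offsets from the start of the frame (data_hard_start). */
struct xdp_model {
    long data_off;           /* where packet data begins */
    long data_end_off;       /* one past the last packet byte */
    long frame_sz;           /* full frame size */
    long reserved_tailroom;  /* tail area the BPF program may not touch */
};

/* Last offset a program may extend the packet to. */
static inline long model_hard_end_off(const struct xdp_model *x)
{
    assert(x->frame_sz >= x->reserved_tailroom);
    return x->frame_sz - x->reserved_tailroom;
}

/* Move the tail by 'offset' bytes (positive grows, negative shrinks).
 * Returns 0 on success, -1 if the new tail would enter the reserved
 * tailroom or leave less than an Ethernet header of data. */
static inline int model_adjust_tail(struct xdp_model *x, long offset)
{
    long new_end = x->data_end_off + offset;

    if (new_end > model_hard_end_off(x))
        return -1;
    if (new_end < x->data_off + MODEL_ETH_HLEN)
        return -1;
    x->data_end_off = new_end;
    return 0;
}
```

      For a 4096-byte frame with 320 bytes reserved, a grow may move the tail
      up to offset 3776 and no further.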
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul
    • mlx5: Rx queue setup time determine frame_sz for XDP · d628ee4f
      Jesper Dangaard Brouer authored
      The mlx5 driver has multiple memory models, which are also changed
      according to whether an XDP bpf_prog is attached.
      
      The 'rx_striding_rq' setting is adjusted via ethtool priv-flags e.g.:
       # ethtool --set-priv-flags mlx5p2 rx_striding_rq off
      
      In the general case, with 4K page_size and regular MTU packets, the
      frame_sz is 2048, and 4096 when XDP is enabled, in both modes.
      
      The info on the given frame size is stored differently depending on the
      RQ-mode and encoded in a union in struct mlx5e_rq union wqe/mpwqe.
      In rx striding mode rq->mpwqe.log_stride_sz is either 11 or 12, which
      corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ).
      In non-striding mode (MLX5_WQ_TYPE_CYCLIC) the frag_stride is stored
      in rq->wqe.info.arr[0].frag_stride, for the first fragment, which is
      what the XDP case cares about.
      
      To reduce the effect on the fast-path, this patch determines the frame_sz at
      setup time, to avoid determining the memory model at runtime. The variable
      is named frame0_sz to make it clear that this is only the frame
      size of the first fragment.
      
      This mlx5 driver does a DMA-sync on the XDP_TX action, but grow is safe
      as it has done a DMA-map on the entire PAGE_SIZE. The driver also
      already does an XDP length check against sq->hw_mtu on the possible
      XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe().
      
      V3+4: Change variable name first_frame_sz to frame0_sz
      
      V2: Fix that frag_size needs to be recalculated before creating SKB.
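      
      The idea of resolving the frame size once at setup time can be modelled
      roughly as below. This is an illustrative sketch, not the driver code:
      the enum, struct model_rq and its fields are hypothetical stand-ins for
      the union in struct mlx5e_rq described above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two RQ modes named in the message. */
enum model_wq_type { MODEL_WQ_CYCLIC, MODEL_WQ_STRIDING };

struct model_rq {
    enum model_wq_type wq_type;
    unsigned int log_stride_sz;  /* striding mode: log2 of the stride size */
    unsigned int frag0_stride;   /* cyclic mode: stride of the first fragment */
    unsigned int frame0_sz;      /* cached at setup, read on the fast path */
};

/* Resolve the memory model once, so the per-packet path only reads frame0_sz. */
static inline void model_rq_setup_frame0_sz(struct model_rq *rq)
{
    if (rq->wq_type == MODEL_WQ_STRIDING) {
        assert(rq->log_stride_sz < 32);
        rq->frame0_sz = 1u << rq->log_stride_sz;
    } else {
        rq->frame0_sz = rq->frag0_stride;
    }
}
```

      A log_stride_sz of 11 or 12 gives 2048 or 4096 bytes, the two values
      quoted for striding mode.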
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Link: https://lore.kernel.org/bpf/158945348021.97035.12295039384250022883.stgit@firesoul
    • xdp: For Intel AF_XDP drivers add XDP frame_sz · 2a637c5b
      Jesper Dangaard Brouer authored
      Intel drivers implement native AF_XDP zerocopy in separate C-files,
      that have their own invocation of bpf_prog_run_xdp(). The setup of
      xdp_buff is also handled separately from the normal code path.
      
      This patch updates XDP frame_sz for AF_XDP zerocopy drivers i40e, ice
      and ixgbe, as the code changes needed are very similar.  Introduce a
      helper function xsk_umem_xdp_frame_sz() for calculating frame size.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/158945347511.97035.8536753731329475655.stgit@firesoul
    • ice: Add XDP frame size to driver · d4ecdbf7
      Jesper Dangaard Brouer authored
      This driver uses different memory models depending on PAGE_SIZE at
      compile time. For PAGE_SIZE 4K it uses page splitting, meaning that for
      a normal MTU the frame size is 2048 bytes (and headroom 192 bytes). For
      larger MTUs the driver still uses page splitting, by allocating
      order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
      4K, the driver instead advances its rx_buffer->page_offset by the frame
      size "truesize".
      
      For XDP frame size calculations, this means that in the PAGE_SIZE larger
      than 4K mode the frame_sz changes on a per-packet basis. For the page
      split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
      updated once outside the main NAPI loop.
      
      The default setting in the driver uses build_skb(), which provides
      the necessary headroom and tailroom for XDP-redirect in RX-frame
      (in both modes).
      
      There is one complication, which is legacy-rx mode (configurable via
      ethtool priv-flags). There is zero headroom in this mode, while headroom
      is a requirement for XDP-redirect to work. The conversion to xdp_frame
      (convert_to_xdp_frame) will detect this insufficient space, and
      xdp_do_redirect() call will fail. This is deemed acceptable, as it
      allows other XDP actions to still work in legacy-mode. In
      legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
      accept that xdp_adjust_tail shrink doesn't work.
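      
      The two memory models lead to two different ways of obtaining frame_sz.
      A rough sketch under stated assumptions: model_rx_frame_sz() is a
      hypothetical name, page splitting is taken to mean half of the
      allocated (order-0 or order-1) page, and the per-packet truesize is
      passed in rather than computed.

```c
#include <assert.h>

/* Sketch of the frame size choice described above.
 * page_size:    PAGE_SIZE the driver was compiled for
 * page_order:   0 for normal MTU, 1 when order-1 pages are allocated
 * pkt_truesize: per-packet truesize, only used when PAGE_SIZE > 4K */
static inline unsigned int model_rx_frame_sz(unsigned int page_size,
                                             unsigned int page_order,
                                             unsigned int pkt_truesize)
{
    assert(page_order <= 1);
    if (page_size <= 4096)
        return (page_size << page_order) / 2;  /* page split in two halves */
    return pkt_truesize;                       /* varies per packet */
}
```

      With 4K pages the result is constant per ring (2048 or 4096), which is
      why it can be set once outside the NAPI loop; with larger pages it must
      be refreshed for every packet.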
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Link: https://lore.kernel.org/bpf/158945347002.97035.328088795813704587.stgit@firesoul
    • i40e: Add XDP frame size to driver · 24104024
      Jesper Dangaard Brouer authored
      This driver uses different memory models depending on PAGE_SIZE at
      compile time. For PAGE_SIZE 4K it uses page splitting, meaning that for
      a normal MTU the frame size is 2048 bytes (and headroom 192 bytes). For
      larger MTUs the driver still uses page splitting, by allocating
      order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
      4K, the driver instead advances its rx_buffer->page_offset by the frame
      size "truesize".
      
      For XDP frame size calculations, this means that in the PAGE_SIZE larger
      than 4K mode the frame_sz changes on a per-packet basis. For the page
      split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
      updated once outside the main NAPI loop.
      
      The default setting in the driver uses build_skb(), which provides
      the necessary headroom and tailroom for XDP-redirect in RX-frame
      (in both modes).
      
      There is one complication, which is legacy-rx mode (configurable via
      ethtool priv-flags). There is zero headroom in this mode, while headroom
      is a requirement for XDP-redirect to work. The conversion to xdp_frame
      (convert_to_xdp_frame) will detect this insufficient space, and
      xdp_do_redirect() call will fail. This is deemed acceptable, as it
      allows other XDP actions to still work in legacy-mode. In
      legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
      accept that xdp_adjust_tail shrink doesn't work.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Link: https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul
    • ixgbevf: Add XDP frame size to VF driver · 81f3c628
      Jesper Dangaard Brouer authored
      This patch mirrors the changes to ixgbe in the previous patch.
      
      This VF driver doesn't support XDP_REDIRECT, but correct tailroom is
      still necessary for BPF-helper xdp_adjust_tail.  In legacy-mode +
      larger PAGE_SIZE, due to lacking tailroom, we accept that
      xdp_adjust_tail shrink doesn't work.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Link: https://lore.kernel.org/bpf/158945345984.97035.13518286183248025173.stgit@firesoul
    • ixgbe: Add XDP frame size to driver · cf025128
      Jesper Dangaard Brouer authored
      This driver uses different memory models depending on PAGE_SIZE at
      compile time. For PAGE_SIZE 4K it uses page splitting, meaning that for
      a normal MTU the frame size is 2048 bytes (and headroom 192 bytes). For
      larger MTUs the driver still uses page splitting, by allocating
      order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
      4K, the driver instead advances its rx_buffer->page_offset by the frame
      size "truesize".
      
      For XDP frame size calculations, this means that in the PAGE_SIZE larger
      than 4K mode the frame_sz changes on a per-packet basis. For the page
      split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
      updated once outside the main NAPI loop.
      
      The default setting in the driver uses build_skb(), which provides
      the necessary headroom and tailroom for XDP-redirect in RX-frame
      (in both modes).
      
      There is one complication, which is legacy-rx mode (configurable via
      ethtool priv-flags). There is zero headroom in this mode, while headroom
      is a requirement for XDP-redirect to work. The conversion to xdp_frame
      (convert_to_xdp_frame) will detect this insufficient space, and
      xdp_do_redirect() call will fail. This is deemed acceptable, as it
      allows other XDP actions to still work in legacy-mode. In
      legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
      accept that xdp_adjust_tail shrink doesn't work.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Link: https://lore.kernel.org/bpf/158945345455.97035.14334355929030628741.stgit@firesoul
    • ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K · 88eb0ee1
      Jesper Dangaard Brouer authored
      The ixgbe driver has another memory model when compiled on archs with
      PAGE_SIZE above 4096 bytes. In this mode it doesn't split the page in
      two halves, but instead increments rx_buffer->page_offset by the truesize of
      the packet (which includes headroom and tailroom for skb_shared_info).
      
      This is done correctly in ixgbe_build_skb(), but in ixgbe_rx_buffer_flip
      which is currently only called on XDP_TX and XDP_REDIRECT, it forgets
      to add the tailroom for skb_shared_info. This breaks XDP_REDIRECT, for
      veth and cpumap.  Fix by adding size of skb_shared_info tailroom.
      
      Maintainers notice: This fix has been queued to Jeff.
      
      Fixes: 64530739 ("ixgbe: add initial support for xdp redirect")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Link: https://lore.kernel.org/bpf/158945344946.97035.17031588499266605743.stgit@firesoul
    • virtio_net: Add XDP frame size in two code paths · 9ce6146e
      Jesper Dangaard Brouer authored
      The virtio_net driver is running inside the guest-OS. There are two
      XDP receive code-paths in virtio_net, namely receive_small() and
      receive_mergeable(). The receive_big() function does not support XDP.
      
      In receive_small() the frame size is available in buflen. The buffer
      backing these frames is allocated in add_recvbuf_small() with the same
      size, except for the headroom, but the tailroom has reserved room for
      skb_shared_info. The headroom is encoded in ctx pointer as a value.
      
      In receive_mergeable() the frame size is more dynamic. There are two
      basic cases: (1) the buffer size is based on an exponentially weighted
      moving average (see DECLARE_EWMA) of packet length, or (2) in case
      virtnet_get_headroom() returns any headroom, the buffer size is
      PAGE_SIZE. The ctx pointer is this time used for encoding two values:
      the buffer len "truesize" and headroom. In case (1), if the rx buffer
      size is underestimated, the packet will have been split over more
      buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed at the top of
      buffer area). If that happens, the XDP path does an xdp_linearize_page
      operation.
      
      V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.
      
      The code is really hard to follow, so some hints to reviewers.
      The receive_mergeable() case gets frames that were allocated in
      add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
      and 'buf' ptr is advanced by this headroom.  The headroom can only
      be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
      simple:
      
        static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
        {
      	return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
        }
      
      As frame_sz is an offset size from xdp.data_hard_start, reviewers
      should notice how this is calculated in receive_mergeable():
      
        int offset = buf - page_address(page);
        [...]
        data = page_address(xdp_page) + offset;
        xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
      
      The calculated offset will always be VIRTIO_XDP_HEADROOM when
      reaching this code.  Thus, xdp.data_hard_start will be the page-start
      address plus vi->hdr_len.  Given this, xdp.frame_sz needs to be
      reduced by vi->hdr_len.
      
      IMHO a followup patch should clean up this code to make it easier
      to maintain and understand, but it is outside the scope of this
      patchset.
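      
      The offset reasoning for receive_mergeable() can be checked with a few
      lines of arithmetic. This is a sketch with hypothetical function names,
      assuming the buffer spans one full page as in case (2); the header
      length of 12 used in the usage note is only an example value.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of xdp.data_hard_start from the page start, following
 *   data            = page + buf_off
 *   data_hard_start = data - xdp_headroom + hdr_len */
static inline size_t model_hard_start_off(size_t buf_off, size_t xdp_headroom,
                                          size_t hdr_len)
{
    assert(buf_off + hdr_len >= xdp_headroom);
    return buf_off - xdp_headroom + hdr_len;
}

/* frame_sz counts from data_hard_start, so it shrinks by hdr_len. */
static inline size_t model_mergeable_frame_sz(size_t buf_size, size_t hdr_len)
{
    assert(buf_size >= hdr_len);
    return buf_size - hdr_len;
}
```

      With buf_off equal to the XDP headroom and hdr_len of 12, the hard start
      lands 12 bytes into the page and frame_sz is 4084 for a 4096-byte page,
      so hard start plus frame_sz ends exactly at the page end.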
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul
    • vhost_net: Also populate XDP frame size · 05afee29
      Jesper Dangaard Brouer authored
      In vhost_net_build_xdp() the 'buf' that gets queued via an xdp_buff
      has an embedded struct tun_xdp_hdr (located at xdp->data_hard_start)
      which contains the buffer length 'buflen' (with tailroom for
      skb_shared_info). Also storing this buflen in xdp->frame_sz does not
      obsolete struct tun_xdp_hdr, as it also contains a struct
      virtio_net_hdr with other information.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/bpf/158945343928.97035.4620233649151726289.stgit@firesoul
    • tun: Add XDP frame size · fb3e6e93
      Jesper Dangaard Brouer authored
      The tun driver has two code paths for running XDP (bpf_prog_run_xdp).
      In both cases 'buflen' contains enough tailroom for skb_shared_info.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/bpf/158945343419.97035.9594485183958037621.stgit@firesoul
    • nfp: Add XDP frame size to netronome driver · fa6540b8
      Jesper Dangaard Brouer authored
      The netronome nfp driver uses PAGE_SIZE when xdp_prog is set, but
      xdp.data_hard_start begins at offset NFP_NET_RX_BUF_HEADROOM.
      Thus, adjust for this when setting xdp.frame_sz, as it counts
      from data_hard_start.
      
      When doing XDP_TX this driver is smart and instead of a full DMA-map
      does a DMA-sync with the packet length. As xdp_adjust_tail can now
      grow packet length, add checks to make sure that grow size is within
      the DMA-mapped size.
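      
      Both adjustments can be sketched as small helpers. The names below are
      hypothetical and the numbers in the usage note are arbitrary examples;
      the driver's actual headroom constant and DMA bookkeeping are not
      reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* frame_sz counts from data_hard_start, which sits rx_headroom bytes
 * into the page, so the headroom is subtracted from the page size. */
static inline size_t model_nfp_frame_sz(size_t page_size, size_t rx_headroom)
{
    assert(page_size >= rx_headroom);
    return page_size - rx_headroom;
}

/* Returns 1 if a packet of pkt_len bytes starting pkt_off bytes into the
 * DMA-mapped region still lies inside that region, else 0. A grown packet
 * that fails this check must not be handed to XDP_TX. */
static inline int model_tx_len_ok(size_t pkt_off, size_t pkt_len,
                                  size_t dma_mapped_len)
{
    return pkt_off <= dma_mapped_len && pkt_len <= dma_mapped_len - pkt_off;
}
```

      For example, with a 2048-byte mapped region and data starting at offset
      256, a packet may grow to at most 1792 bytes before the check rejects it.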
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/158945342911.97035.11214251236208648808.stgit@firesoul