1. 09 Aug, 2021 4 commits
    • bpf, devmap: Exclude XDP broadcast to master device · aeea1b86
      Jussi Maki authored
      If the ingress device is a bond slave, do not broadcast back through it or
      the bond master.
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-5-joamaki@gmail.com
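The exclusion rule in the devmap commit above can be modelled in a few lines of userspace C. This is a minimal sketch, not the kernel code: `struct model_dev` and both helper functions are hypothetical names invented for illustration, and the model assumes each device records the ifindex of its bond master (0 when it is not enslaved).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a net device; not a kernel structure. */
struct model_dev {
    int ifindex;        /* device index */
    int master_ifindex; /* ifindex of the bond master, 0 if not enslaved */
};

/* Return true if a broadcast must skip `dst` for a packet received on `ingress`. */
static bool model_skip_broadcast(const struct model_dev *ingress,
                                 const struct model_dev *dst)
{
    if (dst->ifindex == ingress->ifindex)
        return true;   /* never send back out of the ingress device */
    if (ingress->master_ifindex != 0 &&
        dst->ifindex == ingress->master_ifindex)
        return true;   /* nor through the ingress device's bond master */
    return false;
}

/* Count how many devices in `devs` would receive the broadcast. */
static size_t model_broadcast_targets(const struct model_dev *ingress,
                                      const struct model_dev *devs, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!model_skip_broadcast(ingress, &devs[i]))
            count++;
    return count;
}
```

In this model only the ingress slave and its master are excluded; other devices in the map, including sibling slaves, still receive the packet.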
    • net, bonding: Add XDP support to the bonding driver · 9e2ee5c7
      Jussi Maki authored
      XDP is implemented in the bonding driver by transparently delegating
      the XDP program loading, removal and xmit operations to the bonding
      slave devices. The overall goal of this work is that XDP programs
      can be attached to a bond device *without* any further changes (or
      awareness) necessary to the program itself, meaning the same XDP
      program can be attached to a native device but also a bonding device.
      
      Semantics of XDP_TX when attached to a bond are equivalent in such
      setting to the case when a tc/BPF program would be attached to the
      bond, meaning transmitting the packet out of the bond itself using one
      of the bond's configured xmit methods to select a slave device (rather
      than XDP_TX on the slave itself). Handling of XDP_TX to transmit
      using the configured bonding mechanism is therefore implemented by
      rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
      performance impact this check is guarded by a static key, which is
      incremented when an XDP program is loaded onto a bond device. This
      approach was chosen to avoid changes to drivers implementing XDP. If
      the slave device does not match the receive device, then XDP_REDIRECT
      is transparently used to perform the redirection in order to have
      the network driver release the packet from its RX ring. The bonding
      driver hashing functions have been refactored to allow reuse with
      xdp_buff's to avoid code duplication.
      
      The motivation for this change is to enable use of bonding (and
      802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
      XDP and also to transparently support bond devices for projects that
      use XDP given most modern NICs have dual port adapters. An alternative
      to this approach would be to implement 802.3ad in user-space and
      implement the bonding load-balancing in the XDP program itself, but
      that is rather a cumbersome endeavor in terms of slave device management
      (e.g. by watching netlink) and requires separate programs for native
      vs bond cases for the orchestrator. A native in-kernel implementation
      overcomes these issues and provides more flexibility.
      
      Below are benchmark results done on two machines with 100Gbit
      Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
      16-core 3950X on receiving machine. 64 byte packets were sent with
      pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
      ice driver, so the tests were performed with iommu=off and patch [2]
      applied. Additionally the bonding round robin algorithm was modified
      to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
      of cache misses were caused by the shared rr_tx_counter (see patch
      2/3). The statistics were collected using "sar -n dev -u 1 10". On top
      of that, for ice, further work is in progress on improving the XDP_TX
      numbers [4].
      
       -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
       without patch (1 dev):
         XDP_DROP:              3.15%      48.6Mpps
         XDP_TX:                3.12%      18.3Mpps     18.3Mpps
         XDP_DROP (RSS):        9.47%      116.5Mpps
         XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
       -----------------------
       with patch, bond (1 dev):
         XDP_DROP:              3.14%      46.7Mpps
         XDP_TX:                3.15%      13.9Mpps     13.9Mpps
         XDP_DROP (RSS):        10.33%     117.2Mpps
         XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
       -----------------------
       with patch, bond (2 devs):
         XDP_DROP:              6.27%      92.7Mpps
         XDP_TX:                6.26%      17.6Mpps     17.5Mpps
         XDP_DROP (RSS):       11.38%      117.2Mpps
         XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
       --------------------------------------------------------------
      
      RSS: Receive Side Scaling, i.e. the packets were sent to a range of
      destination IPs.
      
        [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
        [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
        [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
        [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t

      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
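The verdict rewriting described in the bonding commit above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: every `model_` name is hypothetical, the static key is approximated by a plain counter, and the bond's xmit policy is reduced to a precomputed `xmit_ifindex`. The enum values follow the ordering of the kernel's `enum xdp_action`.

```c
#include <stdbool.h>

/* XDP verdicts, in the same order as the kernel's uapi enum xdp_action. */
enum model_xdp_action {
    MODEL_XDP_ABORTED = 0,
    MODEL_XDP_DROP,
    MODEL_XDP_PASS,
    MODEL_XDP_TX,
    MODEL_XDP_REDIRECT,
};

/* Stand-in for the static key: number of bonds with an XDP program loaded. */
static int model_bond_xdp_users;

struct model_rx_ctx {
    int  rx_ifindex;   /* device the packet arrived on */
    bool rx_is_slave;  /* true if that device is a bond slave */
    int  xmit_ifindex; /* slave chosen by the bond's xmit policy (assumed given) */
    int  redirect_to;  /* set when the verdict becomes a redirect */
};

/* Post-process the program's verdict along the lines of the commit message. */
static enum model_xdp_action model_fixup_verdict(enum model_xdp_action act,
                                                 struct model_rx_ctx *ctx)
{
    if (model_bond_xdp_users == 0)
        return act;                 /* fast path: no bond uses XDP */
    if (act != MODEL_XDP_TX || !ctx->rx_is_slave)
        return act;                 /* only XDP_TX on a bond slave is rewritten */
    if (ctx->xmit_ifindex == ctx->rx_ifindex)
        return MODEL_XDP_TX;        /* chosen slave is the receive device */
    ctx->redirect_to = ctx->xmit_ifindex;
    return MODEL_XDP_REDIRECT;      /* lets the rx driver release the buffer */
}
```

The counter check comes first so that systems without XDP on a bond pay almost nothing, which is the role the commit message assigns to the static key.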
    • net, core: Add support for XDP redirection to slave device · 879af96f
      Jussi Maki authored
      This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
      into XDP_REDIRECT after BPF program run when the ingress device
      is a bond slave.
      
      The dev_xdp_prog_count is exposed so that slave devices can be checked
      for loaded XDP programs in order to avoid the situation where both
      bond master and slave have programs loaded according to xdp_state.
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
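The purpose of exposing a per-device program count, as the commit above describes, can be sketched as a simple admission check. The structure and function below are hypothetical userspace stand-ins, and the policy shown (refuse to attach to the bond while any slave already has a program) is an assumption drawn from the commit message rather than a copy of the kernel logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a bond slave and its count of attached XDP programs. */
struct model_slave {
    int ifindex;
    int xdp_prog_count;
};

/* A bond may take an XDP program only if none of its slaves already has one. */
static bool model_bond_may_attach(const struct model_slave *slaves, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (slaves[i].xdp_prog_count > 0)
            return false;
    return true;
}
```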
    • net, bonding: Refactor bond_xmit_hash for use with xdp_buff · a815bde5
      Jussi Maki authored
      In preparation for adding XDP support to the bonding driver
      refactor the packet hashing functions to be able to work with
      any linear data buffer without an skb.
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com
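The idea behind the refactor above, hashing from a linear data buffer rather than from an skb, can be shown with a deliberately simplified layer 2 hash. This is an illustrative sketch only: the `model_` names and constants are hypothetical, and the hash (XOR of the last byte of the destination and source MAC addresses) is a reduced stand-in for the driver's real hash policies.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_ALEN 6   /* bytes in a MAC address */
#define MODEL_ETH_HLEN 14  /* bytes in an Ethernet header */

/* Hash a frame given only a pointer to linear data and its length. */
static uint32_t model_l2_hash(const uint8_t *data, size_t len)
{
    if (data == NULL || len < MODEL_ETH_HLEN)
        return 0;  /* too short to hold an Ethernet header */

    uint8_t dst_last = data[MODEL_ETH_ALEN - 1];      /* last byte of dst MAC */
    uint8_t src_last = data[2 * MODEL_ETH_ALEN - 1];  /* last byte of src MAC */

    return (uint32_t)(dst_last ^ src_last);
}

/* Map the hash onto one of `slave_count` slaves. */
static unsigned model_pick_slave(const uint8_t *data, size_t len,
                                 unsigned slave_count)
{
    return slave_count ? model_l2_hash(data, len) % slave_count : 0;
}
```

Because the function takes only a pointer and a length, the same code could serve a caller holding an skb's linear data or an xdp_buff's packet data, which is the kind of reuse the commit aims for.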
  2. 06 Aug, 2021 5 commits
  3. 04 Aug, 2021 3 commits
  4. 03 Aug, 2021 1 commit
  5. 02 Aug, 2021 22 commits
  6. 31 Jul, 2021 2 commits
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · d39e8b92
      Jakub Kicinski authored
      Andrii Nakryiko says:
      
      ====================
      bpf-next 2021-07-30
      
      We've added 64 non-merge commits during the last 15 day(s) which contain
      a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).
      
      The main changes are:
      
      1) BTF-guided binary data dumping libbpf API, from Alan.
      
      2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.
      
      3) Ambient BPF run context and cgroup storage cleanup, from Andrii.
      
      4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.
      
      5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.
      
      6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.
      
      7) BPF map pinning improvements in libbpf, from Martynas.
      
      8) Improved module BTF support in libbpf and bpftool, from Quentin.
      
      9) Bpftool cleanups and documentation improvements, from Quentin.
      
      10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.
      
      11) Increased maximum cgroup storage size, from Stanislav.
      
      12) Small fixes and improvements to BPF tests and samples, from various folks.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
        tools: bpftool: Complete metrics list in "bpftool prog profile" doc
        tools: bpftool: Document and add bash completion for -L, -B options
        selftests/bpf: Update bpftool's consistency script for checking options
        tools: bpftool: Update and synchronise option list in doc and help msg
        tools: bpftool: Complete and synchronise attach or map types
        selftests/bpf: Check consistency between bpftool source, doc, completion
        tools: bpftool: Slightly ease bash completion updates
        unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
        libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
        tools: bpftool: Support dumping split BTF by id
        libbpf: Add split BTF support for btf__load_from_kernel_by_id()
        tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
        tools: Free BTF objects at various locations
        libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
        libbpf: Rename btf__load() as btf__load_into_kernel()
        libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
        bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
        tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
        bpf: Increase supported cgroup storage value size
        libbpf: Fix race when pinning maps in parallel
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · d2e11fd2
      Jakub Kicinski authored
      Conflicting commits, all resolutions pretty trivial:
      
      drivers/bus/mhi/pci_generic.c
        5c2c8531 ("bus: mhi: pci-generic: configurable network interface MRU")
        56f6f4c4 ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean")
      
      drivers/nfc/s3fwrn5/firmware.c
        a0302ff5 ("nfc: s3fwrn5: remove unnecessary label")
        46573e3a ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")
        801e541c ("nfc: s3fwrn5: fix undefined parameter values in dev_err()")
      
      MAINTAINERS
        7d901a1e ("net: phy: add Maxlinear GPY115/21x/24x driver")
        8a7b46fa ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 30 Jul, 2021 3 commits
    • Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c7d10223
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
        (mac80211) and netfilter trees.
      
        Current release - regressions:
      
         - mac80211: fix starting aggregation sessions on mesh interfaces
      
        Current release - new code bugs:
      
         - sctp: send pmtu probe only if packet loss in Search Complete state
      
         - bnxt_en: add missing periodic PHC overflow check
      
         - devlink: fix phys_port_name of virtual port and merge error
      
         - hns3: change the method of obtaining default ptp cycle
      
         - can: mcba_usb_start(): add missing urb->transfer_dma initialization
      
        Previous releases - regressions:
      
         - set true network header for ECN decapsulation
      
         - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
           LRO
      
         - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
           PHY
      
         - sctp: fix return value check in __sctp_rcv_asconf_lookup
      
        Previous releases - always broken:
      
         - bpf:
             - more spectre corner case fixes, introduce a BPF nospec
               instruction for mitigating Spectre v4
             - fix OOB read when printing XDP link fdinfo
             - sockmap: fix cleanup related races
      
         - mac80211: fix enabling 4-address mode on a sta vif after assoc
      
         - can:
             - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
             - j1939: j1939_session_deactivate(): clarify lifetime of session
               object, avoid UAF
             - fix number of identical memory leaks in USB drivers
      
         - tipc:
             - do not blindly write skb_shinfo frags when doing decryption
             - fix sleeping in tipc accept routine"
      
      * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
        gve: Update MAINTAINERS list
        can: esd_usb2: fix memory leak
        can: ems_usb: fix memory leak
        can: usb_8dev: fix memory leak
        can: mcba_usb_start(): add missing urb->transfer_dma initialization
        can: hi311x: fix a signedness bug in hi3110_cmd()
        MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
        bpf: Fix leakage due to insufficient speculative store bypass mitigation
        bpf: Introduce BPF nospec instruction for mitigating Spectre v4
        sis900: Fix missing pci_disable_device() in probe and remove
        net: let flow have same hash in two directions
        nfc: nfcsim: fix use after free during module unload
        tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
        sctp: fix return value check in __sctp_rcv_asconf_lookup
        nfc: s3fwrn5: fix undefined parameter values in dev_err()
        net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
        net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
        net/mlx5: Unload device upon firmware fatal error
        net/mlx5e: Fix page allocation failure for ptp-RQ over SF
        net/mlx5e: Fix page allocation failure for trap-RQ over SF
        ...
    • Merge tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · e1dab4c0
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These revert a recent IRQ resources handling modification that turned
        out to be problematic, fix suspend-to-idle handling on AMD platforms
        to take upcoming systems into account properly and fix the retrieval
        of the DPTF attributes of the PCH FIVR.
      
        Specifics:
      
         - Revert recent change of the ACPI IRQ resources handling that
           attempted to improve the ACPI IRQ override selection logic, but
           introduced serious regressions on some systems (Hui Wang).
      
         - Fix up quirks for AMD platforms in the suspend-to-idle support code
           so as to take upcoming systems using uPEP HID AMDI007 into account
           as appropriate (Mario Limonciello).
      
         - Fix the code retrieving DPTF attributes of the PCH FIVR so that it
           agrees on the return data type with the ACPI control method
           evaluated for this purpose (Srinivas Pandruvada)"
      
      * tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: DPTF: Fix reading of attributes
        Revert "ACPI: resources: Add checks for ACPI IRQ override"
        ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
    • pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds authored
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the EPOLL
      interface, and expected "edge triggered" to be "any new write will
      trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4] by
      Michael Kerrisk about a test for this epoll behavior - but at that point
      we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
      Reported-by: Sandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
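The difference between the two write-side wakeup policies discussed in the pipe commit above can be made concrete with a toy model. This is a userspace sketch with hypothetical names, not the kernel's pipe code: it counts one "wakeup" per write that would notify a reader, and it treats each write as filling exactly one buffer slot.

```c
#include <stdbool.h>

/* Write-side wakeup policies compared in the commit message. */
enum model_wake_policy {
    MODEL_WAKE_IF_EMPTY, /* wake readers only if the pipe was empty */
    MODEL_WAKE_ALWAYS,   /* wake readers on every write */
};

/* Toy pipe: counts used slots and how often readers were woken. */
struct model_pipe {
    unsigned used;
    unsigned capacity;
    unsigned reader_wakeups;
    enum model_wake_policy policy;
};

/* Write one slot; returns false if the pipe is full. */
static bool model_pipe_write(struct model_pipe *p)
{
    if (p->used == p->capacity)
        return false;

    bool was_empty = (p->used == 0);

    p->used++;
    if (was_empty || p->policy == MODEL_WAKE_ALWAYS)
        p->reader_wakeups++;
    return true;
}
```

A consumer that never drains the pipe, as the quoted report says of realm-core, sees a single wakeup under `MODEL_WAKE_IF_EMPTY` no matter how many writes follow, but one wakeup per write under `MODEL_WAKE_ALWAYS`.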