1. 25 Jan, 2024 33 commits
    • Merge branch 'selftests-net-a-few-fixes' · ce36ea75
      Jakub Kicinski authored
      Paolo Abeni says:
      
      ====================
      selftests: net: a few fixes
      
      This series addresses self-test failures for UDP GRO-related tests.
      
      The first patch addresses the main problem I observe locally - the XDP
      program required by such tests, xdp_dummy, is currently built in the
      ebpf self-tests directory and is not available if/when the user targets
      net only. Arguably it is more of a refactor than a fix, but still targeting net
      to hopefully
      
      The second patch fixes the integration of such tests with the build
      system.
      
      Patch 3/3 fixes sporadic failures due to races.
      
      Tested with:
      
      make -C tools/testing/selftests/ TARGETS=net install
      ./tools/testing/selftests/kselftest_install/run_kselftest.sh \
      	-t "net:udpgro_bench.sh net:udpgro.sh net:udpgro_fwd.sh \
      	    net:udpgro_frglist.sh net:veth.sh"
      
      no failures.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1706131762.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: explicitly wait for listener ready · 4acffb66
      Paolo Abeni authored
      The UDP GRO forwarding test still hard-codes an arbitrary pause
      to wait for the UDP listener to become ready in the background.

      That causes sporadic failures depending on the host load.

      Replace the sleep with the existing helper, which waits for the desired
      port to be exposed.
      
      Fixes: a062260a ("selftests: net: add UDP GRO forwarding self-tests")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/4d58900fb09cef42749cfcf2ad7f4b91a97d225c.1706131762.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
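The commit above trades a fixed pause for a readiness check. The shell helper it refers to is not shown in this log, so what follows is only a minimal C sketch of the general idea: poll a condition at short intervals until it holds or a deadline passes. The names `wait_until`, `ready_fn` and `ready_after_three_checks`, and the millisecond parameters, are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical readiness callback: returns true once the listener is up. */
typedef bool (*ready_fn)(void *ctx);

/*
 * Poll `ready` every `interval_ms` until it returns true or roughly
 * `timeout_ms` have been spent sleeping. Returns true if the condition
 * was met, false on timeout.
 */
bool wait_until(ready_fn ready, void *ctx, int timeout_ms, int interval_ms)
{
	struct timespec pause = {
		.tv_sec = interval_ms / 1000,
		.tv_nsec = (long)(interval_ms % 1000) * 1000000L,
	};
	int waited_ms = 0;

	assert(ready != NULL && interval_ms > 0);

	for (;;) {
		if (ready(ctx))
			return true;
		if (waited_ms >= timeout_ms)
			return false;
		nanosleep(&pause, NULL);
		waited_ms += interval_ms;
	}
}

/* Example condition for demonstration: becomes true on the third check. */
bool ready_after_three_checks(void *ctx)
{
	int *checks = ctx;

	return ++(*checks) >= 3;
}
```

Unlike a fixed sleep, the wait ends as soon as the condition holds, and a loaded host only lengthens the wait up to the timeout rather than causing a spurious failure.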
    • selftests: net: included needed helper in the install targets · f5173fe3
      Paolo Abeni authored
      The blamed commit below introduces a dependency in some net self-tests
      on a newly introduced helper script.

      That script is currently not included in the TEST_PROGS_EXTENDED list
      and thus is not installed, causing failures for the relevant tests when
      executed from the install dir.

      Fix the issue by updating the install targets.
      
      Fixes: 3bdd9fd2 ("selftests/net: synchronize udpgro tests' tx and rx connection")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/076e8758e21ff2061cc9f81640e7858df775f0a9.1706131762.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: remove dependency on ebpf tests · 98cb12eb
      Paolo Abeni authored
      Several net tests require an XDP program built under the ebpf
      directory, and error out if that program is not available.

      That makes running the net tests successfully hard. Let's duplicate the
      [very small] program into the net dir, re-using the existing rules to
      build it, and finally drop the bogus dependency.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/28e7af7c031557f691dc8045ee41dd549dd5e74c.1706131762.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: tcp_ao: add a config file · b6478784
      Jakub Kicinski authored
      Still a bit unclear whether each directory should have its own
      config file, but assuming they should, let's add one for tcp_ao.
      
      The following tests still fail with this config in place:
       - rst_ipv4,
       - rst_ipv6,
       - bench-lookups_ipv6.
      The other 21 pass.
      
      Fixes: d11301f6 ("selftests/net: Add TCP-AO ICMPs accept test")
      Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Link: https://lore.kernel.org/r/20240124192550.1865743-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · ecb1b828
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bpf, netfilter and WiFi.
      
        Jakub is doing a lot of work to include the self-tests in our CI, as a
        result a significant amount of self-tests related fixes is flowing in
        (and will likely continue in the next few weeks).
      
        Current release - regressions:
      
         - bpf: fix a kernel crash for the riscv 64 JIT
      
         - bnxt_en: fix memory leak in bnxt_hwrm_get_rings()
      
         - revert "net: macsec: use skb_ensure_writable_head_tail to expand
           the skb"
      
        Previous releases - regressions:
      
         - core: fix removing a namespace with conflicting altnames
      
         - tc/flower: fix chain template offload memory leak
      
         - tcp:
            - make sure init the accept_queue's spinlocks once
            - fix autocork on CPUs with weak memory model
      
         - udp: fix busy polling
      
         - mlx5e:
            - fix out-of-bound read in port timestamping
            - fix peer flow lists corruption
      
         - iwlwifi: fix a memory corruption
      
        Previous releases - always broken:
      
         - netfilter:
            - nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress
              basechain
            - nft_limit: reject configurations that cause integer overflow
      
         - bpf: fix bpf_xdp_adjust_tail() with XSK zero-copy mbuf, avoiding a
           NULL pointer dereference upon shrinking
      
         - llc: make llc_ui_sendmsg() more robust against bonding changes
      
         - smc: fix illegal rmb_desc access in SMC-D connection dump
      
         - dpll: fix pin dump crash for rebound module
      
         - bnxt_en: fix possible crash after creating sw mqprio TCs
      
         - hv_netvsc: calculate correct ring size when PAGE_SIZE is not 4kB
      
        Misc:
      
         - several self-tests fixes for better integration with the netdev CI
      
         - added several missing modules descriptions"
      
      * tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
        tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring
        tsnep: Remove FCS for XDP data path
        net: fec: fix the unhandled context fault from smmu
        selftests: bonding: do not test arp/ns target with mode balance-alb/tlb
        fjes: fix memleaks in fjes_hw_setup
        i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
        i40e: set xdp_rxq_info::frag_size
        xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
        ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
        intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
        ice: remove redundant xdp_rxq_info registration
        i40e: handle multi-buffer packets that are shrunk by xdp prog
        ice: work on pre-XDP prog frag count
        xsk: fix usage of multi-buffer BPF helpers for ZC XDP
        xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
        xsk: recycle buffer in case Rx queue was full
        net: fill in MODULE_DESCRIPTION()s for rvu_mbox
        net: fill in MODULE_DESCRIPTION()s for litex
        net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio
        net: fill in MODULE_DESCRIPTION()s for fec
        ...
    • Merge tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs · bdc01020
      Linus Torvalds authored
      Pull overlayfs fix from Amir Goldstein:
       "Change the on-disk format for the new "xwhiteouts" feature introduced
        in v6.7
      
        The change reduces unneeded overhead of an extra getxattr per readdir.
        The only user of the "xwhiteout" feature is the external composefs
        tool, which has been updated to support the new on-disk format.
      
        This change is also designated for 6.7.y"
      
      * tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
        ovl: mark xwhiteouts directory with overlay.opaque='x'
    • Merge tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · a658e0e9
      Linus Torvalds authored
      Pull netfs fixes from Christian Brauner:
       "This contains various fixes for the netfs work merged earlier this
        cycle:
      
        afs:
         - Fix locking imbalance in afs_proc_addr_prefs_show()
         - Remove afs_dynroot_d_revalidate() which is redundant
         - Fix error handling during lookup
         - Hide sillyrenames from userspace. This fixes a race between
           silly-rename files being created/removed and userspace iterating
           over directory entries
         - Don't use unnecessary folio_*() functions
      
        cifs:
         - Don't use unnecessary folio_*() functions
      
        cachefiles:
         - erofs: Fix Null dereference when cachefiles are not doing
           ondemand-mode
         - Update mailing list
      
        netfs library:
         - Add Jeff Layton as reviewer
         - Update mailing list
         - Fix an error check in netfs_perform_write()
         - fscache: Check error before dereferencing
         - Don't use unnecessary folio_*() functions"
      
      * tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        afs: Fix missing/incorrect unlocking of RCU read lock
        afs: Remove afs_dynroot_d_revalidate() as it is redundant
        afs: Fix error handling with lookup via FS.InlineBulkStatus
        afs: Hide silly-rename files from userspace
        cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode
        netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write()
        netfs, fscache: Prevent Oops in fscache_put_cache()
        cifs: Don't use certain unnecessary folio_*() functions
        afs: Don't use certain unnecessary folio_*() functions
        netfs: Don't use certain unnecessary folio_*() functions
        netfs: Add Jeff Layton as reviewer
        netfs, cachefiles: Change mailing list
    • Merge tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · b9fa4cbd
      Linus Torvalds authored
      Pull nfsd fixes from Chuck Lever:
      
       - Fix in-kernel RPC UDP transport
      
       - Fix NFSv4.0 RELEASE_LOCKOWNER
      
      * tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        nfsd: fix RELEASE_LOCKOWNER
        SUNRPC: use request size to initialize bio_vec in svc_udp_sendto()
    • Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux · 3cb9871f
      Linus Torvalds authored
      Pull RCU fix from Neeraj Upadhyay:
       "This fixes RCU grace period stalls, which are observed when an
        outgoing CPU's quiescent state reporting results in wakeup of one of
        the grace period kthreads, to complete the grace period.
      
        If those kthreads have SCHED_FIFO policy, the wake up can indirectly
        arm the RT bandwidth timer to the local offline CPU.
      
        Earlier migration of the hrtimers from the CPU introduced in commit
        5c0930cc ("hrtimers: Push pending hrtimers away from outgoing CPU
        earlier") results in this timer getting ignored.
      
        If the RCU grace period kthreads are waiting for RT bandwidth to be
        available, they may never be actually scheduled, resulting in RCU
        stall warnings"
      
      * tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
        rcu: Defer RCU kthreads wakeup when CPU is dying
    • Merge branch 'tsnep-xdp-fixes' · 0a5bd0ff
      Paolo Abeni authored
      Gerhard Engleder says:
      
      ====================
      tsnep: XDP fixes
      
      Found two driver specific problems during XDP and XSK testing.
      ====================
      
      Link: https://lore.kernel.org/r/20240123200918.61219-1-gerhard@engleder-embedded.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring · 9a91c05f
      Gerhard Engleder authored
      The fill ring of the XDP socket may not contain enough buffers to
      completely fill the RX queue during socket creation. In this case the
      flag XDP_RING_NEED_WAKEUP is not set as this flag is only set if the RX
      queue is not completely filled during polling.
      
      Set XDP_RING_NEED_WAKEUP flag also if RX queue is not completely filled
      during XDP socket creation.
      
      Fixes: 3fc23339 ("tsnep: Add XDP socket zero-copy RX support")
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • tsnep: Remove FCS for XDP data path · 50bad6f7
      Gerhard Engleder authored
      The RX data buffer includes the FCS. The FCS is already stripped for the
      normal data path. But for the XDP data path the FCS is included and
      acts like additional/useless data.
      
      Remove the FCS from the RX data buffer also for XDP.
      
      Fixes: 65b28c81 ("tsnep: Add XDP RX support")
      Fixes: 3fc23339 ("tsnep: Add XDP socket zero-copy RX support")
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
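The length adjustment described in the commit above amounts to dropping the trailing frame check sequence before the frame is handed to XDP. The driver code is not part of this log, so the following is only a sketch under that assumption; `xdp_frame_len` and `FCS_LEN` are hypothetical names, with `FCS_LEN` mirroring the 4-byte Ethernet FCS (`ETH_FCS_LEN` in the kernel headers).

```c
#include <assert.h>
#include <stddef.h>

/* Ethernet frame check sequence length in bytes. */
#define FCS_LEN 4

/*
 * Given the number of bytes present in the RX data buffer (frame plus
 * trailing FCS), return the length that should be exposed to the XDP
 * program. Hypothetical helper, not the driver's actual code.
 */
size_t xdp_frame_len(size_t rx_len)
{
	assert(rx_len >= FCS_LEN);
	return rx_len - FCS_LEN;
}
```

Without the subtraction, an XDP program would see four extra bytes at the end of every frame, which is the "additional/useless data" the commit message mentions.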
    • Merge tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 5da45971
      Paolo Abeni authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2024-01-24
      
      This series provides bug fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      
      * tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
        net/mlx5e: fix a potential double-free in fs_any_create_groups
        net/mlx5e: fix a double-free in arfs_create_groups
        net/mlx5e: Ignore IPsec replay window values on sender side
        net/mlx5e: Allow software parsing when IPsec crypto is enabled
        net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO
        net/mlx5: DR, Can't go to uplink vport on RX rule
        net/mlx5: DR, Use the right GVMI number for drop action
        net/mlx5: Bridge, fix multicast packets sent to uplink
        net/mlx5: Fix a WARN upon a callback command failure
        net/mlx5e: Fix peer flow lists handling
        net/mlx5e: Fix inconsistent hairpin RQT sizes
        net/mlx5e: Fix operation precedence bug in port timestamping napi_poll context
        net/mlx5: Fix query of sd_group field
        net/mlx5e: Use the correct lag ports number when creating TISes
      ====================
      
      Link: https://lore.kernel.org/r/20240124081855.115410-1-saeed@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fdf8e6d1
      Paolo Abeni authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2024-01-25
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 12 non-merge commits during the last 2 day(s) which contain
      a total of 13 files changed, 190 insertions(+), 91 deletions(-).
      
      The main changes are:
      
      1) Fix bpf_xdp_adjust_tail() in context of XSK zero-copy drivers which
         support XDP multi-buffer. The former triggered a NULL pointer
         dereference upon shrinking, from Maciej Fijalkowski & Tirthendu Sarkar.
      
      2) Fix a bug in riscv64 BPF JIT which emitted a wrong prologue and
         epilogue for struct_ops programs, from Pu Lehui.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
        i40e: set xdp_rxq_info::frag_size
        xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
        ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
        intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
        ice: remove redundant xdp_rxq_info registration
        i40e: handle multi-buffer packets that are shrunk by xdp prog
        ice: work on pre-XDP prog frag count
        xsk: fix usage of multi-buffer BPF helpers for ZC XDP
        xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
        xsk: recycle buffer in case Rx queue was full
        riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops
      ====================
      
      Link: https://lore.kernel.org/r/20240125084416.10876-1-daniel@iogearbox.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: fec: fix the unhandled context fault from smmu · 5e344807
      Shenwei Wang authored
      When repeatedly changing the interface link speed using the command below:
      
      ethtool -s eth0 speed 100 duplex full
      ethtool -s eth0 speed 1000 duplex full
      
      The following errors may sometimes be reported by the ARM SMMU driver:
      
      [ 5395.035364] fec 5b040000.ethernet eth0: Link is Down
      [ 5395.039255] arm-smmu 51400000.iommu: Unhandled context fault:
      fsr=0x402, iova=0x00000000, fsynr=0x100001, cbfrsynra=0x852, cb=2
      [ 5398.108460] fec 5b040000.ethernet eth0: Link is Up - 100Mbps/Full -
      flow control off
      
      It is identified that the FEC driver does not properly stop the TX queue
      during the link speed transitions, and this results in the invalid virtual
      I/O address translations from the SMMU and causes the context faults.
      
      Fixes: dbc64a8e ("net: fec: move calls to quiesce/resume packet processing out of fec_restart()")
      Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
      Link: https://lore.kernel.org/r/20240123165141.2008104-1-shenwei.wang@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests: bonding: do not test arp/ns target with mode balance-alb/tlb · a2933a87
      Hangbin Liu authored
      The prio_arp/ns tests hard code the mode to active-backup. At the same
      time, the balance-alb/tlb modes do not support arp/ns target. So remove
      the prio_arp/ns tests from the loop and only test active-backup mode.
      
      Fixes: 481b56e0 ("selftests: bonding: re-format bond option tests")
      Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Closes: https://lore.kernel.org/netdev/17415.1705965957@famine/
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Link: https://lore.kernel.org/r/20240123075917.1576360-1-liuhangbin@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · a717932d
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Update nf_tables kdoc to keep it in sync with the code, from George Guo.
      
      2) Handle NETDEV_UNREGISTER event for inet/ingress basechain.
      
      3) Reject configurations that cause nft_limit to overflow,
         from Florian Westphal.
      
      4) Restrict anonymous set/map names to 16 bytes, from Florian Westphal.
      
      5) Disallow encoding queue number and error in verdicts. This reverts
         a patch which seems to have introduced an early attempt at support for
         nfqueue maps, which is these days supported via nft_queue expression.
      
      6) Sanitize family via .validate for expressions that explicitly refer
         to NF_INET_* hooks.
      
      * tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: validate NFPROTO_* family
        netfilter: nf_tables: reject QUEUE/DROP verdict parameters
        netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
        netfilter: nft_limit: reject configurations that cause integer overflow
        netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain
        netfilter: nf_tables: cleanup documentation
      ====================
      
      Link: https://lore.kernel.org/r/20240124191248.75463-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • fjes: fix memleaks in fjes_hw_setup · f6cc4b6a
      Zhipeng Lu authored
      fjes_hw_setup allocates several memory regions and delays their
      deallocation to the fjes_hw_exit in fjes_probe through the following call chain:
      
      fjes_probe
        |-> fjes_hw_init
              |-> fjes_hw_setup
        |-> fjes_hw_exit
      
      However, when fjes_hw_setup fails, fjes_hw_exit won't be called and thus
      all the resources allocated in fjes_hw_setup will be leaked. In this
      patch, we free those resources in fjes_hw_setup to prevent such leaks.
      
      Fixes: 2fcbca68 ("fjes: platform_driver's .probe and .remove routine")
      Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240122172445.3841883-1-alexious@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
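The leak pattern in the commit above, where a setup routine fails midway and its teardown counterpart is never reached, is commonly handled by unwinding inside the setup routine itself. The fjes patch is not reproduced in this log; the following is a generic C sketch of that idiom. `struct hw_state`, `hw_setup`, `hw_exit` and the `fail_at` test hook are hypothetical and do not reflect the driver's real structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-device state; field names are illustrative. */
struct hw_state {
	void *region_a;
	void *region_b;
	void *region_c;
};

/*
 * Allocate three regions. If any allocation fails, free the ones that
 * already succeeded before returning, so the caller has nothing to
 * clean up on error. `fail_at` (1..3) forces a failure for testing;
 * 0 means no forced failure.
 */
int hw_setup(struct hw_state *hw, int fail_at)
{
	assert(hw != NULL);

	hw->region_a = (fail_at == 1) ? NULL : malloc(64);
	if (!hw->region_a)
		goto err_out;

	hw->region_b = (fail_at == 2) ? NULL : malloc(64);
	if (!hw->region_b)
		goto err_free_a;

	hw->region_c = (fail_at == 3) ? NULL : malloc(64);
	if (!hw->region_c)
		goto err_free_b;

	return 0;

err_free_b:
	free(hw->region_b);
	hw->region_b = NULL;
err_free_a:
	free(hw->region_a);
	hw->region_a = NULL;
err_out:
	return -1;
}

/* Counterpart used on the normal teardown path. */
void hw_exit(struct hw_state *hw)
{
	free(hw->region_c);
	free(hw->region_b);
	free(hw->region_a);
	hw->region_a = hw->region_b = hw->region_c = NULL;
}
```

With this shape, a caller that skips `hw_exit()` after a failed `hw_setup()` leaks nothing, because every partial allocation has already been released in reverse order.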
    • Merge tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client · 6098d87e
      Linus Torvalds authored
      Pull ceph fixes from Ilya Dryomov:
       "A fix to avoid triggering an assert in some cases where RBD exclusive
        mappings are involved and a deprecated API cleanup"
      
      * tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client:
        rbd: don't move requests to the running list on errors
        rbd: remove usage of the deprecated ida_simple_*() API
    • Merge tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity · f22face1
      Linus Torvalds authored
      Pull integrity fix from Mimi Zohar:
       "Revert patch that required user-provided key data, since keys can be
        created from kernel-generated random numbers"
      
      * tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
        Revert "KEYS: encrypted: Add check for strsep"
    • Merge branch 'net-bpf_xdp_adjust_tail-and-intel-mbuf-fixes' · 9d71bc83
      Alexei Starovoitov authored
      Maciej Fijalkowski says:
      
      ====================
      net: bpf_xdp_adjust_tail() and Intel mbuf fixes
      
      Hey,
      
      after a break followed by dealing with sickness, here is a v6 that makes
      bpf_xdp_adjust_tail() actually usable for ZC drivers that support XDP
      multi-buffer. Since v4 I tried also using bpf_xdp_adjust_tail() with
      positive offset, which exposed yet more issues, as can be seen from
      the increased commit count when compared to v3.
      
      John, in the end I think we should remove handling
      MEM_TYPE_XSK_BUFF_POOL from __xdp_return(), but it is out of the scope
      for fixes set, IMHO.
      
      Thanks,
      Maciej
      
      v6:
      - add acks [Magnus]
      - fix spelling mistakes [Magnus]
      - avoid touching xdp_buff in xp_alloc_{reused,new_from_fq}() [Magnus]
      - s/shrink_data/bpf_xdp_shrink_data [Jakub]
      - remove __shrink_data() [Jakub]
      - check retvals from __xdp_rxq_info_reg() [Magnus]
      
      v5:
      - pick correct version of patch 5 [Simon]
      - elaborate a bit more on what patch 2 fixes
      
      v4:
      - do not clear frags flag when deleting tail; xsk_buff_pool now does
        that
      - skip some NULL tests for xsk_buff_get_tail [Martin, John]
      - address problems around registering xdp_rxq_info
      - fix bpf_xdp_frags_increase_tail() for ZC mbuf
      
      v3:
      - add acks
      - s/xsk_buff_tail_del/xsk_buff_del_tail
      - address i40e as well (thanks Tirthendu)
      
      v2:
      - fix !CONFIG_XDP_SOCKETS builds
      - add reviewed-by tag to patch 3
      ====================
      
      Link: https://lore.kernel.org/r/20240124191602.566724-1-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue · 0cbb0870
      Maciej Fijalkowski authored
      Now that i40e driver correctly sets up frag_size in xdp_rxq_info, let us
      make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len for ZC
      is being set via xsk_pool_get_rx_frame_size() and this needs to be
      propagated up to xdp_rxq_info.
      
      Fixes: 1c9ba9c1 ("i40e: xsk: add RX multi-buffer support")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-12-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: set xdp_rxq_info::frag_size · a045d2f2
      Maciej Fijalkowski authored
      i40e supports XDP multi-buffer so it is supposed to use
      __xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the
      frag_size. It cannot simply be converted at the existing callsite because
      rx_buf_len could be uninitialized, so let us register xdp_rxq_info
      within i40e_configure_rx_ring(), which happens to be called with an
      already initialized rx_buf_len value.
      
      Commit 5180ff13 ("i40e: use int for i40e_status") converted 'err' to
      int, so two variables to deal with return codes are not needed within
      i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status
      from xdp_rxq_info registration.
      
      Fixes: e213ced1 ("i40e: add support for XDP multi-buffer Rx")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-11-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL · fbadd83a
      Maciej Fijalkowski authored
      XSK ZC Rx path calculates the size of data that will be posted to XSK Rx
      queue by subtracting xdp_buff::data from xdp_buff::data_end.
      
      In bpf_xdp_frags_increase_tail(), when underlying memory type of
      xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail
      fragment, so that later on user space will be able to take into account
      the amount of bytes added by XDP program.
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-10-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue · 3de38c87
      Maciej Fijalkowski authored
      Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
      make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
      is being set via xsk_pool_get_rx_frame_size() and this needs to be
      propagated up to xdp_rxq_info.
      
      Use a bigger hammer and instead of unregistering only xdp_rxq_info's
      memory model, unregister it altogether and register it again and have
      xdp_rxq_info with correct frag_size value.
      
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-9-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers · 29077990
      Maciej Fijalkowski authored
      Ice and i40e ZC drivers currently set offset of a frag within
      skb_shared_info to 0, which is incorrect. xdp_buffs that come from
      xsk_buff_pool always have 256 bytes of a headroom, so they need to be
      taken into account to retrieve xdp_buff::data via skb_frag_address().
      Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
      xdp_buff::data_hard_start which would result in overwriting existing
      payload.
      
      Fixes: 1c9ba9c1 ("i40e: xsk: add RX multi-buffer support")
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-8-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
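The effect of a zero fragment offset, as described in the commit above, can be pictured with a reduced model in which a fragment's address is its buffer base plus a recorded offset. This is an illustrative sketch only: `struct frag`, `frag_address` and `frag_tail` are hypothetical simplifications of the kernel's `skb_frag_t` handling, and `HEADROOM` takes the 256-byte figure from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Headroom in front of the packet data, per the commit message. */
#define HEADROOM 256

/* Hypothetical, simplified fragment descriptor. */
struct frag {
	unsigned char *base;   /* start of the buffer (data_hard_start) */
	size_t offset;         /* where the payload begins inside the buffer */
	size_t size;           /* payload length in bytes */
};

/* Simplified analogue of skb_frag_address(): base plus recorded offset. */
unsigned char *frag_address(const struct frag *f)
{
	assert(f != NULL && f->base != NULL);
	return f->base + f->offset;
}

/* Address where growing the tail would start writing: end of payload. */
unsigned char *frag_tail(const struct frag *f)
{
	return frag_address(f) + f->size;
}
```

For a 300-byte payload stored after the headroom, an offset of 0 puts the computed tail at `base + 300`, which falls inside the real payload spanning `base + 256` to `base + 556`; with the offset set to `HEADROOM` the tail lands at `base + 556`, just past the payload.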
    • ice: remove redundant xdp_rxq_info registration · 2ee788c0
      Maciej Fijalkowski authored
      xdp_rxq_info struct can be registered by drivers via two functions -
      xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
      drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
      which in turn will make it possible to grow the packet via
      bpf_xdp_adjust_tail() BPF helper.
      
      Currently, ice registers xdp_rxq_info in two spots:
      1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
      2) ice_vsi_cfg_rxq()   // via __xdp_rxq_info_reg(), OK
      
      Cited commit under fixes tag took care of setting up frag_size and
      updated registration scheme in 2) but it did not help as
      1) is called before 2) and as shown above it uses old registration
      function. This means that 2) sees that xdp_rxq_info is already
      registered and never calls __xdp_rxq_info_reg() which leaves us with
      xdp_rxq_info::frag_size being set to 0.
      
      To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
      ice_setup_rx_ring().
      
      Fixes: 2fba7dc5 ("ice: Add support for XDP multi-buffer on Rx side")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-7-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: handle multi-buffer packets that are shrunk by xdp prog · 83014323
      Tirthendu Sarkar authored
      XDP programs can shrink packets by calling the bpf_xdp_adjust_tail()
      helper function. For multi-buffer packets this may lead to reduction of
      frag count stored in skb_shared_info area of the xdp_buff struct. This
      results in issues with the current handling of XDP_PASS and XDP_DROP
      cases.
      
      For XDP_PASS, the skb is currently built using the frag count that
      xdp_buff had before it was processed by the XDP prog, which results in
      an inconsistent skb when the XDP prog reduces the frag count. To fix
      this, get the correct frag count while building the skb instead of
      using the pre-obtained frag count.
      
      For XDP_DROP, current page recycling logic will not reuse the page but
      instead will adjust the pagecnt_bias so that the page can be freed. This
      again results in inconsistent behavior as the page refcnt has already
      been changed by the helper while freeing the frag(s) as part of
      shrinking the packet. To fix this, only adjust pagecnt_bias for buffers
      that are still part of the packet after the XDP prog run.
      
      Fixes: e213ced1 ("i40e: add support for XDP multi-buffer Rx")
      Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-6-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ice: work on pre-XDP prog frag count · ad2047cf
      Maciej Fijalkowski authored
      Fix an OOM panic in XDP_DRV mode when an XDP program shrinks a
      multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
      socket.
      
      Since support for handling multi-buffer frames was added to XDP, usage
      of bpf_xdp_adjust_tail() helper within XDP program can free the page
      that given fragment occupies and in turn decrease the fragment count
      within skb_shared_info that is embedded in xdp_buff struct. In current
      ice driver codebase, it can become problematic when page recycling logic
      decides not to reuse the page. In such case, __page_frag_cache_drain()
      is used with ice_rx_buf::pagecnt_bias that was not adjusted after
      refcount of page was changed by XDP prog which in turn does not drain
      the refcount to 0 and page is never freed.
      
      To address this, let us store the count of frags before the XDP program
      was executed on Rx ring struct. This will be used to compare with
      current frag count from skb_shared_info embedded in xdp_buff. A smaller
      value in the latter indicates that XDP prog freed frag(s). Then, for
      given delta decrement pagecnt_bias for XDP_DROP verdict.
      
      While at it, let us also handle the EOP frag within
      ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
      needed to be applied against freed frags are performed in the single
      place.
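
The comparison of the stored and current frag counts can be sketched in plain C as follows. This is a simplified user-space model, not the driver's code: the `model_*` types and the function are hypothetical, and the assumption that the freed frags are the trailing ones follows from the program shrinking the packet from the tail.

```c
#include <assert.h>
#include <stddef.h>

enum model_verdict { MODEL_PASS, MODEL_DROP };

/* Toy stand-in for a per-buffer Rx descriptor with a refcount bias. */
struct model_rx_buf {
	int pagecnt_bias;
};

/*
 * bufs[0..pre_frags-1] are the fragment buffers attached before the
 * program ran; post_frags is the count read back afterwards.
 * Returns the number of fragments the program freed.
 */
static size_t model_adjust_freed_frags(struct model_rx_buf *bufs,
				       size_t pre_frags, size_t post_frags,
				       enum model_verdict verdict)
{
	size_t delta, i;

	if (post_frags >= pre_frags)
		return 0; /* nothing was freed */

	delta = pre_frags - post_frags;
	if (verdict == MODEL_DROP) {
		/* Only the trailing, already-freed frags are adjusted. */
		for (i = post_frags; i < pre_frags; i++)
			bufs[i].pagecnt_bias--;
	}
	return delta;
}
```

Keeping the adjustment in one function mirrors the stated goal of performing all corrections for freed frags in a single place.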
      
      Fixes: 2fba7dc5 ("ice: Add support for XDP multi-buffer on Rx side")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: fix usage of multi-buffer BPF helpers for ZC XDP · c5114710
      Maciej Fijalkowski authored
      Currently, when a packet is shrunk via bpf_xdp_adjust_tail() and the
      memory type is set to MEM_TYPE_XSK_BUFF_POOL, a NULL pointer
      dereference happens:
      
      [1136314.192256] BUG: kernel NULL pointer dereference, address: 0000000000000034
      [1136314.203943] #PF: supervisor read access in kernel mode
      [1136314.213768] #PF: error_code(0x0000) - not-present page
      [1136314.223550] PGD 0 P4D 0
      [1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
      [1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
      [1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
      [1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
      [1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
      [1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX: 0000000000000000
      [1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9003168c000
      [1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09: 0000000000010000
      [1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12: 0000000000000001
      [1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15: 0000000000000001
      [1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000) knlGS:0000000000000000
      [1136314.396532] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4: 00000000007706f0
      [1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [1136314.431890] PKRU: 55555554
      [1136314.439143] Call Trace:
      [1136314.446058]  <IRQ>
      [1136314.452465]  ? __die+0x20/0x70
      [1136314.459881]  ? page_fault_oops+0x15b/0x440
      [1136314.468305]  ? exc_page_fault+0x6a/0x150
      [1136314.476491]  ? asm_exc_page_fault+0x22/0x30
      [1136314.484927]  ? __xdp_return+0x6c/0x210
      [1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
      [1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
      [1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
      [1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
      [1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
      [1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
      [1136314.546010]  __napi_poll+0x29/0x1b0
      [1136314.553462]  net_rx_action+0x133/0x270
      [1136314.561619]  __do_softirq+0xbe/0x28e
      [1136314.569303]  do_softirq+0x3f/0x60
      
      This comes from a __xdp_return() call whose xdp_buff argument is passed
      as NULL, although it is supposed to be consumed by the xsk_buff_free()
      call.
      
      To address this properly, in ZC case, a node that represents the frag
      being removed has to be pulled out of xskb_list. Introduce
      appropriate xsk helpers to do such node operation and use them
      accordingly within bpf_xdp_adjust_tail().
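
The node operation mentioned above, pulling the tail frag out of a list of buffers, can be sketched with an ordinary doubly linked list. This is a user-space illustration only: the `model_*` names are hypothetical and do not correspond to the xsk helpers the patch introduces, and the kernel's own list primitives are not used here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer linked on a pool's frag list. */
struct model_xskb {
	int id;
	struct model_xskb *prev, *next;
};

struct model_list {
	struct model_xskb *head, *tail;
	size_t len;
};

/* Append a node at the end of the list. */
static void model_list_add_tail(struct model_list *l, struct model_xskb *n)
{
	n->next = NULL;
	n->prev = l->tail;
	if (l->tail)
		l->tail->next = n;
	else
		l->head = n;
	l->tail = n;
	l->len++;
}

/* Unlink and return the tail node, or NULL if the list is empty. */
static struct model_xskb *model_list_del_tail(struct model_list *l)
{
	struct model_xskb *n = l->tail;

	if (!n)
		return NULL;
	l->tail = n->prev;
	if (l->tail)
		l->tail->next = NULL;
	else
		l->head = NULL;
	n->prev = n->next = NULL;
	l->len--;
	return n;
}
```

After the tail node is unlinked, the remaining nodes still form a consistent list, so the removed frag can be released on its own without touching the rest of the frame.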
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags · f7f6aa8e
      Maciej Fijalkowski authored
      XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
      used by drivers to notify data path whether xdp_buff contains fragments
      or not. Data path looks up mentioned flag on first buffer that occupies
      the linear part of xdp_buff, so drivers only modify it there. This is
      sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
      stack or it resides within struct representing driver's queue and
      fragments are carried via skb_frag_t structs. IOW, we are dealing with
      only one xdp_buff.
      
      ZC mode though relies on list of xdp_buff structs that is carried via
      xsk_buff_pool::xskb_list, so ZC data path has to make sure that
      fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
      xsk_buff_free() could misbehave if it would be executed against xdp_buff
      that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
      take place when within supplied XDP program bpf_xdp_adjust_tail() is
      used with negative offset that would in turn release the tail fragment
      from multi-buffer frame.
      
      Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
      result in releasing all the nodes from xskb_list that were produced by
      driver before XDP program execution, which is not what is intended -
      only tail fragment should be deleted from xskb_list and then it should
      be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
      make it up to user space, so from AF_XDP application POV there would be
      no traffic running, however due to free_list getting constantly new
      nodes, driver will be able to feed HW Rx queue with recycled buffers.
      Bottom line is that instead of traffic being redirected to user space,
      it would be continuously dropped.
      
      To fix this, let us clear the mentioned flag on xsk_buff_pool side
      during xdp_buff initialization, which is what should have been done
      right from the start of XSK multi-buffer support.
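
A toy model of why a stale flag on a tail fragment matters, and what clearing it at initialization changes. Everything here is an assumption made for illustration: `MODEL_FLAGS_HAS_FRAGS` stands in for XDP_FLAGS_HAS_FRAGS, and `model_nodes_released()` only encodes the behavior described in the text (a free of a flagged buffer releases every node on the list), not the real xsk_buff_free() logic.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FLAGS_HAS_FRAGS (1u << 0) /* stand-in for XDP_FLAGS_HAS_FRAGS */

/* Toy stand-in for a buffer handed out by the pool. */
struct model_buf {
	unsigned int flags;
};

/* Pool-side init as proposed: always start from a clean flags word. */
static void model_pool_init_buf(struct model_buf *b)
{
	b->flags = 0;
}

/*
 * How many list nodes a free of this buffer releases in the model:
 * every node if the buffer claims to have frags, otherwise only itself.
 */
static size_t model_nodes_released(const struct model_buf *b, size_t list_len)
{
	return (b->flags & MODEL_FLAGS_HAS_FRAGS) ? list_len : 1;
}
```

With a recycled buffer that still carries the flag, the model releases the whole list; after `model_pool_init_buf()` the same buffer releases only itself.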
      
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Fixes: 1c9ba9c1 ("i40e: xsk: add RX multi-buffer support")
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: recycle buffer in case Rx queue was full · 26900989
      Maciej Fijalkowski authored
      Add the missing xsk_buff_free() call for the case where __xsk_rcv_zc()
      fails to produce a descriptor to the XSK Rx queue.
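
The pattern behind this fix, returning a buffer to its pool when the receive queue cannot accept it, can be sketched as below. The `model_*` names, the fixed-capacity queue, and the -1 error value are all assumptions made for the example; they are not the kernel's types or return codes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy pool that only counts buffers available for reuse. */
struct model_pool {
	size_t free_count;
};

/* Toy Rx queue with a fixed capacity. */
struct model_rxq {
	size_t used, cap;
};

/* Try to place one descriptor on the queue; -1 means the queue is full. */
static int model_rxq_produce(struct model_rxq *q)
{
	if (q->used == q->cap)
		return -1;
	q->used++;
	return 0;
}

/* Receive one buffer; on failure, hand it back to the pool. */
static int model_rcv(struct model_pool *p, struct model_rxq *q)
{
	int err = model_rxq_produce(q);

	if (err)
		p->free_count++; /* recycle instead of leaking the buffer */
	return err;
}
```

Without the recycle step in the failure branch, every buffer that hits a full queue would be lost to the pool in this model.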
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 24 Jan, 2024 7 commits