1. 21 Mar, 2024 10 commits
    • MAINTAINERS: step down as netfilter maintainer · b5048d27
      Florian Westphal authored
      I do not feel that I'm up to the task anymore.
      
      I hope this to be a temporary emergency measure, but for now I'm sure this
      is the best course of action for me.

      Signed-off-by: Florian Westphal <fw@strlen.de>
      Link: https://lore.kernel.org/r/20240319121223.24474-1-fw@strlen.de
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'mt7530-dsa-subdriver-fix-vlan-egress-and-handling-of-all-link-local-frames' · 61fbfac1
      Paolo Abeni authored
      Arınç ÜNAL says:
      
      ====================
      MT7530 DSA subdriver fix VLAN egress and handling of all link-local frames
      
      This patch series fixes the VLAN tag egress procedure for link-local
      frames, and fixes handling of all link-local frames.
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      ====================
      
      Link: https://lore.kernel.org/r/20240314-b4-for-net-mt7530-fix-link-local-vlan-v2-0-7dbcf6429ba0@arinc9.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: mt7530: fix handling of all link-local frames · 69ddba9d
      Arınç ÜNAL authored
      Currently, the MT753X switches treat frames with :01-0D and :0F MAC DAs as
      regular multicast frames, therefore flooding them to user ports.
      
      On page 205, section "8.6.3 Frame filtering" of the active standard, IEEE
      Std 802.1Q-2022, it is stated that frames with 01:80:C2:00:00:00-0F as MAC
      DA must only be propagated to C-VLAN and MAC Bridge components. That means
      VLAN-aware and VLAN-unaware bridges. On the switch designs with CPU ports,
      these frames are supposed to be processed by the CPU (software). So we make
      the switch only forward them to the CPU port. And if received from a CPU
      port, forward to a single port. The software is responsible for making the
      switch conform to the latter by setting a single port as destination port
      on the special tag.
      
      This switch intellectual property cannot conform to this part of the
      standard fully. Whilst the REV_UN frame tag covers the remaining :04-0D and
      :0F MAC DAs, it also includes :22-FF, for which the scope of propagation is
      not supposed to be restricted.
      
      Set frames with :01-03 MAC DAs to be trapped to the CPU port(s). Add a
      comment for the remaining MAC DAs.
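
The address ranges discussed above can be sketched as a small classifier. This is only an illustrative Python model of the ranges named in the commit message, not the driver code; the function names are hypothetical.

```python
# Illustrative model of the MAC DA ranges named in the commit message.
# All function names here are hypothetical; this is not the mt7530 driver.

LINK_LOCAL_PREFIX = bytes.fromhex("0180c20000")  # 01:80:C2:00:00:xx


def parse_mac(text: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    parts = text.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {text!r}")
    return bytes(int(part, 16) for part in parts)


def is_link_local(mac: bytes) -> bool:
    """True for 01:80:C2:00:00:00 through 01:80:C2:00:00:0F."""
    return len(mac) == 6 and mac[:5] == LINK_LOCAL_PREFIX and mac[5] <= 0x0F


def trapped_by_this_change(mac: bytes) -> bool:
    """True for the :01-:03 DAs that the commit sets to trap to the CPU port."""
    return is_link_local(mac) and 0x01 <= mac[5] <= 0x03


def left_as_multicast(mac: bytes) -> bool:
    """True for :04-:0D and :0F, which only REV_UN would cover.

    REV_UN also matches :22-:FF, so the commit only adds a comment for these.
    """
    return is_link_local(mac) and (0x04 <= mac[5] <= 0x0D or mac[5] == 0x0F)
```

For example, `trapped_by_this_change(parse_mac("01:80:C2:00:00:02"))` is true, while the same call for `01:80:C2:00:00:0F` is false because that address falls in the range the commit leaves untouched.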
      
      Note that the ingress port must have a PVID assigned to it for the switch
      to forward untagged frames. A PVID is set by default on VLAN-aware and
      VLAN-unaware ports. However, when the network interface that pertains to
      the ingress port is attached to a vlan_filtering enabled bridge, the user
      can remove the PVID assignment from it which would prevent the link-local
      frames from being trapped to the CPU port. I am yet to see a way to forward
      link-local frames while preventing other untagged frames from being
      forwarded too.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: mt7530: fix link-local frames that ingress vlan filtering ports · e8bf3535
      Arınç ÜNAL authored
      Whether VLAN-aware or not, on every VID VLAN table entry that has the CPU
      port as a member of it, frames are set to egress the CPU port with the VLAN
      tag stacked. This is so that VLAN tags can be appended after the hardware
      special tag (called the DSA tag in the context of Linux drivers).
      
      For user ports on a VLAN-unaware bridge, a frame ingressing the user port
      egresses the CPU port with only the special tag.
      
      For user ports on a VLAN-aware bridge, a frame ingressing the user port
      egresses the CPU port with the special tag and the VLAN tag.
      
      This causes issues with link-local frames, specifically BPDUs, because the
      software expects to receive them VLAN-untagged.
      
      There are two options to make link-local frames egress untagged: setting
      CONSISTENT or UNTAGGED on the EG_TAG bits of the relevant register.
      CONSISTENT means frames egress exactly as they ingress. That means
      egressing with the VLAN tag they had at ingress or egressing untagged if
      they ingressed untagged. Although link-local frames are not supposed to be
      transmitted VLAN-tagged, if they are done so, when egressing through a CPU
      port, the special tag field will be broken.
      
      BPDU egresses CPU port with VLAN tag egressing stacked, received on
      software:
      
      00:01:25.104821 AF Unknown (382365846), length 106:
                                           | STAG  | | VLAN  |
              0x0000:  0000 6c27 614d 4143 0001 0000 8100 0001  ..l'aMAC........
              0x0010:  0026 4242 0300 0000 0000 0000 6c27 614d  .&BB........l'aM
              0x0020:  4143 0000 0000 0000 6c27 614d 4143 0000  AC......l'aMAC..
              0x0030:  0000 1400 0200 0f00 0000 0000 0000 0000  ................
      
      BPDU egresses CPU port with VLAN tag egressing untagged, received on
      software:
      
      00:23:56.628708 AF Unknown (25215488), length 64:
                                           | STAG  |
              0x0000:  0000 6c27 614d 4143 0001 0000 0026 4242  ..l'aMAC.....&BB
              0x0010:  0300 0000 0000 0000 6c27 614d 4143 0000  ........l'aMAC..
              0x0020:  0000 0000 6c27 614d 4143 0000 0000 1400  ....l'aMAC......
              0x0030:  0200 0f00 0000 0000 0000 0000            ............
      
      BPDU egresses CPU port with VLAN tag egressing tagged, received on
      software:
      
      00:01:34.311963 AF Unknown (25215488), length 64:
                                           | Mess  |
              0x0000:  0000 6c27 614d 4143 0001 0001 0026 4242  ..l'aMAC.....&BB
              0x0010:  0300 0000 0000 0000 6c27 614d 4143 0000  ........l'aMAC..
              0x0020:  0000 0000 6c27 614d 4143 0000 0000 1400  ....l'aMAC......
              0x0030:  0200 0f00 0000 0000 0000 0000            ............
      
      To prevent confusing the software, force the frame to egress UNTAGGED
      instead of CONSISTENT. This way, frames can't possibly be received TAGGED
      by software which would have the special tag field broken.
      
      VLAN Tag Egress Procedure
      
         For all frames, the first of these options that is set, in this order,
         applies to the frame:
      
         - EG_TAG in certain registers for certain frames.
           This will apply to frame with matching MAC DA or EtherType.
      
         - EG_TAG in the address table.
           This will apply to frame at its incoming port.
      
         - EG_TAG in the PVC register.
           This will apply to frame at its incoming port.
      
         - EG_CON and [EG_TAG per port] in the VLAN table.
           This will apply to frame at its outgoing port.
      
         - EG_TAG in the PCR register.
           This will apply to frame at its outgoing port.
      
         EG_TAG in certain registers for certain frames:
      
         PPPoE Discovery_ARP/RARP: PPP_EG_TAG and ARP_EG_TAG in the APC register.
         IGMP_MLD: IGMP_EG_TAG and MLD_EG_TAG in the IMC register.
         BPDU and PAE: BPDU_EG_TAG and PAE_EG_TAG in the BPC register.
         REV_01 and REV_02: R01_EG_TAG and R02_EG_TAG in the RGAC1 register.
         REV_03 and REV_0E: R03_EG_TAG and R0E_EG_TAG in the RGAC2 register.
         REV_10 and REV_20: R10_EG_TAG and R20_EG_TAG in the RGAC3 register.
         REV_21 and REV_UN: R21_EG_TAG and RUN_EG_TAG in the RGAC4 register.
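
The "first option that is set wins" rule of the VLAN Tag Egress Procedure above can be sketched as a simple lookup. This is a minimal Python model under the assumption that each source either holds a setting or is unset; the source labels and the value strings are hypothetical and do not correspond to register encodings.

```python
from typing import Dict, Optional

# Order follows the list in the commit message; the labels are hypothetical.
EG_TAG_PRECEDENCE = [
    "frame_type_register",  # EG_TAG for frames matching a MAC DA or EtherType
    "address_table",        # applies to the frame at its incoming port
    "pvc_register",         # applies to the frame at its incoming port
    "vlan_table",           # EG_CON and per-port EG_TAG, outgoing port
    "pcr_register",         # applies to the frame at its outgoing port
]


def effective_eg_tag(settings: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the egress tag mode from the earliest source that is set."""
    for source in EG_TAG_PRECEDENCE:
        mode = settings.get(source)
        if mode is not None:
            return mode
    return None  # nothing configured in this model
```

In this model, a BPDU with `{"frame_type_register": "untagged", "vlan_table": "stacked"}` resolves to `"untagged"`, which mirrors why setting BPDU_EG_TAG overrides the stacked setting in the VLAN table.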
      
      With this change, it can be observed that a bridge interface with stp_state
      and vlan_filtering enabled will properly block ports now.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'report-rcu-qs-for-busy-network-kthreads' · 3201de46
      Jakub Kicinski authored
      Yan Zhai says:
      
      ====================
      Report RCU QS for busy network kthreads
      
      This changeset fixes a common problem for busy networking kthreads.
      These threads, e.g. NAPI threads, typically will do:
      
      * polling a batch of packets
      * if there is more work, call cond_resched() to allow scheduling
      * continue to poll more packets when rx queue is not empty
      
      We observed this being a problem in production, since it can block RCU
      tasks from making progress under heavy load. Investigation indicates
      that just calling cond_resched() is insufficient for RCU tasks to reach
      quiescent states. This also has the side effect of frequently clearing
      the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result,
      schedule() will not be called in these circumstances, even though schedule()
      in fact provides the required quiescent states. This at least affects NAPI
      threads, napi_busy_loop, and also cpumap kthread.
      
      By reporting RCU QSes in these kthreads periodically before cond_resched, the
      blocked RCU waiters can correctly progress. Rather than reporting a QS only
      for RCU tasks, note that this code shares the concern described in commit
      d28139c4 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe"),
      so report a consolidated QS for safety.
      
      It is worth noting that, although this problem is reproducible in
      napi_busy_loop, it only shows up when setting the polling interval to as high
      as 2ms, which is far larger than the 50us-100us recommended in the documentation.
      So napi_busy_loop is left untouched.
      
      Lastly, this does not affect RT kernels, which do not enter the scheduler
      through cond_resched(). Without the mentioned side effect, schedule() will
      be called from time to time and clear the RCU task holdouts.
      
      V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/
      V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/
      V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
      V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t
      ====================
      
      Link: https://lore.kernel.org/r/cover.1710877680.git.yan@cloudflare.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bpf: report RCU QS in cpumap kthread · 00bf6312
      Yan Zhai authored
      Under heavy load, cpumap kernel threads can be busy polling
      packets from redirect queues and block RCU tasks from reaching
      quiescent states. It is insufficient to just call cond_resched() in
      such a context. Periodically raising a consolidated RCU QS before
      cond_resched() fixes the problem.
      
      Fixes: 6710e112 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
      Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Signed-off-by: Yan Zhai <yan@cloudflare.com>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: report RCU QS on threaded NAPI repolling · d6dbbb11
      Yan Zhai authored
      NAPI threads can keep polling packets under load. Currently the thread
      only calls cond_resched() before repolling, but that is not sufficient
      to clear out the holdout of RCU tasks, which prevents BPF tracing
      programs from detaching for a long period. This can be reproduced
      easily with the following setup:
      
      ip netns add test1
      ip netns add test2
      
      ip -n test1 link add veth1 type veth peer name veth2 netns test2
      
      ip -n test1 link set veth1 up
      ip -n test1 link set lo up
      ip -n test2 link set veth2 up
      ip -n test2 link set lo up
      
      ip -n test1 addr add 192.168.1.2/31 dev veth1
      ip -n test1 addr add 1.1.1.1/32 dev lo
      ip -n test2 addr add 192.168.1.3/31 dev veth2
      ip -n test2 addr add 2.2.2.2/31 dev lo
      
      ip -n test1 route add default via 192.168.1.3
      ip -n test2 route add default via 192.168.1.2
      
      for i in `seq 10 210`; do
       for j in `seq 10 210`; do
          ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
       done
      done
      
      ip netns exec test2 ethtool -K veth2 gro on
      ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
      ip netns exec test1 ethtool -K veth1 tso off
      
      Then run an iperf3 client/server and a bpftrace script can trigger it:
      
      ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
      ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
      bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
      
      Reporting RCU quiescent states periodically resolves the issue.
      
      Fixes: 29863d41 ("net: implement threaded-able napi poll loop support")
      Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Signed-off-by: Yan Zhai <yan@cloudflare.com>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Link: https://lore.kernel.org/r/4c3b0d3f32d3b18949d75b18e5e1d9f13a24f025.1710877680.git.yan@cloudflare.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • rcu: add a helper to report consolidated flavor QS · 1a77557d
      Yan Zhai authored
      When under heavy load, network processing can run CPU-bound for many
      tens of seconds. Even in preemptible kernels (non-RT kernel), this can
      block RCU Tasks grace periods, which can cause trace-event removal to
      take more than a minute, which is unacceptably long.
      
      This commit therefore creates a new helper function that passes through
      both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
      hard-coded value suffices for current workloads.
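
The idea of reporting a quiescent state at most once per 100 milliseconds from inside a busy loop can be modelled in user space as a time-gated callback. The sketch below is a Python analogy only, not the kernel helper: the class name, the `report` callback, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable

QS_INTERVAL_MS = 100  # the hard-coded interval mentioned in the commit message


def _monotonic_ms() -> int:
    """Monotonic clock in whole milliseconds."""
    return time.monotonic_ns() // 1_000_000


class PeriodicQsReporter:
    """Hypothetical model: call `report` at most once per QS_INTERVAL_MS."""

    def __init__(self, report: Callable[[], None],
                 clock_ms: Callable[[], int] = _monotonic_ms) -> None:
        self._report = report
        self._clock_ms = clock_ms
        self._last_ms = clock_ms()

    def maybe_report(self) -> bool:
        """Meant to be called once per poll iteration, before yielding."""
        now = self._clock_ms()
        if now - self._last_ms < QS_INTERVAL_MS:
            return False  # too soon, keep polling
        self._last_ms = now
        self._report()
        return True
```

A polling loop would call `maybe_report()` on every iteration; the time gate keeps the reporting cost negligible while bounding how long a waiter can be held up.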

      Suggested-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Signed-off-by: Yan Zhai <yan@cloudflare.com>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Link: https://lore.kernel.org/r/90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ionic: update documentation for XDP support · f7bf0ec1
      Shannon Nelson authored
      Add information to our documentation for the XDP features
      and related ethtool stats.
      
      While we're here, we also add the missing timestamp stats.

      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240319163534.38796-1-shannon.nelson@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel doc · 2d9d9f25
      Herve Codina authored
      The make htmldocs command failed with the following errors:
        ... include/linux/bitmap.h:524: ERROR: Unexpected indentation.
        ... include/linux/bitmap.h:524: CRITICAL: Unexpected section title or transition.
      
      Move the visual representation to a literal block.
      
      Fixes: de5f8433 ("lib/bitmap: Introduce bitmap_scatter() and bitmap_gather() helpers")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Closes: https://lore.kernel.org/linux-kernel/20240312153059.3ffde1b7@canb.auug.org.au/
      Signed-off-by: Herve Codina <herve.codina@bootlin.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Acked-by: Yury Norov <yury.norov@gmail.com>
      Link: https://lore.kernel.org/r/20240314120006.458580-1-herve.codina@bootlin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 20 Mar, 2024 10 commits
    • Merge branch 'octeontx2-pf-mbox-fixes' · 9c6a5954
      David S. Miller authored
      Subbaraya Sundeep says:
      
      ====================
      octeontx2-pf: RVU Mailbox fixes
      
      This patchset fixes problems related to the RVU mailbox.
      During long-running tests, VF commands such as setting the
      MTU or toggling the interface sometimes fail because the VF
      mailbox times out waiting for a response from the PF.
      
      Below are the fixes.

      Patch 1:
      There are two types of messages in the RVU mailbox: up and down
      messages. Down messages are synchronous: a PF/VF sends a message to
      the AF and the AF replies with a response. Up messages are
      asynchronous notifications, such as the AF sending link events to
      the PF. When a VF sends a down message to the PF, the PF forwards it
      to the AF and sends the AF's response back to the VF. The PF has to
      forward VF messages since there is no path in hardware for a VF to
      send directly to the AF.
      A single mailbox interrupt from the AF to the PF can mean one of two
      things: the AF is replying to a down message sent by the PF, or the
      AF is sending an up message asynchronously because the link changed
      for that PF. Receiving the up message interrupt while the PF is in
      the middle of forwarding a down message causes mailbox errors.
      Fix this by having the receiver detect the type of message from the
      mbox data register set by the sender.
      
      Patch 2:
      During VF driver remove, VF has to wait until last message is
      completed and then turn off mailbox interrupts from PF.
      
      Patch 3:
      Do not use ordered workqueue for message processing since multiple works are
      queued simultaneously by all the VFs and PF link UP messages.
      
      Patch 4:
      When the PF sends a link event to a VF, check whether the VF is really
      up to receive this message.
      
      Patch 5:
      In AF driver, use separate interrupt handlers for the AF-VF interrupt and
      AF-PF interrupt. Sometimes both interrupts are raised on two CPUs at the
      same time and both CPUs execute the same function at once, corrupting the data.
      
      v2 changes:
      	Added missing mutex unlock in error path in patch 1
      	Refactored if else logic in patch 1 as suggested by Paolo Abeni
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-af: Use separate handlers for interrupts · 50e60de3
      Subbaraya Sundeep authored
      The same interrupt handler is registered for the PF to AF interrupt
      vector and the VF to AF vector, which causes a race condition.
      When the two interrupts are raised on two CPUs at the same time,
      both cores serve the same event, corrupting the data.
      
      Fixes: 7304ac45 ("octeontx2-af: Add mailbox IRQ and msg handlers")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-pf: Send UP messages to VF only when VF is up. · dfcf6355
      Subbaraya Sundeep authored
      When PF sends link status messages to VF, it is possible that
      by the time the link_event_task work function is executed the
      VF has been brought down. Hence, before sending the VF link
      status message, check whether VF is up to receive it.
      
      Fixes: ad513ed9 ("octeontx2-vf: Link event notification support")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-pf: Use default max_active works instead of one · 7558ce0d
      Subbaraya Sundeep authored
      Using only one execution context for the workqueue that handles PF and
      VF mailbox communication is incorrect, since multiple works are
      queued simultaneously by all the VFs and by PF link UP messages.
      Hence use the default number of execution contexts by passing zero
      as max_active to the alloc_workqueue function. With this fix in place,
      also modify UP messages to wait until completion.
      
      Fixes: d424b6c0 ("octeontx2-pf: Enable SRIOV and added VF mbox handling")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-pf: Wait till detach_resources msg is complete · cbf2f249
      Subbaraya Sundeep authored
      During VF driver remove, a message to detach VF resources is
      sent to PF, but VF does not wait until the message is
      complete. Also, mailbox interrupts need to be turned off
      only after the detach resource message is complete. This patch
      fixes that problem.
      
      Fixes: 05fcc9e0 ("octeontx2-pf: Attach NIX and NPA block LFs")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2: Detect the mbox up or down message via register · a88e0f93
      Subbaraya Sundeep authored
      A single interrupt line is used to receive up notifications and
      down reply messages from AF to PF (and similarly from PF to its VF).
      PF acts as a bridge: it forwards VF messages to AF and sends
      responses back from AF to VF. When an async event such as a link
      event arrives as an up message while PF is in the middle of
      forwarding a VF message, mailbox errors occur because the PF state
      machine is corrupted.
      Since VF is a separate driver, or the VF driver can be in a VM, it is
      not possible to serialize from the start of communication at VF.
      Hence, to differentiate between message types at PF, this patch makes
      the sender set the mbox data register to distinct values for up and
      down messages. The sender also checks that the previous interrupt
      has been received before triggering the current one, by waiting for
      the mailbox data register to become zero.
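
The handshake described above can be illustrated with a toy model of the data register. This Python sketch is an assumption-laden illustration, not the octeontx2 driver: the constant values, the class, and the function names are hypothetical, and it assumes the receiver is the side that clears the register back to zero.

```python
# Toy model of the mailbox data register handshake.
# Constant values and all names are hypothetical.

MBOX_IDLE = 0       # register is zero: previous interrupt was consumed
MBOX_DOWN_MSG = 1   # hypothetical marker for a down (request/reply) message
MBOX_UP_MSG = 2     # hypothetical marker for an up (notification) message


class MboxDataRegister:
    def __init__(self) -> None:
        self.value = MBOX_IDLE


def try_send(reg: MboxDataRegister, kind: int) -> bool:
    """Sender side: only trigger when the register has returned to zero.

    The real sender waits for zero; this sketch just reports 'busy'.
    """
    if reg.value != MBOX_IDLE:
        return False
    reg.value = kind
    return True


def receive(reg: MboxDataRegister) -> str:
    """Receiver side: read the message type, then clear the register."""
    kind = reg.value
    reg.value = MBOX_IDLE  # assumption: the receiver clears the register
    if kind == MBOX_DOWN_MSG:
        return "down"
    if kind == MBOX_UP_MSG:
        return "up"
    return "none"
```

Because the type travels with the interrupt, the receiver no longer has to guess whether a shared interrupt carries a reply or a notification, and a second send cannot overwrite one that has not yet been consumed.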
      
      Fixes: 5a6d7c9d ("octeontx2-pf: Mailbox communication with AF")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · 94e3ca2f
      Jakub Kicinski authored
      Steffen Klassert says:
      
      ====================
      pull request (net): ipsec 2024-03-19
      
      1) Fix possible page_pool leak triggered by esp_output.
         From Dragos Tatulea.
      
      2) Fix UDP encapsulation in software GSO path.
         From Leon Romanovsky.
      
      * tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
        xfrm: Allow UDP encapsulation only in offload modes
        net: esp: fix bad handling of pages from page_pool
      ====================
      
      Link: https://lore.kernel.org/r/20240319110151.409825-1-steffen.klassert@secunet.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: fix port new reply cmd type · 78a2f5e6
      Jiri Pirko authored
      Due to a copy-and-paste error, the port new reply fills cmd with the wrong
      value, unlike any other existing port command replies and notifications.
      
      Fix it by filling cmd with value DEVLINK_CMD_PORT_NEW.
      
      I skimmed through the devlink userspace implementations; none of them
      cares about this cmd value.

      Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
      Closes: https://lore.kernel.org/all/ZfZcDxGV3tSy4qsV@cy-server/
      Fixes: cd76dcd6 ("devlink: Support add and delete devlink port")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Link: https://lore.kernel.org/r/20240318091908.2736542-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Clear req->syncookie in reqsk_alloc(). · 956c0d61
      Kuniyuki Iwashima authored
      syzkaller reported a read of uninit req->syncookie. [0]
      
      Originally, req->syncookie was used only in tcp_conn_request()
      to indicate if we need to encode SYN cookie in SYN+ACK, so the
      field remains uninitialised in other places.
      
      The commit 695751e3 ("bpf: tcp: Handle BPF SYN Cookie in
      cookie_v[46]_check().") added another meaning in ACK path;
      req->syncookie is set true if SYN cookie is validated by BPF
      kfunc.
      
      After the change, cookie_v[46]_check() always reads req->syncookie,
      but it is not initialised in the normal SYN cookie case, as reported
      by KMSAN.
      
      Let's make sure we always initialise req->syncookie in reqsk_alloc().
      
      [0]:
      BUG: KMSAN: uninit-value in cookie_v4_check+0x22b7/0x29e0
       net/ipv4/syncookies.c:477
       cookie_v4_check+0x22b7/0x29e0 net/ipv4/syncookies.c:477
       tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline]
       tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914
       tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
       ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
       dst_input include/net/dst.h:460 [inline]
       ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569
       __netif_receive_skb_one_core net/core/dev.c:5538 [inline]
       __netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652
       process_backlog+0x480/0x8b0 net/core/dev.c:5981
       __napi_poll+0xe7/0x980 net/core/dev.c:6632
       napi_poll net/core/dev.c:6701 [inline]
       net_rx_action+0x89d/0x1820 net/core/dev.c:6813
       __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
       do_softirq+0x9a/0x100 kernel/softirq.c:455
       __local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:382
       local_bh_enable include/linux/bottom_half.h:33 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:820 [inline]
       __dev_queue_xmit+0x2776/0x52c0 net/core/dev.c:4362
       dev_queue_xmit include/linux/netdevice.h:3091 [inline]
       neigh_hh_output include/net/neighbour.h:526 [inline]
       neigh_output include/net/neighbour.h:540 [inline]
       ip_finish_output2+0x187a/0x1b70 net/ipv4/ip_output.c:235
       __ip_finish_output+0x287/0x810
       ip_finish_output+0x4b/0x550 net/ipv4/ip_output.c:323
       NF_HOOK_COND include/linux/netfilter.h:303 [inline]
       ip_output+0x15f/0x3f0 net/ipv4/ip_output.c:433
       dst_output include/net/dst.h:450 [inline]
       ip_local_out net/ipv4/ip_output.c:129 [inline]
       __ip_queue_xmit+0x1e93/0x2030 net/ipv4/ip_output.c:535
       ip_queue_xmit+0x60/0x80 net/ipv4/ip_output.c:549
       __tcp_transmit_skb+0x3c70/0x4890 net/ipv4/tcp_output.c:1462
       tcp_transmit_skb net/ipv4/tcp_output.c:1480 [inline]
       tcp_write_xmit+0x3ee1/0x8900 net/ipv4/tcp_output.c:2792
       __tcp_push_pending_frames net/ipv4/tcp_output.c:2977 [inline]
       tcp_send_fin+0xa90/0x12e0 net/ipv4/tcp_output.c:3578
       tcp_shutdown+0x198/0x1f0 net/ipv4/tcp.c:2716
       inet_shutdown+0x33f/0x5b0 net/ipv4/af_inet.c:923
       __sys_shutdown_sock net/socket.c:2425 [inline]
       __sys_shutdown net/socket.c:2437 [inline]
       __do_sys_shutdown net/socket.c:2445 [inline]
       __se_sys_shutdown+0x2a4/0x440 net/socket.c:2443
       __x64_sys_shutdown+0x6c/0xa0 net/socket.c:2443
       do_syscall_64+0xd5/0x1f0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Uninit was stored to memory at:
       reqsk_alloc include/net/request_sock.h:148 [inline]
       inet_reqsk_alloc+0x651/0x7a0 net/ipv4/tcp_input.c:6978
       cookie_tcp_reqsk_alloc+0xd4/0x900 net/ipv4/syncookies.c:328
       cookie_tcp_check net/ipv4/syncookies.c:388 [inline]
       cookie_v4_check+0x289f/0x29e0 net/ipv4/syncookies.c:420
       tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline]
       tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914
       tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
       ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
       dst_input include/net/dst.h:460 [inline]
       ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569
       __netif_receive_skb_one_core net/core/dev.c:5538 [inline]
       __netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652
       process_backlog+0x480/0x8b0 net/core/dev.c:5981
       __napi_poll+0xe7/0x980 net/core/dev.c:6632
       napi_poll net/core/dev.c:6701 [inline]
       net_rx_action+0x89d/0x1820 net/core/dev.c:6813
       __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
      
      Uninit was created at:
       __alloc_pages+0x9a7/0xe00 mm/page_alloc.c:4592
       __alloc_pages_node include/linux/gfp.h:238 [inline]
       alloc_pages_node include/linux/gfp.h:261 [inline]
       alloc_slab_page mm/slub.c:2175 [inline]
       allocate_slab mm/slub.c:2338 [inline]
       new_slab+0x2de/0x1400 mm/slub.c:2391
       ___slab_alloc+0x1184/0x33d0 mm/slub.c:3525
       __slab_alloc mm/slub.c:3610 [inline]
       __slab_alloc_node mm/slub.c:3663 [inline]
       slab_alloc_node mm/slub.c:3835 [inline]
       kmem_cache_alloc+0x6d3/0xbe0 mm/slub.c:3852
       reqsk_alloc include/net/request_sock.h:131 [inline]
       inet_reqsk_alloc+0x66/0x7a0 net/ipv4/tcp_input.c:6978
       tcp_conn_request+0x484/0x44e0 net/ipv4/tcp_input.c:7135
       tcp_v4_conn_request+0x16f/0x1d0 net/ipv4/tcp_ipv4.c:1716
       tcp_rcv_state_process+0x2e5/0x4bb0 net/ipv4/tcp_input.c:6655
       tcp_v4_do_rcv+0xbfd/0x10b0 net/ipv4/tcp_ipv4.c:1929
       tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
       ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
       dst_input include/net/dst.h:460 [inline]
       ip_sublist_rcv_finish net/ipv4/ip_input.c:580 [inline]
       ip_list_rcv_finish net/ipv4/ip_input.c:631 [inline]
       ip_sublist_rcv+0x15f3/0x17f0 net/ipv4/ip_input.c:639
       ip_list_rcv+0x9ef/0xa40 net/ipv4/ip_input.c:674
       __netif_receive_skb_list_ptype net/core/dev.c:5581 [inline]
       __netif_receive_skb_list_core+0x15c5/0x1670 net/core/dev.c:5629
       __netif_receive_skb_list net/core/dev.c:5681 [inline]
       netif_receive_skb_list_internal+0x106c/0x16f0 net/core/dev.c:5773
       gro_normal_list include/net/gro.h:438 [inline]
       napi_complete_done+0x425/0x880 net/core/dev.c:6113
       virtqueue_napi_complete drivers/net/virtio_net.c:465 [inline]
       virtnet_poll+0x149d/0x2240 drivers/net/virtio_net.c:2211
       __napi_poll+0xe7/0x980 net/core/dev.c:6632
       napi_poll net/core/dev.c:6701 [inline]
       net_rx_action+0x89d/0x1820 net/core/dev.c:6813
       __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
      
      CPU: 0 PID: 16792 Comm: syz-executor.2 Not tainted 6.8.0-syzkaller-05562-g61387b8d #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
      
      Fixes: 695751e3 ("bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Closes: https://lore.kernel.org/bpf/CANn89iKdN9c+C_2JAUbc+VY3DDQjAQukMtiBbormAmAk9CdvQA@mail.gmail.com/
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240315224710.55209-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      956c0d61
    • Thinh Tran's avatar
      net/bnx2x: Prevent access to a freed page in page_pool · d27e2da9
      Thinh Tran authored
      Fix race condition leading to system crash during EEH error handling
      
      During EEH error recovery, the bnx2x driver's transmit timeout logic
      could cause a race condition when handling reset tasks. The
      bnx2x_tx_timeout() schedules reset tasks via bnx2x_sp_rtnl_task(),
      which ultimately leads to bnx2x_nic_unload(). In bnx2x_nic_unload()
      SGEs are freed using bnx2x_free_rx_sge_range(). However, this could
      overlap with the EEH driver's attempt to reset the device using
      bnx2x_io_slot_reset(), which also tries to free SGEs. This race
      condition can result in system crashes due to accessing freed memory
      locations in bnx2x_free_rx_sge():
      
      static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
                                           struct bnx2x_fastpath *fp, u16 index)
      {
              struct sw_rx_page *sw_buf = &fp->rx_page_ring[index];
              struct page *page = sw_buf->page;
              ....
      where sw_buf was set to NULL after the call to dma_unmap_page()
      by the preceding thread.
      
          EEH: Beginning: 'slot_reset'
          PCI 0011:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
          bnx2x: [bnx2x_io_slot_reset:14228(eth1)]IO slot reset initializing...
          bnx2x 0011:01:00.0: enabling device (0140 -> 0142)
          bnx2x: [bnx2x_io_slot_reset:14244(eth1)]IO slot reset --> driver unload
          Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
          BUG: Kernel NULL pointer dereference on read at 0x00000000
          Faulting instruction address: 0xc0080000025065fc
          Oops: Kernel access of bad area, sig: 11 [#1]
          .....
          Call Trace:
          [c000000003c67a20] [c00800000250658c] bnx2x_io_slot_reset+0x204/0x610 [bnx2x] (unreliable)
          [c000000003c67af0] [c0000000000518a8] eeh_report_reset+0xb8/0xf0
          [c000000003c67b60] [c000000000052130] eeh_pe_report+0x180/0x550
          [c000000003c67c70] [c00000000005318c] eeh_handle_normal_event+0x84c/0xa60
          [c000000003c67d50] [c000000000053a84] eeh_event_handler+0xf4/0x170
          [c000000003c67da0] [c000000000194c58] kthread+0x1c8/0x1d0
          [c000000003c67e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
      
      To solve this issue, we need to verify page pool allocations before
      freeing.
      
      Fixes: 4cace675 ("bnx2x: Alloc 4k fragment for each rx ring buffer element")
      Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20240315205535.1321-1-thinhtr@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d27e2da9
  3. 19 Mar, 2024 14 commits
  4. 18 Mar, 2024 6 commits
    • Abhishek Chauhan's avatar
      Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets" · 35c3e279
      Abhishek Chauhan authored
      This reverts commit 885c36e5.
      
      The patch broke the bpf selftest test_tc_dtime: the uapi field
      __sk_buff->tstamp_type depends on skb->mono_delivery_time, which, with
      the original fix, no longer necessarily means mono, because the bit was
      re-used for userspace timestamps as well to avoid a tstamp reset in the
      forwarding path. To solve this, we need to keep mono_delivery_time as is,
      introduce another bit called user_delivery_time, and fall back to the
      initial proposal of setting the user_delivery_time bit based on the
      sk_clockid set from userspace.
      
      Fixes: 885c36e5 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
      Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
      Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35c3e279
    • Arınç ÜNAL's avatar
      net: dsa: mt7530: prevent possible incorrect XTAL frequency selection · f490c492
      Arınç ÜNAL authored
      On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
      value that represents the frequency of the crystal oscillator connected to
      the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
      ESW_P3_LED_0 pins, which is done right after reset is deasserted.
      
        ESW_P4_LED_0    ESW_P3_LED_0    Frequency
        -----------------------------------------
        0               0               Reserved
        0               1               20MHz
        1               0               40MHz
        1               1               25MHz
      
      On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
      pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
      it's low.
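The strap decoding described above can be sketched as a small lookup. This is an illustration only, not driver code: the function names are hypothetical, and the bit order (bit 1 from ESW_P4_LED_0, bit 0 from ESW_P3_LED_0) is an assumption chosen to match the 0b10 = 40MHz and 0b11 = 25MHz values quoted later in this message. Likewise, XTAL25 being set when LAN0LED0 is high is assumed.

```python
from typing import Optional

# Table from the commit message: (ESW_P4_LED_0, ESW_P3_LED_0) -> MHz.
# None marks the reserved combination.
MT7530_XTAL_MHZ = {
    (0, 0): None,
    (0, 1): 20,
    (1, 0): 40,
    (1, 1): 25,
}


def mt7530_xtal_mhz(ht_xtal_fsel: int) -> Optional[int]:
    """Decode a 2-bit HT_XTAL_FSEL value into a frequency in MHz.

    Assumption: bit 1 mirrors ESW_P4_LED_0, bit 0 mirrors ESW_P3_LED_0.
    """
    if not 0 <= ht_xtal_fsel <= 0b11:
        raise ValueError("HT_XTAL_FSEL is a 2-bit field")
    p4 = (ht_xtal_fsel >> 1) & 1
    p3 = ht_xtal_fsel & 1
    return MT7530_XTAL_MHZ[(p4, p3)]


def mt7531_xtal_mhz(xtal25: int) -> int:
    """Decode the MT7531 XTAL25 bit (assumed set when LAN0LED0 is high)."""
    return 25 if xtal25 else 40
```

Under these assumptions, a board with a 40MHz crystal straps 0b10. If ESW_P3_LED_0 is still high when the field is sampled, the value reads 0b11 and decodes as 25MHz.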
      
      These pins are also used with LEDs, therefore, their state can be set to
      something other than the bootstrapping configuration. For example, a link
      may be established on port 3 before the DSA subdriver takes control of the
      switch which would set ESW_P3_LED_0 to high.
      
      Currently, mt7530_setup() and mt7531_setup() apply a delay of 1000 - 1100
      usec between reset assertion and deassertion. Under real-life conditions,
      some switch ICs cannot always return these pins to the bootstrapping
      configuration before reset deassertion within this delay. This causes the
      wrong crystal frequency to be selected, which puts the switch in a
      nonfunctional state after reset deassertion.
      
      The tests below are conducted on an MT7530 with a 40MHz crystal oscillator
      by Justin Swartz.
      
      With a cable from an active peer connected to port 3 before reset, an
      incorrect crystal frequency (0b11 = 25MHz) is selected:
      
                            [1]                  [3]     [5]
                            :                    :       :
                    _____________________________         __________________
      ESW_P4_LED_0                               |_______|
                    _____________________________
      ESW_P3_LED_0                               |__________________________
      
                             :                  : :     :
                             :                  : [4]...:
                             :                  :
                             [2]................:
      
      [1] Reset is asserted.
      [2] Period of 1000 - 1100 usec.
      [3] Reset is deasserted.
      [4] Period of 315 usec. HWTRAP register is populated with incorrect
          XTAL frequency.
      [5] Signals reflect the bootstrapped configuration.
      
      Increase the delay between reset_control_assert() and
      reset_control_deassert(), and between
      gpiod_set_value_cansleep(priv->reset, 0) and
      gpiod_set_value_cansleep(priv->reset, 1), to 5000 - 5100 usec. This makes
      it more likely that the switch IC will have these pins back in the
      bootstrapping configuration before reset deassertion.
      
      With a cable from an active peer connected to port 3 before reset, the
      correct crystal frequency (0b10 = 40MHz) is selected:
      
                            [1]        [2-1]     [3]     [5]
                            :          :         :       :
                    _____________________________         __________________
      ESW_P4_LED_0                               |_______|
                    ___________________           _______
      ESW_P3_LED_0                     |_________|       |__________________
      
                             :          :       : :     :
                             :          [2-2]...: [4]...:
                             [2]................:
      
      [1] Reset is asserted.
      [2] Period of 5000 - 5100 usec.
      [2-1] ESW_P3_LED_0 goes low.
      [2-2] Remaining period of 5000 - 5100 usec.
      [3] Reset is deasserted.
      [4] Period of 310 usec. HWTRAP register is populated with bootstrapped
          XTAL frequency.
      [5] Signals reflect the bootstrapped configuration.
      
      ESW_P3_LED_0 low period before reset deassertion:
      
                    5000 usec
                  - 5100 usec
          TEST     RESET HOLD
             #         (usec)
        ---------------------
             1           5410
             2           5440
             3           4375
             4           5490
             5           5475
             6           4335
             7           4370
             8           5435
             9           4205
            10           4335
            11           3750
            12           3170
            13           4395
            14           4375
            15           3515
            16           4335
            17           4220
            18           4175
            19           4175
            20           4350
      
           Min           3170
           Max           5490
      
        Median       4342.500
           Avg       4466.500
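The summary rows of the table above can be reproduced from the 20 listed measurements. The sketch below only recomputes the statistics; the list and function names are illustrative.

```python
from statistics import mean, median

# ESW_P3_LED_0 low periods in usec, copied from tests 1-20 above.
RESET_HOLD_USEC = [
    5410, 5440, 4375, 5490, 5475, 4335, 4370, 5435, 4205, 4335,
    3750, 3170, 4395, 4375, 3515, 4335, 4220, 4175, 4175, 4350,
]


def summarize(samples):
    """Return the min, max, median and average of the samples."""
    return {
        "min": min(samples),
        "max": max(samples),
        "median": median(samples),
        "avg": mean(samples),
    }


if __name__ == "__main__":
    for name, value in summarize(RESET_HOLD_USEC).items():
        print(f"{name:>6} {value:>10.3f}")
```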
      
      Revert commit 2920dd92 ("net: dsa: mt7530: disable LEDs before reset").
      Changing the state of pins via reset assertion is simpler and more
      efficient than doing so by setting the LED controller off.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Fixes: c288575f ("net: dsa: mt7530: Add the support of MT7531 switch")
      Co-developed-by: Justin Swartz <justin.swartz@risingedge.co.za>
      Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f490c492
    • David S. Miller's avatar
      Merge branch 'veth-xdp-gro' · ba77f6e2
      David S. Miller authored
      Ignat Korchagin says:
      
      ====================
      net: veth: ability to toggle GRO and XDP independently
      
      It is rather confusing that GRO is automatically enabled when an XDP program
      is attached to a veth interface. Moreover, it is not possible to disable GRO
      on a veth if an XDP program is attached (which might be desirable in some use
      cases).
      
      Make GRO and XDP independent for a veth interface.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba77f6e2
    • Ignat Korchagin's avatar
      selftests: net: veth: test the ability to independently manipulate GRO and XDP · ba5a6476
      Ignat Korchagin authored
      We should be able to independently flip either XDP or GRO states and toggling
      one should not affect the other.
      
      Also adjust other tests that implicitly expected GRO to be enabled
      automatically.
      Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba5a6476
    • Ignat Korchagin's avatar
      net: veth: do not manipulate GRO when using XDP · d7db7775
      Ignat Korchagin authored
      Commit d3256efd ("veth: allow enabling NAPI even without XDP") tried to fix
      the fact that GRO was not possible without XDP, because veth did not use NAPI
      without XDP. However, it also introduced the behaviour that GRO is always
      enabled, when XDP is enabled.
      
      While it might be desired for most cases, it is confusing for the user at best
      as the GRO flag suddenly changes, when an XDP program is attached. It also
      introduces some complexities in state management as was partially addressed in
      commit fe9f8013 ("net: veth: clear GRO when clearing XDP even when down").
      
      But the biggest problem is that it is not possible to disable GRO at all, when
      an XDP program is attached, which might be needed for some use cases.
      
      Fix this by not touching the GRO flag on XDP enable/disable as the code already
      supports switching to NAPI if either GRO or XDP is requested.
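The intended behaviour can be illustrated with a toy state model. This is not the veth driver: the class and method names are hypothetical, and the only rule taken from the text is that NAPI is used when either GRO or XDP is requested, while toggling XDP leaves the GRO flag alone.

```python
from dataclasses import dataclass


@dataclass
class VethModel:
    """Toy model of the independent GRO and XDP flags (hypothetical)."""

    gro: bool = False
    xdp: bool = False

    def napi_needed(self) -> bool:
        # NAPI is required if either feature is requested.
        return self.gro or self.xdp

    def set_xdp(self, enabled: bool) -> None:
        # Attaching or detaching XDP does not touch the GRO flag.
        self.xdp = enabled

    def set_gro(self, enabled: bool) -> None:
        # The user can flip GRO whether or not XDP is attached.
        self.gro = enabled


if __name__ == "__main__":
    dev = VethModel()
    dev.set_xdp(True)
    print(dev.gro, dev.napi_needed())
```

In this model, attaching XDP to a device with GRO off leaves GRO off but still requires NAPI, and GRO can later be switched on or off without detaching the program.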
      
      Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/
      Fixes: d3256efd ("veth: allow enabling NAPI even without XDP")
      Fixes: fe9f8013 ("net: veth: clear GRO when clearing XDP even when down")
      Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7db7775
    • Leon Romanovsky's avatar
      xfrm: Allow UDP encapsulation only in offload modes · 773bb766
      Leon Romanovsky authored
      The missing check of x->encap led to a situation where GSO packets
      were created with UDP encapsulation.
      
      As a solution, restore the encap check for non-offloaded SAs.
      
      Fixes: 983a73da ("xfrm: Pass UDP encapsulation in TX packet offload")
      Closes: https://lore.kernel.org/all/a650221ae500f0c7cf496c61c96c1b103dcb6f67.camel@redhat.com
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      773bb766