1. 12 Jul, 2023 8 commits
  2. 11 Jul, 2023 7 commits
    • Suman Ghosh's avatar
      octeontx2-pf: Add additional check for MCAM rules · 8278ee2a
      Suman Ghosh authored
      Due to hardware limitation, MCAM drop rule with
      ether_type == 802.1Q and vlan_id == 0 is not supported. Hence rejecting
      such rules.
      
      Fixes: dce677da ("octeontx2-pf: Add vlan-etype to ntuple filters")
      Signed-off-by: default avatarSuman Ghosh <sumang@marvell.com>
      Link: https://lore.kernel.org/r/20230710103027.2244139-1-sumang@marvell.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      8278ee2a
    • Lu Hongfei's avatar
      net: dsa: Removed unneeded of_node_put in felix_parse_ports_node · 04499f28
      Lu Hongfei authored
      Remove unnecessary of_node_put from the continue path to prevent
      child node from being released twice, which could avoid resource
      leak or other unexpected issues.
      Signed-off-by: default avatarLu Hongfei <luhongfei@vivo.com>
      Reviewed-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Fixes: de879a01 ("net: dsa: felix: add functionality when not all ports are supported")
      Link: https://lore.kernel.org/r/20230710031859.36784-1-luhongfei@vivo.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      04499f28
    • Paolo Abeni's avatar
      Merge branch 'net-fec-fix-some-issues-of-ndo_xdp_xmit' · c0dbbdf5
      Paolo Abeni authored
      Wei Fang says:
      
      ====================
      net: fec: fix some issues of ndo_xdp_xmit()
      
      We encountered some issues when testing the ndo_xdp_xmit() interface
      of the fec driver on i.MX8MP and i.MX93 platforms. These issues are
      easy to reproduce, and the specific reproduction steps are as follows.
      
      step1: The ethernet port of a board (board A) is connected to the EQOS
      port of i.MX8MP/i.MX93, and the FEC port of i.MX8MP/i.MX93 is connected
      to another ethernet port, such as a switch port.
      
      step2: Board A uses the pktgen_sample03_burst_single_flow.sh to generate
      and send packets to i.MX8MP/i.MX93. The command is shown below.
      ./pktgen_sample03_burst_single_flow.sh -i eth0 -d 192.168.6.8 -m \
      56:bf:0d:68:b0:9e -s 1500
      
      step3: i.MX8MP/i.MX93 use the xdp_redirect bfp program to redirect the
      XDP frames from EQOS port to FEC port. The command is shown below.
      ./xdp_redirect eth1 eth0
      
      After a few moments, the warning or error logs will be printed in the
      console, for more details, please refer to the commit message of each
      patch.
      ====================
      
      Link: https://lore.kernel.org/r/20230706081012.2278063-1-wei.fang@nxp.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      c0dbbdf5
    • Wei Fang's avatar
      net: fec: use netdev_err_once() instead of netdev_err() · 84a10947
      Wei Fang authored
      In the case of heavy XDP traffic to be transmitted, the console
      will print the error log continuously if there are lack of enough
      BDs to accommodate the frames. The log looks like below.
      
      [  160.013112] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.023116] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.028926] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.038946] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.044758] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      
      Not only will this log be replicated and redundant, it will also
      degrade XDP performance. So we use netdev_err_once() instead of
      netdev_err() now.
      
      Fixes: 6d6b39f1 ("net: fec: add initial XDP support")
      Signed-off-by: default avatarWei Fang <wei.fang@nxp.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      84a10947
    • Wei Fang's avatar
      net: fec: increase the size of tx ring and update tx_wake_threshold · 56b3c6ba
      Wei Fang authored
      When the XDP feature is enabled and with heavy XDP frames to be
      transmitted, there is a considerable probability that available
      tx BDs are insufficient. This will lead to some XDP frames to be
      discarded and the "NOT enough BD for SG!" error log will appear
      in the console (as shown below).
      
      [  160.013112] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.023116] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.028926] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.038946] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      [  160.044758] fec 30be0000.ethernet eth0: NOT enough BD for SG!
      
      In the case of heavy XDP traffic, sometimes the speed of recycling
      tx BDs may be slower than the speed of sending XDP frames. There
      may be several specific reasons, such as the interrupt is not
      responsed in time, the efficiency of the NAPI callback function is
      too low due to all the queues (tx queues and rx queues) share the
      same NAPI, and so on.
      
      After trying various methods, I think that increase the size of tx
      BD ring is simple and effective. Maybe the best resolution is that
      allocate NAPI for each queue to improve the efficiency of the NAPI
      callback, but this change is a bit big and I didn't try this method.
      Perheps this method will be implemented in a future patch.
      
      This patch also updates the tx_wake_threshold of tx ring which is
      related to the size of tx ring in the previous logic. Otherwise,
      the tx_wake_threshold will be too high (403 BDs), which is more
      likely to impact the slow path in the case of heavy XDP traffic,
      because XDP path and slow path share the tx BD rings. According
      to Jakub's suggestion, the tx_wake_threshold is at least equal to
      tx_stop_threshold + 2 * MAX_SKB_FRAGS, if a queue of hundreds of
      entries is overflowing, we should be able to apply a hysteresis
      of a few tens of entries.
      
      Fixes: 6d6b39f1 ("net: fec: add initial XDP support")
      Signed-off-by: default avatarWei Fang <wei.fang@nxp.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      56b3c6ba
    • Wei Fang's avatar
      net: fec: recycle pages for transmitted XDP frames · 20f79739
      Wei Fang authored
      Once the XDP frames have been successfully transmitted through the
      ndo_xdp_xmit() interface, it's the driver responsibility to free
      the frames so that the page_pool can recycle the pages and reuse
      them. However, this action is not implemented in the fec driver.
      This leads to a user-visible problem that the console will print
      the following warning log.
      
      [  157.568851] page_pool_release_retry() stalled pool shutdown 1389 inflight 60 sec
      [  217.983446] page_pool_release_retry() stalled pool shutdown 1389 inflight 120 sec
      [  278.399006] page_pool_release_retry() stalled pool shutdown 1389 inflight 181 sec
      [  338.812885] page_pool_release_retry() stalled pool shutdown 1389 inflight 241 sec
      [  399.226946] page_pool_release_retry() stalled pool shutdown 1389 inflight 302 sec
      
      Therefore, to solve this issue, we free XDP frames via xdp_return_frame()
      while cleaning the tx BD ring.
      
      Fixes: 6d6b39f1 ("net: fec: add initial XDP support")
      Signed-off-by: default avatarWei Fang <wei.fang@nxp.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      20f79739
    • Wei Fang's avatar
      net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP · be7ecbe7
      Wei Fang authored
      When a XDP program is installed or uninstalled, fec_restart() will
      be invoked to reset MAC and buffer descriptor rings. It's reasonable
      not to transmit any packet during the process of reset. However, the
      NETDEV_XDP_ACT_NDO_XMIT bit of xdp_features is enabled by default,
      that is to say, it's possible that the fec_enet_xdp_xmit() will be
      invoked even if the process of reset is not finished. In this case,
      the redirected XDP frames might be dropped and available transmit BDs
      may be incorrectly deemed insufficient. So this patch disable the
      NETDEV_XDP_ACT_NDO_XMIT feature by default and dynamically configure
      this feature when the bpf program is installed or uninstalled.
      
      Fixes: e4ac7cc6 ("net: fec: turn on XDP features")
      Signed-off-by: default avatarWei Fang <wei.fang@nxp.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      be7ecbe7
  3. 10 Jul, 2023 9 commits
    • Florian Kauer's avatar
      igc: Fix inserting of empty frame for launchtime · 0bcc6285
      Florian Kauer authored
      The insertion of an empty frame was introduced with
      commit db0b124f ("igc: Enhance Qbv scheduling by using first flag bit")
      in order to ensure that the current cycle has at least one packet if
      there is some packet to be scheduled for the next cycle.
      
      However, the current implementation does not properly check if
      a packet is already scheduled for the current cycle. Currently,
      an empty packet is always inserted if and only if
      txtime >= end_of_cycle && txtime > last_tx_cycle
      but since last_tx_cycle is always either the end of the current
      cycle (end_of_cycle) or the end of a previous cycle, the
      second part (txtime > last_tx_cycle) is always true unless
      txtime == last_tx_cycle.
      
      What actually needs to be checked here is if the last_tx_cycle
      was already written within the current cycle, so an empty frame
      should only be inserted if and only if
      txtime >= end_of_cycle && end_of_cycle > last_tx_cycle.
      
      This patch does not only avoid an unnecessary insertion, but it
      can actually be harmful to insert an empty packet if packets
      are already scheduled in the current cycle, because it can lead
      to a situation where the empty packet is actually processed
      as the first packet in the upcoming cycle shifting the packet
      with the first_flag even one cycle into the future, finally leading
      to a TX hang.
      
      The TX hang can be reproduced on a i225 with:
      
          sudo tc qdisc replace dev enp1s0 parent root handle 100 taprio \
      	    num_tc 1 \
      	    map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
      	    queues 1@0 \
      	    base-time 0 \
      	    sched-entry S 01 300000 \
      	    flags 0x1 \
      	    txtime-delay 500000 \
      	    clockid CLOCK_TAI
          sudo tc qdisc replace dev enp1s0 parent 100:1 etf \
      	    clockid CLOCK_TAI \
      	    delta 500000 \
      	    offload \
      	    skip_sock_check
      
      and traffic generator
      
          sudo trafgen -i traffic.cfg -o enp1s0 --cpp -n0 -q -t1400ns
      
      with traffic.cfg
      
          #define ETH_P_IP        0x0800
      
          {
            /* Ethernet Header */
            0x30, 0x1f, 0x9a, 0xd0, 0xf0, 0x0e,  # MAC Dest - adapt as needed
            0x24, 0x5e, 0xbe, 0x57, 0x2e, 0x36,  # MAC Src  - adapt as needed
            const16(ETH_P_IP),
      
            /* IPv4 Header */
            0b01000101, 0,   # IPv4 version, IHL, TOS
            const16(1028),   # IPv4 total length (UDP length + 20 bytes (IP header))
            const16(2),      # IPv4 ident
            0b01000000, 0,   # IPv4 flags, fragmentation off
            64,              # IPv4 TTL
            17,              # Protocol UDP
            csumip(14, 33),  # IPv4 checksum
      
            /* UDP Header */
            10,  0, 48, 1,   # IP Src - adapt as needed
            10,  0, 48, 10,  # IP Dest - adapt as needed
            const16(5555),   # UDP Src Port
            const16(6666),   # UDP Dest Port
            const16(1008),   # UDP length (UDP header 8 bytes + payload length)
            csumudp(14, 34), # UDP checksum
      
            /* Payload */
            fill('W', 1000),
          }
      
      and the observed message with that is for example
      
       igc 0000:01:00.0 enp1s0: Detected Tx Unit Hang
         Tx Queue             <0>
         TDH                  <32>
         TDT                  <3c>
         next_to_use          <3c>
         next_to_clean        <32>
       buffer_info[next_to_clean]
         time_stamp           <ffff26a8>
         next_to_watch        <00000000632a1828>
         jiffies              <ffff27f8>
         desc.status          <1048000>
      
      Fixes: db0b124f ("igc: Enhance Qbv scheduling by using first flag bit")
      Signed-off-by: default avatarFlorian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Kurt Kanzenbach's avatarKurt Kanzenbach <kurt@linutronix.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      0bcc6285
    • Florian Kauer's avatar
      igc: Fix launchtime before start of cycle · c1bca9ac
      Florian Kauer authored
      It is possible (verified on a running system) that frames are processed
      by igc_tx_launchtime with a txtime before the start of the cycle
      (baset_est).
      
      However, the result of txtime - baset_est is written into a u32,
      leading to a wrap around to a positive number. The following
      launchtime > 0 check will only branch to executing launchtime = 0
      if launchtime is already 0.
      
      Fix it by using a s32 before checking launchtime > 0.
      
      Fixes: db0b124f ("igc: Enhance Qbv scheduling by using first flag bit")
      Signed-off-by: default avatarFlorian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Kurt Kanzenbach's avatarKurt Kanzenbach <kurt@linutronix.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      c1bca9ac
    • Florian Kauer's avatar
      igc: No strict mode in pure launchtime/CBS offload · 8b86f10a
      Florian Kauer authored
      The flags IGC_TXQCTL_STRICT_CYCLE and IGC_TXQCTL_STRICT_END
      prevent the packet transmission over slot and cycle boundaries.
      This is important for taprio offload where the slots and
      cycles correspond to the slots and cycles configured for the
      network.
      
      However, the Qbv offload feature of the i225 is also used for
      enabling TX launchtime / ETF offload. In that case, however,
      the cycle has no meaning for the network and is only used
      internally to adapt the base time register after a second has
      passed.
      
      Enabling strict mode in this case would unnecessarily prevent
      the transmission of certain packets (i.e. at the boundary of a
      second) and thus interferes with the ETF qdisc that promises
      transmission at a certain point in time.
      
      Similar to ETF, this also applies to CBS offload that also should
      not be influenced by strict mode unless taprio offload would be
      enabled at the same time.
      
      This fully reverts
      commit d8f45be0 ("igc: Use strict cycles for Qbv scheduling")
      but its commit message only describes what was already implemented
      before that commit. The difference to a plain revert of that commit
      is that it now copes with the base_time = 0 case that was fixed with
      commit e17090eb ("igc: allow BaseTime 0 enrollment for Qbv")
      
      In particular, enabling strict mode leads to TX hang situations
      under high traffic if taprio is applied WITHOUT taprio offload
      but WITH ETF offload, e.g. as in
      
          sudo tc qdisc replace dev enp1s0 parent root handle 100 taprio \
      	    num_tc 1 \
      	    map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
      	    queues 1@0 \
      	    base-time 0 \
      	    sched-entry S 01 300000 \
      	    flags 0x1 \
      	    txtime-delay 500000 \
      	    clockid CLOCK_TAI
          sudo tc qdisc replace dev enp1s0 parent 100:1 etf \
      	    clockid CLOCK_TAI \
      	    delta 500000 \
      	    offload \
      	    skip_sock_check
      
      and traffic generator
      
          sudo trafgen -i traffic.cfg -o enp1s0 --cpp -n0 -q -t1400ns
      
      with traffic.cfg
      
          #define ETH_P_IP        0x0800
      
          {
            /* Ethernet Header */
            0x30, 0x1f, 0x9a, 0xd0, 0xf0, 0x0e,  # MAC Dest - adapt as needed
            0x24, 0x5e, 0xbe, 0x57, 0x2e, 0x36,  # MAC Src  - adapt as needed
            const16(ETH_P_IP),
      
            /* IPv4 Header */
            0b01000101, 0,   # IPv4 version, IHL, TOS
            const16(1028),   # IPv4 total length (UDP length + 20 bytes (IP header))
            const16(2),      # IPv4 ident
            0b01000000, 0,   # IPv4 flags, fragmentation off
            64,              # IPv4 TTL
            17,              # Protocol UDP
            csumip(14, 33),  # IPv4 checksum
      
            /* UDP Header */
            10,  0, 48, 1,   # IP Src - adapt as needed
            10,  0, 48, 10,  # IP Dest - adapt as needed
            const16(5555),   # UDP Src Port
            const16(6666),   # UDP Dest Port
            const16(1008),   # UDP length (UDP header 8 bytes + payload length)
            csumudp(14, 34), # UDP checksum
      
            /* Payload */
            fill('W', 1000),
          }
      
      and the observed message with that is for example
      
       igc 0000:01:00.0 enp1s0: Detected Tx Unit Hang
         Tx Queue             <0>
         TDH                  <d0>
         TDT                  <f0>
         next_to_use          <f0>
         next_to_clean        <d0>
       buffer_info[next_to_clean]
         time_stamp           <ffff661f>
         next_to_watch        <00000000245a4efb>
         jiffies              <ffff6e48>
         desc.status          <1048000>
      
      Fixes: d8f45be0 ("igc: Use strict cycles for Qbv scheduling")
      Signed-off-by: default avatarFlorian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Kurt Kanzenbach's avatarKurt Kanzenbach <kurt@linutronix.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      8b86f10a
    • Florian Kauer's avatar
      igc: Handle already enabled taprio offload for basetime 0 · e5d88c53
      Florian Kauer authored
      Since commit e17090eb ("igc: allow BaseTime 0 enrollment for Qbv")
      it is possible to enable taprio offload with a basetime of 0.
      However, the check if taprio offload is already enabled (and thus -EALREADY
      should be returned for igc_save_qbv_schedule) still relied on
      adapter->base_time > 0.
      
      This can be reproduced as follows:
      
          # TAPRIO offload (flags == 0x2) and base-time = 0
          sudo tc qdisc replace dev enp1s0 parent root handle 100 stab overhead 24 taprio \
      	    num_tc 1 \
      	    map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
      	    queues 1@0 \
      	    base-time 0 \
      	    sched-entry S 01 300000 \
      	    flags 0x2
      
          # The second call should fail with "Error: Device failed to setup taprio offload."
          # But that only happens if base-time was != 0
          sudo tc qdisc replace dev enp1s0 parent root handle 100 stab overhead 24 taprio \
      	    num_tc 1 \
      	    map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
      	    queues 1@0 \
      	    base-time 0 \
      	    sched-entry S 01 300000 \
      	    flags 0x2
      
      Fixes: e17090eb ("igc: allow BaseTime 0 enrollment for Qbv")
      Signed-off-by: default avatarFlorian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Kurt Kanzenbach's avatarKurt Kanzenbach <kurt@linutronix.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      e5d88c53
    • Florian Kauer's avatar
      igc: Do not enable taprio offload for invalid arguments · 82ff5f29
      Florian Kauer authored
      Only set adapter->taprio_offload_enable after validating the arguments.
      Otherwise, it stays set even if the offload was not enabled.
      Since the subsequent code does not get executed in case of invalid
      arguments, it will not be read at first.
      However, by activating and then deactivating another offload
      (e.g. ETF/TX launchtime offload), taprio_offload_enable is read
      and erroneously keeps the offload feature of the NIC enabled.
      
      This can be reproduced as follows:
      
          # TAPRIO offload (flags == 0x2) and negative base-time leading to expected -ERANGE
          sudo tc qdisc replace dev enp1s0 parent root handle 100 stab overhead 24 taprio \
      	    num_tc 1 \
      	    map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
      	    queues 1@0 \
      	    base-time -1000 \
      	    sched-entry S 01 300000 \
      	    flags 0x2
      
          # IGC_TQAVCTRL is 0x0 as expected (iomem=relaxed for reading register)
          sudo pcimem /sys/bus/pci/devices/0000:01:00.0/resource0 0x3570 w*1
      
          # Activate ETF offload
          sudo tc qdisc replace dev enp1s0 parent root handle 6666 mqprio \
      	    num_tc 3 \
      	    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      	    queues 1@0 1@1 2@2 \
      	    hw 0
          sudo tc qdisc add dev enp1s0 parent 6666:1 etf \
      	    clockid CLOCK_TAI \
      	    delta 500000 \
      	    offload
      
          # IGC_TQAVCTRL is 0x9 as expected
          sudo pcimem /sys/bus/pci/devices/0000:01:00.0/resource0 0x3570 w*1
      
          # Deactivate ETF offload again
          sudo tc qdisc delete dev enp1s0 parent 6666:1
      
          # IGC_TQAVCTRL should now be 0x0 again, but is observed as 0x9
          sudo pcimem /sys/bus/pci/devices/0000:01:00.0/resource0 0x3570 w*1
      
      Fixes: e17090eb ("igc: allow BaseTime 0 enrollment for Qbv")
      Signed-off-by: default avatarFlorian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Kurt Kanzenbach's avatarKurt Kanzenbach <kurt@linutronix.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      82ff5f29
    • Florian Kauer's avatar
      igc: Rename qbv_enable to taprio_offload_enable · 8046063d
      Florian Kauer authored
      In the current implementation the flags adapter->qbv_enable
      and IGC_FLAG_TSN_QBV_ENABLED have a similar name, but do not
      have the same meaning. The first one is used only to indicate
      taprio offload (i.e. when igc_save_qbv_schedule was called),
      while the second one corresponds to the Qbv mode of the hardware.
      However, the second one is also used to support the TX launchtime
      feature, i.e. ETF qdisc offload. This leads to situations where
      adapter->qbv_enable is false, but the flag IGC_FLAG_TSN_QBV_ENABLED
      is set. This is prone to confusion.
      
      The rename should reduce this confusion. Since it is a pure
      rename, it has no impact on functionality.
      
      Fixes: e17090eb ("igc: allow BaseTime 0 enrollment for Qbv")
      Signed-off-by: default avatarFlorian Kauer <florian.kauer@linutronix.de>
      Reviewed-by: Kurt Kanzenbach's avatarKurt Kanzenbach <kurt@linutronix.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      8046063d
    • Junfeng Guo's avatar
      gve: unify driver name usage · 9d0aba98
      Junfeng Guo authored
      Current codebase contained the usage of two different names for this
      driver (i.e., `gvnic` and `gve`), which is quite unfriendly for users
      to use, especially when trying to bind or unbind the driver manually.
      The corresponding kernel module is registered with the name of `gve`.
      It's more reasonable to align the name of the driver with the module.
      
      Fixes: 893ce44d ("gve: Add basic driver framework for Compute Engine Virtual NIC")
      Cc: csully@google.com
      Signed-off-by: default avatarJunfeng Guo <junfeng.guo@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9d0aba98
    • Azeem Shaikh's avatar
      net: sched: Replace strlcpy with strscpy · 989b52cd
      Azeem Shaikh authored
      strlcpy() reads the entire source buffer first.
      This read may exceed the destination size limit.
      This is both inefficient and can lead to linear read
      overflows if a source string is not NUL-terminated [1].
      In an effort to remove strlcpy() completely [2], replace
      strlcpy() here with strscpy().
      
      Direct replacement is safe here since return value of -errno
      is used to check for truncation instead of sizeof(dest).
      
      [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
      [2] https://github.com/KSPP/linux/issues/89Signed-off-by: default avatarAzeem Shaikh <azeemshaikh38@gmail.com>
      Reviewed-by: default avatarPavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      989b52cd
    • Jiasheng Jiang's avatar
      net: dsa: qca8k: Add check for skb_copy · 87355b7c
      Jiasheng Jiang authored
      Add check for the return value of skb_copy in order to avoid NULL pointer
      dereference.
      
      Fixes: 2cd54856 ("net: dsa: qca8k: add support for phy read/write with mgmt Ethernet")
      Signed-off-by: default avatarJiasheng Jiang <jiasheng@iscas.ac.cn>
      Reviewed-by: default avatarPavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      87355b7c
  4. 09 Jul, 2023 2 commits
  5. 08 Jul, 2023 8 commits
    • Eric Dumazet's avatar
      udp6: fix udp6_ehashfn() typo · 51d03e2f
      Eric Dumazet authored
      Amit Klein reported that udp6_ehash_secret was initialized but never used.
      
      Fixes: 1bbdceef ("inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once")
      Reported-by: default avatarAmit Klein <aksecurity@gmail.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      51d03e2f
    • Kuniyuki Iwashima's avatar
      icmp6: Fix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev(). · 2aaa8a15
      Kuniyuki Iwashima authored
      With some IPv6 Ext Hdr (RPL, SRv6, etc.), we can send a packet that
      has the link-local address as src and dst IP and will be forwarded to
      an external IP in the IPv6 Ext Hdr.
      
      For example, the script below generates a packet whose src IP is the
      link-local address and dst is updated to 11::.
      
        # for f in $(find /proc/sys/net/ -name *seg6_enabled*); do echo 1 > $f; done
        # python3
        >>> from socket import *
        >>> from scapy.all import *
        >>>
        >>> SRC_ADDR = DST_ADDR = "fe80::5054:ff:fe12:3456"
        >>>
        >>> pkt = IPv6(src=SRC_ADDR, dst=DST_ADDR)
        >>> pkt /= IPv6ExtHdrSegmentRouting(type=4, addresses=["11::", "22::"], segleft=1)
        >>>
        >>> sk = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW)
        >>> sk.sendto(bytes(pkt), (DST_ADDR, 0))
      
      For such a packet, we call ip6_route_input() to look up a route for the
      next destination in these three functions depending on the header type.
      
        * ipv6_rthdr_rcv()
        * ipv6_rpl_srh_rcv()
        * ipv6_srh_rcv()
      
      If no route is found, ip6_null_entry is set to skb, and the following
      dst_input(skb) calls ip6_pkt_drop().
      
      Finally, in icmp6_dev(), we dereference skb_rt6_info(skb)->rt6i_idev->dev
      as the input device is the loopback interface.  Then, we have to check if
      skb_rt6_info(skb)->rt6i_idev is NULL or not to avoid NULL pointer deref
      for ip6_null_entry.
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
       PF: supervisor read access in kernel mode
       PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 0 PID: 157 Comm: python3 Not tainted 6.4.0-11996-gb121d614371c #35
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:icmp6_send (net/ipv6/icmp.c:436 net/ipv6/icmp.c:503)
      Code: fe ff ff 48 c7 40 30 c0 86 5d 83 e8 c6 44 1c 00 e9 c8 fc ff ff 49 8b 46 58 48 83 e0 fe 0f 84 4a fb ff ff 48 8b 80 d0 00 00 00 <48> 8b 00 44 8b 88 e0 00 00 00 e9 34 fb ff ff 4d 85 ed 0f 85 69 01
      RSP: 0018:ffffc90000003c70 EFLAGS: 00000286
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000000000e0
      RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffff888006d72a18
      RBP: ffffc90000003d80 R08: 0000000000000000 R09: 0000000000000001
      R10: ffffc90000003d98 R11: 0000000000000040 R12: ffff888006d72a10
      R13: 0000000000000000 R14: ffff8880057fb800 R15: ffffffff835d86c0
      FS:  00007f9dc72ee740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 00000000057b2000 CR4: 00000000007506f0
      PKRU: 55555554
      Call Trace:
       <IRQ>
       ip6_pkt_drop (net/ipv6/route.c:4513)
       ipv6_rthdr_rcv (net/ipv6/exthdrs.c:640 net/ipv6/exthdrs.c:686)
       ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5))
       ip6_input_finish (./include/linux/rcupdate.h:781 net/ipv6/ip6_input.c:483)
       __netif_receive_skb_one_core (net/core/dev.c:5455)
       process_backlog (./include/linux/rcupdate.h:781 net/core/dev.c:5895)
       __napi_poll (net/core/dev.c:6460)
       net_rx_action (net/core/dev.c:6529 net/core/dev.c:6660)
       __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
       do_softirq (kernel/softirq.c:454 kernel/softirq.c:441)
       </IRQ>
       <TASK>
       __local_bh_enable_ip (kernel/softirq.c:381)
       __dev_queue_xmit (net/core/dev.c:4231)
       ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:135)
       rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914)
       sock_sendmsg (net/socket.c:725 net/socket.c:748)
       __sys_sendto (net/socket.c:2134)
       __x64_sys_sendto (net/socket.c:2146 net/socket.c:2142 net/socket.c:2142)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7f9dc751baea
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      RSP: 002b:00007ffe98712c38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007ffe98712cf8 RCX: 00007f9dc751baea
      RDX: 0000000000000060 RSI: 00007f9dc6460b90 RDI: 0000000000000003
      RBP: 00007f9dc56e8be0 R08: 00007ffe98712d70 R09: 000000000000001c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f9dc6af5d1b
       </TASK>
      Modules linked in:
      CR2: 0000000000000000
       ---[ end trace 0000000000000000 ]---
      RIP: 0010:icmp6_send (net/ipv6/icmp.c:436 net/ipv6/icmp.c:503)
      Code: fe ff ff 48 c7 40 30 c0 86 5d 83 e8 c6 44 1c 00 e9 c8 fc ff ff 49 8b 46 58 48 83 e0 fe 0f 84 4a fb ff ff 48 8b 80 d0 00 00 00 <48> 8b 00 44 8b 88 e0 00 00 00 e9 34 fb ff ff 4d 85 ed 0f 85 69 01
      RSP: 0018:ffffc90000003c70 EFLAGS: 00000286
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000000000e0
      RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffff888006d72a18
      RBP: ffffc90000003d80 R08: 0000000000000000 R09: 0000000000000001
      R10: ffffc90000003d98 R11: 0000000000000040 R12: ffff888006d72a10
      R13: 0000000000000000 R14: ffff8880057fb800 R15: ffffffff835d86c0
      FS:  00007f9dc72ee740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 00000000057b2000 CR4: 00000000007506f0
      PKRU: 55555554
      Kernel panic - not syncing: Fatal exception in interrupt
      Kernel Offset: disabled
      
      Fixes: 4832c30d ("net: ipv6: put host and anycast routes on device with address")
      Reported-by: default avatarWang Yufen <wangyufen@huawei.com>
      Closes: https://lore.kernel.org/netdev/c41403a9-c2f6-3b7e-0c96-e1901e605cd0@huawei.com/Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2aaa8a15
    • David S. Miller's avatar
      Merge branch 's390-ism-fixes' · bbffab69
      David S. Miller authored
      Niklas Schnelle says:
      
      ====================
      s390/ism: Fixes to client handling
      
      This is v2 of the patch previously titled "s390/ism: Detangle ISM client
      IRQ and event forwarding". As suggested by Paolo Abeni I split the patch
      up. While doing so I noticed another problem that was fixed by this patch
      concerning the way the workqueues access the client structs. This means the
      second patch turning the workqueues into simple direct calls also fixes
      a problem. Finally I split off a third patch just for fixing
      ism_unregister_client()s error path.
      
      The code after these 3 patches is identical to the result of the v1 patch
      except that I also turned the dev_err() for still registered DMBs into
      a WARN().
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bbffab69
    • Niklas Schnelle's avatar
      s390/ism: Do not unregister clients with registered DMBs · 266deeea
      Niklas Schnelle authored
      When ism_unregister_client() is called but the client still has DMBs
      registered it returns -EBUSY and prints an error. This only happens
      after the client has already been unregistered however. This is
      unexpected as the unregister claims to have failed. Furthermore as this
      implies a client bug a WARN() is more appropriate. Thus move the
      deregistration after the check and use WARN().
      
      Fixes: 89e7d2ba ("net/ism: Add new API for client registration")
      Signed-off-by: default avatarNiklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      266deeea
    • Niklas Schnelle's avatar
      s390/ism: Fix and simplify add()/remove() callback handling · 76631ffa
      Niklas Schnelle authored
      Previously the clients_lock was protecting the clients array against
      concurrent addition/removal of clients but was also accessed from IRQ
      context. This meant that it had to be a spinlock and that the add() and
      remove() callbacks in which clients need to do allocation and take
      mutexes can't be called under the clients_lock. To work around this these
      callbacks were moved to workqueues. This not only introduced significant
      complexity but is also subtly broken in at least one way.
      
      In ism_dev_init() and ism_dev_exit() clients[i]->tgt_ism is used to
      communicate the added/removed ISM device to the work function. While
      write access to client[i]->tgt_ism is protected by the clients_lock and
      the code waits that there is no pending add/remove work before and after
      setting clients[i]->tgt_ism this is not enough. The problem is that the
      wait happens based on per ISM device counters. Thus a concurrent
      ism_dev_init()/ism_dev_exit() for a different ISM device may overwrite
      a clients[i]->tgt_ism between unlocking the clients_lock and the
      subsequent wait for the work to finnish.
      
      Thankfully with the clients_lock no longer held in IRQ context it can be
      turned into a mutex which can be held during the calls to add()/remove()
      completely removing the need for the workqueues and the associated
      broken housekeeping including the per ISM device counters and the
      clients[i]->tgt_ism.
      
      Fixes: 89e7d2ba ("net/ism: Add new API for client registration")
      Signed-off-by: default avatarNiklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      76631ffa
    • Niklas Schnelle's avatar
      s390/ism: Fix locking for forwarding of IRQs and events to clients · 6b5c13b5
      Niklas Schnelle authored
      The clients array references all registered clients and is protected by
      the clients_lock. Besides its use as general list of clients the clients
      array is accessed in ism_handle_irq() to forward ISM device events to
      clients.
      
      While the clients_lock is taken in the IRQ handler when calling
      handle_event() it is however incorrectly not held during the
      client->handle_irq() call and for the preceding clients[] access leaving
      it unprotected against concurrent client (un-)registration.
      
      Furthermore the accesses to ism->sba_client_arr[] in ism_register_dmb()
      and ism_unregister_dmb() are not protected by any lock. This is
      especially problematic as the client ID from the ism->sba_client_arr[]
      is not checked against NO_CLIENT and neither is the client pointer
      checked.
      
      Instead of expanding the use of the clients_lock further add a separate
      array in struct ism_dev which references clients subscribed to the
      device's events and IRQs. This array is protected by ism->lock which is
      already taken in ism_handle_irq() and can be taken outside the IRQ
      handler when adding/removing subscribers or the accessing
      ism->sba_client_arr[]. This also means that the clients_lock is no
      longer taken in IRQ context.
      
      Fixes: 89e7d2ba ("net/ism: Add new API for client registration")
      Signed-off-by: default avatarNiklas Schnelle <schnelle@linux.ibm.com>
      Reviewed-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6b5c13b5
    • Paolo Abeni's avatar
      net: prevent skb corruption on frag list segmentation · c329b261
      Paolo Abeni authored
      Ian reported several skb corruptions triggered by rx-gro-list,
      collecting different oops alike:
      
      [   62.624003] BUG: kernel NULL pointer dereference, address: 00000000000000c0
      [   62.631083] #PF: supervisor read access in kernel mode
      [   62.636312] #PF: error_code(0x0000) - not-present page
      [   62.641541] PGD 0 P4D 0
      [   62.644174] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [   62.648629] CPU: 1 PID: 913 Comm: napi/eno2-79 Not tainted 6.4.0 #364
      [   62.655162] Hardware name: Supermicro Super Server/A2SDi-12C-HLN4F, BIOS 1.7a 10/13/2022
      [   62.663344] RIP: 0010:__udp_gso_segment (./include/linux/skbuff.h:2858
      ./include/linux/udp.h:23 net/ipv4/udp_offload.c:228 net/ipv4/udp_offload.c:261
      net/ipv4/udp_offload.c:277)
      [   62.687193] RSP: 0018:ffffbd3a83b4f868 EFLAGS: 00010246
      [   62.692515] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000
      [   62.699743] RDX: ffffa124def8a000 RSI: 0000000000000079 RDI: ffffa125952a14d4
      [   62.706970] RBP: ffffa124def8a000 R08: 0000000000000022 R09: 00002000001558c9
      [   62.714199] R10: 0000000000000000 R11: 00000000be554639 R12: 00000000000000e2
      [   62.721426] R13: ffffa125952a1400 R14: ffffa125952a1400 R15: 00002000001558c9
      [   62.728654] FS:  0000000000000000(0000) GS:ffffa127efa40000(0000)
      knlGS:0000000000000000
      [   62.736852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   62.742702] CR2: 00000000000000c0 CR3: 00000001034b0000 CR4: 00000000003526e0
      [   62.749948] Call Trace:
      [   62.752498]  <TASK>
      [   62.779267] inet_gso_segment (net/ipv4/af_inet.c:1398)
      [   62.787605] skb_mac_gso_segment (net/core/gro.c:141)
      [   62.791906] __skb_gso_segment (net/core/dev.c:3403 (discriminator 2))
      [   62.800492] validate_xmit_skb (./include/linux/netdevice.h:4862
      net/core/dev.c:3659)
      [   62.804695] validate_xmit_skb_list (net/core/dev.c:3710)
      [   62.809158] sch_direct_xmit (net/sched/sch_generic.c:330)
      [   62.813198] __dev_queue_xmit (net/core/dev.c:3805 net/core/dev.c:4210)
      net/netfilter/core.c:626)
      [   62.821093] br_dev_queue_push_xmit (net/bridge/br_forward.c:55)
      [   62.825652] maybe_deliver (net/bridge/br_forward.c:193)
      [   62.829420] br_flood (net/bridge/br_forward.c:233)
      [   62.832758] br_handle_frame_finish (net/bridge/br_input.c:215)
      [   62.837403] br_handle_frame (net/bridge/br_input.c:298
      net/bridge/br_input.c:416)
      [   62.851417] __netif_receive_skb_core.constprop.0 (net/core/dev.c:5387)
      [   62.866114] __netif_receive_skb_list_core (net/core/dev.c:5570)
      [   62.871367] netif_receive_skb_list_internal (net/core/dev.c:5638
      net/core/dev.c:5727)
      [   62.876795] napi_complete_done (./include/linux/list.h:37
      ./include/net/gro.h:434 ./include/net/gro.h:429 net/core/dev.c:6067)
      [   62.881004] ixgbe_poll (drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3191)
      [   62.893534] __napi_poll (net/core/dev.c:6498)
      [   62.897133] napi_threaded_poll (./include/linux/netpoll.h:89
      net/core/dev.c:6640)
      [   62.905276] kthread (kernel/kthread.c:379)
      [   62.913435] ret_from_fork (arch/x86/entry/entry_64.S:314)
      [   62.917119]  </TASK>
      
      In the critical scenario, rx-gro-list GRO-ed packets are fed, via a
      bridge, both to the local input path and to an egress device (tun).
      
      The segmentation of such packets unsafely writes to the cloned skbs
      with shared heads.
      
      This change addresses the issue by uncloning as needed the
      to-be-segmented skbs.
      Reported-by: default avatarIan Kumlien <ian.kumlien@gmail.com>
      Tested-by: default avatarIan Kumlien <ian.kumlien@gmail.com>
      Fixes: 3a1296a3 ("net: Support GRO/GSO fraglist chaining.")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c329b261
    • Rafał Miłecki's avatar
      net: bgmac: postpone turning IRQs off to avoid SoC hangs · e7731194
      Rafał Miłecki authored
      Turning IRQs off is done by accessing Ethernet controller registers.
      That can't be done until device's clock is enabled. It results in a SoC
      hang otherwise.
      
      This bug remained unnoticed for years as most bootloaders keep all
      Ethernet interfaces turned on. It seems to only affect a niche SoC
      family BCM47189. It has two Ethernet controllers but CFE bootloader uses
      only the first one.
      
      Fixes: 34322615 ("net: bgmac: Mask interrupts during probe")
      Signed-off-by: default avatarRafał Miłecki <rafal@milecki.pl>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7731194
  6. 07 Jul, 2023 6 commits