1. 15 Aug, 2023 13 commits
    • Xin Long's avatar
      netfilter: set default timeout to 3 secs for sctp shutdown send and recv state · 9bfab6d2
      Xin Long authored
      In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and
      SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout
      value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300
      msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state.
      
      As Paolo Valerio noticed, this might cause unwanted expiration of the ct
      entry. In my test, with 1s tc netem delay set on the NAT path, after the
      SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND
      state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is
      sent back from the peer, the sctp ct entry has expired and been deleted,
      and then the SHUTDOWN_ACK has to be dropped.
      
      Also, it is confusing these two sysctl options always show 0 due to all
      timeout values using sec as unit:
      
        net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0
        net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0
      
      This patch fixes it by also using 3 secs for sctp shutdown send and recv
      state in sctp conntrack, which is also RTO.initial value in SCTP protocol.
      
      Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV
      was probably used for a rare scenario where SHUTDOWN is sent on 1st path
      but SHUTDOWN_ACK is replied on 2nd path, then a new connection started
      immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV
      to CLOSE when receiving INIT in the ORIGINAL direction.
      
      Fixes: 9fb9cbb1 ("[NETFILTER]: Add nf_conntrack subsystem.")
      Reported-by: default avatarPaolo Valerio <pvalerio@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      9bfab6d2
    • Florian Westphal's avatar
      netfilter: nf_tables: don't fail inserts if duplicate has expired · 7845914f
      Florian Westphal authored
      nftables selftests fail:
      run-tests.sh testcases/sets/0044interval_overlap_0
      Expected: 0-2 . 0-3, got:
      W: [FAILED]     ./testcases/sets/0044interval_overlap_0: got 1
      
      Insertion must ignore duplicate but expired entries.
      
      Moreover, there is a strange asymmetry in nft_pipapo_activate:
      
      It refetches the current element, whereas the other ->activate callbacks
      (bitmap, hash, rhash, rbtree) use elem->priv.
      Same for .remove: other set implementations take elem->priv,
      nft_pipapo_remove fetches elem->priv, then does a relookup,
      remove this.
      
      I suspect this was the reason for the change that prompted the
      removal of the expired check in pipapo_get() in the first place,
      but skipping exired elements there makes no sense to me, this helper
      is used for normal get requests, insertions (duplicate check)
      and deactivate callback.
      
      In first two cases expired elements must be skipped.
      
      For ->deactivate(), this gets called for DELSETELEM, so it
      seems to me that expired elements should be skipped as well, i.e.
      delete request should fail with -ENOENT error.
      
      Fixes: 24138933 ("netfilter: nf_tables: don't skip expired elements during walk")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      7845914f
    • Florian Westphal's avatar
      netfilter: nf_tables: deactivate catchall elements in next generation · 90e5b346
      Florian Westphal authored
      When flushing, individual set elements are disabled in the next
      generation via the ->flush callback.
      
      Catchall elements are not disabled.  This is incorrect and may lead to
      double-deactivations of catchall elements which then results in memory
      leaks:
      
      WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730
      CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60
      RIP: 0010:nft_map_deactivate+0x549/0x730
       [..]
       ? nft_map_deactivate+0x549/0x730
       nf_tables_delset+0xb66/0xeb0
      
      (the warn is due to nft_use_dec() detecting underflow).
      
      Fixes: aaa31047 ("netfilter: nftables: add catch-all set element support")
      Reported-by: default avatarlonial con <kongln9170@gmail.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      90e5b346
    • Florian Westphal's avatar
      netfilter: nf_tables: fix kdoc warnings after gc rework · 08713cb0
      Florian Westphal authored
      Jakub Kicinski says:
        We've got some new kdoc warnings here:
        net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc'
        net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc'
        include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set'
      
      Fixes: 5f68718b ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
      Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
      Reported-by: default avatarJakub Kicinski <kuba@kernel.org>
      Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      08713cb0
    • Florian Westphal's avatar
      netfilter: nf_tables: fix false-positive lockdep splat · b9f052dc
      Florian Westphal authored
      ->abort invocation may cause splat on debug kernels:
      
      WARNING: suspicious RCU usage
      net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage!
      [..]
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid
      [..]
       lockdep_rcu_suspicious+0x1ad/0x260
       nft_pipapo_abort+0x145/0x180
       __nf_tables_abort+0x5359/0x63d0
       nf_tables_abort+0x24/0x40
       nfnetlink_rcv+0x1a0a/0x22c0
       netlink_unicast+0x73c/0x900
       netlink_sendmsg+0x7f0/0xc20
       ____sys_sendmsg+0x48d/0x760
      
      Transaction mutex is held, so parallel updates are not possible.
      Switch to _protected and check mutex is held for lockdep enabled builds.
      
      Fixes: 212ed75d ("netfilter: nf_tables: integrate pipapo into commit protocol")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      b9f052dc
    • Jason Xing's avatar
      net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled · e4dd0d3a
      Jason Xing authored
      In the real workload, I encountered an issue which could cause the RTO
      timer to retransmit the skb per 1ms with linear option enabled. The amount
      of lost-retransmitted skbs can go up to 1000+ instantly.
      
      The root cause is that if the icsk_rto happens to be zero in the 6th round
      (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
      due to the changed calculation method in tcp_retransmit_timer() as follows:
      
      icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
      
      Above line could be converted to
      icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0
      
      Therefore, the timer expires so quickly without any doubt.
      
      I read through the RFC 6298 and found that the RTO value can be rounded
      up to a certain value, in Linux, say TCP_RTO_MIN as default, which is
      regarded as the lower bound in this patch as suggested by Eric.
      
      Fixes: 36e31b0a ("net: TCP thin linear timeouts")
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarJason Xing <kernelxing@tencent.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e4dd0d3a
    • Liang Chen's avatar
      net: veth: Page pool creation error handling for existing pools only · 8a519a57
      Liang Chen authored
      The failure handling procedure destroys page pools for all queues,
      including those that haven't had their page pool created yet. this patch
      introduces necessary adjustments to prevent potential risks and
      inconsistency with the error handling behavior.
      
      Fixes: 0ebab78c ("net: veth: add page_pool for page recycling")
      Acked-by: default avatarJesper Dangaard Brouer <hawk@kernel.org>
      Signed-off-by: default avatarLiang Chen <liangchen.linux@gmail.com>
      Link: https://lore.kernel.org/r/20230812023016.10553-1-liangchen.linux@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      8a519a57
    • Jakub Kicinski's avatar
      Merge branch 'octeon_ep-fixes-for-error-and-remove-paths' · f6f978fc
      Jakub Kicinski authored
      Michal Schmidt says:
      
      ====================
      octeon_ep: fixes for error and remove paths
      
      I have an Octeon card that's misconfigured in a way that exposes a
      couple of bugs in the octeon_ep driver's error paths. It can reproduce
      the issues that patches 1 & 4 are fixing. Patches 2 & 3 are a result of
      reviewing the nearby code.
      ====================
      
      Link: https://lore.kernel.org/r/20230810150114.107765-1-mschmidt@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      f6f978fc
    • Michal Schmidt's avatar
      octeon_ep: cancel queued works in probe error path · 758c9107
      Michal Schmidt authored
      If it fails to get the devices's MAC address, octep_probe exits while
      leaving the delayed work intr_poll_task queued. When the work later
      runs, it's a use after free.
      
      Move the cancelation of intr_poll_task from octep_remove into
      octep_device_cleanup. This does not change anything in the octep_remove
      flow, but octep_device_cleanup is called also in the octep_probe error
      path, where the cancelation is needed.
      
      Note that the cancelation of ctrl_mbox_task has to follow
      intr_poll_task's, because the ctrl_mbox_task may be queued by
      intr_poll_task.
      
      Fixes: 24d43332 ("octeon_ep: poll for control messages")
      Signed-off-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Link: https://lore.kernel.org/r/20230810150114.107765-5-mschmidt@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      758c9107
    • Michal Schmidt's avatar
      octeon_ep: cancel ctrl_mbox_task after intr_poll_task · 607a7a45
      Michal Schmidt authored
      intr_poll_task may queue ctrl_mbox_task. The function
      octep_poll_non_ioq_interrupts_cn93_pf does this.
      
      When removing the driver and canceling these two works, cancel
      ctrl_mbox_task last to guarantee it does not run anymore.
      
      Fixes: 24d43332 ("octeon_ep: poll for control messages")
      Signed-off-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Link: https://lore.kernel.org/r/20230810150114.107765-4-mschmidt@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      607a7a45
    • Michal Schmidt's avatar
      octeon_ep: cancel tx_timeout_task later in remove sequence · 28458c80
      Michal Schmidt authored
      tx_timeout_task is canceled too early when removing the driver. Nothing
      prevents .ndo_tx_timeout from triggering and queuing the work again.
      
      Better cancel it after the netdev is unregistered.
      It's harmless for octep_tx_timeout_task to run in the window between the
      unregistration and cancelation, because it checks netif_running.
      
      Fixes: 862cd659 ("octeon_ep: Add driver framework and device initialization")
      Signed-off-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Link: https://lore.kernel.org/r/20230810150114.107765-3-mschmidt@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      28458c80
    • Michal Schmidt's avatar
      octeon_ep: fix timeout value for waiting on mbox response · 519b2279
      Michal Schmidt authored
      The intention was to wait up to 500 ms for the mbox response.
      The third argument to wait_event_interruptible_timeout() is supposed to
      be the timeout duration. The driver mistakenly passed absolute time
      instead.
      
      Fixes: 577f0d1b ("octeon_ep: add separate mailbox command and response queues")
      Signed-off-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230810150114.107765-2-mschmidt@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      519b2279
    • Radhey Shyam Pandey's avatar
      net: macb: In ZynqMP resume always configure PS GTR for non-wakeup source · 6c461e39
      Radhey Shyam Pandey authored
      On Zynq UltraScale+ MPSoC ubuntu platform when systemctl issues suspend,
      network manager bring down the interface and goes into suspend. When it
      wakes up it again enables the interface.
      
      This leads to xilinx-psgtr "PLL lock timeout" on interface bringup, as
      the power management controller power down the entire FPD (including
      SERDES) if none of the FPD devices are in use and serdes is not
      initialized on resume.
      
      $ sudo rtcwake -m no -s 120 -v
      $ sudo systemctl suspend  <this does ifconfig eth1 down>
      $ ifconfig eth1 up
      xilinx-psgtr fd400000.phy: lane 0 (type 10, protocol 5): PLL lock timeout
      phy phy-fd400000.phy.0: phy poweron failed --> -110
      
      macb driver is called in this way:
      1. macb_close: Stop network interface. In this function, it
         reset MACB IP and disables PHY and network interface.
      
      2. macb_suspend: It is called in kernel suspend flow. But because
         network interface has been disabled(netif_running(ndev) is
         false), it does nothing and returns directly;
      
      3. System goes into suspend state. Some time later, system is
         waken up by RTC wakeup device;
      
      4. macb_resume: It does nothing because network interface has
         been disabled;
      
      5. macb_open: It is called to enable network interface again. ethernet
         interface is initialized in this API but serdes which is power-off
         by PMUFW during FPD-off suspend is not initialized again and so
         we hit GT PLL lock issue on open.
      
      To resolve this PLL timeout issue always do PS GTR initialization
      when ethernet device is configured as non-wakeup source.
      
      Fixes: f22bd29b ("net: macb: Fix ZynqMP SGMII non-wakeup source resume failure")
      Fixes: 8b73fa3a ("net: macb: Added ZynqMP-specific initialization")
      Signed-off-by: default avatarRadhey Shyam Pandey <radhey.shyam.pandey@amd.com>
      Link: https://lore.kernel.org/r/1691414091-2260697-1-git-send-email-radhey.shyam.pandey@amd.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6c461e39
  2. 14 Aug, 2023 2 commits
  3. 13 Aug, 2023 1 commit
    • Russell King (Oracle)'s avatar
      net: phy: fix IRQ-based wake-on-lan over hibernate / power off · cc941e54
      Russell King (Oracle) authored
      Uwe reports:
      "Most PHYs signal WoL using an interrupt. So disabling interrupts [at
      shutdown] breaks WoL at least on PHYs covered by the marvell driver."
      
      Discussing with Ioana, the problem which was trying to be solved was:
      "The board in question is a LS1021ATSN which has two AR8031 PHYs that
      share an interrupt line. In case only one of the PHYs is probed and
      there are pending interrupts on the PHY#2 an IRQ storm will happen
      since there is no entity to clear the interrupt from PHY#2's registers.
      PHY#1's driver will get stuck in .handle_interrupt() indefinitely."
      
      Further confirmation that "the two AR8031 PHYs are on the same MDIO
      bus."
      
      With WoL using interrupts to wake the system, in such a case, the
      system will begin booting with an asserted interrupt. Thus, we need to
      cope with an interrupt asserted during boot.
      
      Solve this instead by disabling interrupts during PHY probe. This will
      ensure in Ioana's situation that both PHYs of the same type sharing an
      interrupt line on a common MDIO bus will have their interrupt outputs
      disabled when the driver probes the device, but before we hook in any
      interrupt handlers - thus avoiding the interrupt storm.
      
      A better fix would be for platform firmware to disable the interrupting
      devices at source during boot, before control is handed to the kernel.
      
      Fixes: e2f016cf ("net: phy: add a shutdown procedure")
      Link: 20230804071757.383971-1-u.kleine-koenig@pengutronix.de
      Reported-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: default avatarFlorian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cc941e54
  4. 11 Aug, 2023 3 commits
  5. 10 Aug, 2023 21 commits