1. 20 Jun, 2023 7 commits
    • netfilter: nf_tables: reject unbound anonymous set before commit phase · 938154b9
      Pablo Neira Ayuso authored
      Add a new list to track set transactions and to check for unbound
      anonymous sets before entering the commit phase.
      
      Bail out at the end of the transaction handling if an anonymous set
      remains unbound.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: disallow element updates of bound anonymous sets · c88c535b
      Pablo Neira Ayuso authored
      Anonymous sets come with NFT_SET_CONSTANT from userspace. Although the
      API allows creating anonymous sets without NFT_SET_CONSTANT, it makes no
      sense to allow adding and deleting elements for bound anonymous sets.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: fix underflow in object reference counter · d6b47866
      Pablo Neira Ayuso authored
      Since ("netfilter: nf_tables: drop map element references from
      preparation phase"), integration with the commit protocol is better;
      therefore, drop the workaround that b91d9036 ("netfilter: nf_tables:
      fix leaking object reference count") provides.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_set_pipapo: .walk does not deal with generations · 2b84e215
      Pablo Neira Ayuso authored
      The .walk callback iterates over the current active set, but it might be
      useful to iterate over the next generation set. Use the generation mask
      to determine which set view (either current or next generation) is used
      for the walk iteration.
      
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
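The idea of selecting a set view by generation mask can be illustrated with a simplified model. This is only a sketch under assumptions: it assumes two generations selected by a cursor of 0 or 1, and that a set bit in an element's mask marks the element as inactive in that generation. All names below are hypothetical, and the sketch does not model pipapo's internal data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model: two generations, selected by a cursor of 0 or 1.
 * A set bit in an element's genmask marks it inactive in that generation. */
static uint8_t genmask_cur(unsigned int cursor)
{
	return (uint8_t)(1u << cursor);
}

static uint8_t genmask_next(unsigned int cursor)
{
	return (uint8_t)(1u << (cursor ^ 1u));
}

/* Hypothetical element type, not a kernel structure. */
struct elem_model {
	int key;
	uint8_t genmask;
};

static bool elem_active(const struct elem_model *e, uint8_t genmask)
{
	return (e->genmask & genmask) == 0;
}

/* Walk the view selected by 'genmask' and count the elements visible in it. */
static int walk_count(const struct elem_model *elems, int n, uint8_t genmask)
{
	int visible = 0;

	for (int i = 0; i < n; i++)
		if (elem_active(&elems[i], genmask))
			visible++;
	return visible;
}
```

In this model, an element added in the pending transaction carries the current-generation bit (hidden now, visible next), while a deleted element carries the next-generation bit. Passing `genmask_cur()` or `genmask_next()` to the walk then yields the two different views.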
    • netfilter: nf_tables: drop map element references from preparation phase · 628bd3e4
      Pablo Neira Ayuso authored
      The set .destroy callback releases the references to other objects in
      maps. This is very late and results in spurious EBUSY errors. Drop the
      refcount from the preparation phase instead, and update the set backend
      not to drop the reference counter from the set .destroy path.
      
      Exceptions: NFT_TRANS_PREPARE_ERROR does not require dropping the
      reference counter because the transaction abort path releases the map
      references for each element since the set is unbound. The abort path
      also deals with releasing reference counter for new elements added to
      unbound sets.
      
      Fixes: 59105446 ("netfilter: nf_tables: revisit chain/object refcounting from elements")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain · 26b5a571
      Pablo Neira Ayuso authored
      Add a new state to deal with rule expression deactivation from the
      newrule error path; otherwise the anonymous set remains in the list in
      inactive state for the next generation. Mark the set/chain transaction
      as unbound so the abort path releases this object, and set it as
      inactive in the next generation so it is no longer reachable from this
      transaction and the reference counter is dropped.
      
      Fixes: 1240eb93 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: fix chain binding transaction logic · 4bedf9ee
      Pablo Neira Ayuso authored
      Add a bound flag to rule and chain transactions as in 6a0a8d10
      ("netfilter: nf_tables: use-after-free in failing rule with bound set")
      to skip them in case the chain is already bound from the abort
      path.
      
      This patch fixes an imbalance in the chain use refcnt that triggers a
      WARN_ON on the table and chain destroy path.
      
      This patch also disallows nested chain bindings, which is not
      supported from userspace.
      
      The logic to deal with chain binding in nft_data_hold() and
      nft_data_release() is not correct. The NFT_TRANS_PREPARE state needs
      special handling in case a chain is bound but the next expressions in
      the same rule fail to initialize, as described by 1240eb93 ("netfilter:
      nf_tables: incorrect error path handling with NFT_MSG_NEWRULE").
      
      The chain is left bound if rule construction fails, so the objects
      stored in this chain (and the chain itself) are released by the
      transaction records from the abort path; the follow-up patch ("netfilter:
      nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain")
      completes this error handling.
      
      When deleting an existing rule, the chain bound flag is set off so the
      rule expression .destroy path releases the objects.
      
      Fixes: d0e2c7de ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 19 Jun, 2023 3 commits
    • ipvs: align inner_mac_header for encapsulation · d7fce52f
      Terin Stock authored
      When using encapsulation the original packet's headers are copied to the
      inner headers. This preserves the space for an inner mac header, which
      is not used by the inner payloads for the encapsulation types supported
      by IPVS. If a packet is using GUE or GRE encapsulation and needs to be
      segmented, flow can be passed to __skb_udp_tunnel_segment() which
      calculates a negative tunnel header length. A negative tunnel header
      length causes pskb_may_pull() to fail, dropping the packet.
      
      This can be observed by attaching probes to ip_vs_in_hook(),
      __dev_queue_xmit(), and __skb_udp_tunnel_segment():
      
          perf probe --add '__dev_queue_xmit skb->inner_mac_header \
          skb->inner_network_header skb->mac_header skb->network_header'
          perf probe --add '__skb_udp_tunnel_segment:7 tnl_hlen'
          perf probe -m ip_vs --add 'ip_vs_in_hook skb->inner_mac_header \
          skb->inner_network_header skb->mac_header skb->network_header'
      
      These probes report the headers and tunnel header length for packets
      which traverse the IPVS encapsulation path. A TCP packet can be forced
      into the segmentation path by being smaller than a calculated clamped
      MSS, but larger than the advertised MSS.
      
          probe:ip_vs_in_hook: inner_mac_header=0x0 inner_network_header=0x0 mac_header=0x44 network_header=0x52
          probe:ip_vs_in_hook: inner_mac_header=0x44 inner_network_header=0x52 mac_header=0x44 network_header=0x32
          probe:dev_queue_xmit: inner_mac_header=0x44 inner_network_header=0x52 mac_header=0x44 network_header=0x32
          probe:__skb_udp_tunnel_segment_L7: tnl_hlen=-2
      
      When using veth-based encapsulation, the interfaces are set to be
      mac-less, which does not preserve space for an inner mac header. This
      prevents this issue from occurring.
      
      In our real-world testing of sending a 32KB file we observed operation
      time increasing from ~75ms for veth-based encapsulation to over 1.5s
      using IPVS encapsulation due to retries from dropped packets.
      
      This changeset modifies the packet on the encapsulation path in
      ip_vs_tunnel_xmit() and ip_vs_tunnel_xmit_v6() to remove the inner mac
      header offset. This fixes UDP segmentation for both encapsulation types,
      and corrects the inner headers for any IPIP flows that may use it.
      
      Fixes: 84c0d5e9 ("ipvs: allow tunneling with gue encapsulation")
      Signed-off-by: Terin Stock <terin@cloudflare.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
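The tnl_hlen=-2 value in the probe output can be reproduced with simple offset arithmetic. The following is a minimal sketch, assuming the tunnel header length is the inner MAC header offset minus the outer transport header offset, and assuming a 20-byte outer IPv4 header without options. The transport header offset is not shown in the probe output, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: models the tunnel header length as the distance
 * from the outer transport header to the inner MAC header. */
static int tunnel_header_len(uint16_t inner_mac_header,
			     uint16_t transport_header)
{
	return (int)inner_mac_header - (int)transport_header;
}

/* Assumption: the outer IPv4 header carries no options (20 bytes), so the
 * outer transport header starts 20 bytes after the network header. */
static uint16_t outer_transport_offset(uint16_t network_header)
{
	return (uint16_t)(network_header + 20);
}
```

With the offsets from the trace (inner_mac_header=0x44, network_header=0x32), the outer transport header lands at 0x46, so the modelled length is 0x44 - 0x46 = -2. If the inner MAC header offset instead coincided with the inner network header (0x52), the same arithmetic would give a positive length.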
    • Merge tag 'mlx5-fixes-2023-06-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 0dbcac3a
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-fixes-2023-06-16
      
      This series provides bug fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qca_spi: Avoid high load if QCA7000 is not available · 92717c23
      Stefan Wahren authored
      In case the QCA7000 is not available via SPI (e.g. in reset),
      the driver will cause a high load. The reason for this is
      that the synchronization is never finished and schedule()
      is never called. Since the synchronization is not timing
      critical, it's safe to drop this from the scheduling condition.
      Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
      Fixes: 291ab06e ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 18 Jun, 2023 3 commits
  4. 17 Jun, 2023 1 commit
    • sfc: use budget for TX completions · 4aaf2c52
      Íñigo Huguet authored
      When running workloads heavily unbalanced towards TX (high TX, low RX
      traffic), the sfc driver can hold the CPU for too long. Although in
      many cases this is not enough to be visible, it can affect
      performance and system responsiveness.
      
      A way to reproduce it is to use a debug kernel and run some parallel
      netperf TX tests. In some systems, this will lead to this message being
      logged:
        kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s!
      
      The reason is that the sfc driver doesn't account any NAPI budget for
      the TX completion events work. With high-TX/low-RX traffic, this means
      the CPU is held for a long time in NAPI poll.
      
      The documentation says "drivers can process completions for any number
      of Tx packets but should only process up to budget number of Rx packets".
      However, many drivers do limit the amount of TX completions that they
      process in a single NAPI poll.
      
      In the same way, this patch adds a limit for the TX work in sfc. With
      the patch applied, the watchdog warning never appears.
      
      Tested with netperf in different combinations: single process / parallel
      processes, TCP / UDP and different sizes of UDP messages. Repeated the
      tests before and after the patch, without any noticeable difference in
      network or CPU performance.
      
      Test hardware:
      Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core)
      Solarflare Communications XtremeScale X2522-25G Network Adapter
      
      Fixes: 5227eccc ("sfc: remove tx and MCDI handling from NAPI budget consideration")
      Fixes: d19a5372 ("sfc_ef100: TX path for EF100 NICs")
      Reported-by: Fei Liu <feliu@redhat.com>
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Link: https://lore.kernel.org/r/20230615084929.10506-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
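Limiting TX completion work per poll can be sketched with a small model. This is not the sfc implementation: the queue type and function name are hypothetical, and the sketch only shows the general technique of stopping after a fixed number of completions so that the caller can ask to be polled again.

```c
#include <stddef.h>

/* Hypothetical model of a TX completion queue: 'pending' counts the
 * completion events still waiting to be processed. */
struct tx_queue_model {
	size_t pending;
};

/* Process at most 'budget' TX completions and return how many were handled.
 * A caller that gets back 'budget' knows more work may remain and should
 * request another poll instead of looping until the queue is empty. */
static size_t process_tx_completions(struct tx_queue_model *q, size_t budget)
{
	size_t done = 0;

	while (q->pending > 0 && done < budget) {
		q->pending--;	/* stand-in for releasing one completed TX buffer */
		done++;
	}
	return done;
}
```

With 100 pending completions and a limit of 64, the first call handles 64 and leaves 36 for the next poll, so a single poll can no longer hold the CPU for an unbounded time.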
  5. 16 Jun, 2023 25 commits
    • net/mlx5e: Fix scheduling of IPsec ASO query while in atomic · a128f9d4
      Leon Romanovsky authored
      An ASO query can be scheduled in atomic context, so it can't use usleep.
      Use udelay as recommended in Documentation/timers/timers-howto.rst.
      
      Fixes: 76e463f6 ("net/mlx5e: Overcome slow response for first IPsec ASO WQE")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Drop XFRM state lock when modifying flow steering · c75b9425
      Leon Romanovsky authored
      An XFRM state that has been changed to XFRM_STATE_EXPIRED doesn't really
      need to hold the lock while modifying flow steering rules to drop traffic.
      
      Such a state can only be deleted, so the mlx5e_ipsec_handle_tx_limit()
      work will be canceled anyway and won't run in parallel.
      
      Fixes: b2f7b01d ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix ESN update kernel panic · fef06678
      Patrisious Haddad authored
      Previously during mlx5e_ipsec_handle_event the driver tried to execute
      an operation that could sleep, while holding a spinlock, which caused
      the kernel panic mentioned below.
      
      Move the function call that can sleep outside of the spinlock context.
      
       Call Trace:
       <TASK>
       dump_stack_lvl+0x49/0x6c
       __schedule_bug.cold+0x42/0x4e
       schedule_debug.constprop.0+0xe0/0x118
       __schedule+0x59/0x58a
       ? __mod_timer+0x2a1/0x3ef
       schedule+0x5e/0xd4
       schedule_timeout+0x99/0x164
       ? __pfx_process_timeout+0x10/0x10
       __wait_for_common+0x90/0x1da
       ? __pfx_schedule_timeout+0x10/0x10
       wait_func+0x34/0x142 [mlx5_core]
       mlx5_cmd_invoke+0x1f3/0x313 [mlx5_core]
       cmd_exec+0x1fe/0x325 [mlx5_core]
       mlx5_cmd_do+0x22/0x50 [mlx5_core]
       mlx5_cmd_exec+0x1c/0x40 [mlx5_core]
       mlx5_modify_ipsec_obj+0xb2/0x17f [mlx5_core]
       mlx5e_ipsec_update_esn_state+0x69/0xf0 [mlx5_core]
       ? wake_affine+0x62/0x1f8
       mlx5e_ipsec_handle_event+0xb1/0xc0 [mlx5_core]
       process_one_work+0x1e2/0x3e6
       ? __pfx_worker_thread+0x10/0x10
       worker_thread+0x54/0x3ad
       ? __pfx_worker_thread+0x10/0x10
       kthread+0xda/0x101
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x29/0x37
       </TASK>
       BUG: workqueue leaked lock or atomic: kworker/u256:4/0x7fffffff/189754
            last function: mlx5e_ipsec_handle_event [mlx5_core]
       CPU: 66 PID: 189754 Comm: kworker/u256:4 Kdump: loaded Tainted: G        W          6.2.0-2596.20230309201517_5.el8uek.rc1.x86_64 #2
       Hardware name: Oracle Corporation ORACLE SERVER X9-2/ASMMBX9-2, BIOS 61070300 08/17/2022
       Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_event [mlx5_core]
       Call Trace:
       <TASK>
       dump_stack_lvl+0x49/0x6c
       process_one_work.cold+0x2b/0x3c
       ? __pfx_worker_thread+0x10/0x10
       worker_thread+0x54/0x3ad
       ? __pfx_worker_thread+0x10/0x10
       kthread+0xda/0x101
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x29/0x37
       </TASK>
       BUG: scheduling while atomic: kworker/u256:4/189754/0x00000000
      
      Fixes: cee137a6 ("net/mlx5e: Handle ESN update events")
      Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
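The general pattern of moving a sleeping call out of a locked region can be sketched in plain C. This is an illustration under assumptions, not the driver's code: the commit message does not say how the call was moved, and every type and function below is a hypothetical stand-in (the "lock" only tracks whether it is held).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a spinlock: it only tracks whether it is held. */
struct lock_model {
	bool held;
};

struct esn_state_model {
	struct lock_model lock;
	uint32_t esn;		/* value protected by the lock */
	uint32_t hw_esn;	/* last value pushed to the modelled device */
};

static void lock_acquire(struct lock_model *l)
{
	assert(!l->held);
	l->held = true;
}

static void lock_release(struct lock_model *l)
{
	assert(l->held);
	l->held = false;
}

/* Stand-in for a command that may sleep: it must not run under the lock. */
static void push_to_device(struct esn_state_model *s, uint32_t esn)
{
	assert(!s->lock.held);
	s->hw_esn = esn;
}

static void update_esn(struct esn_state_model *s, uint32_t new_esn)
{
	uint32_t snapshot;

	lock_acquire(&s->lock);
	s->esn = new_esn;
	snapshot = s->esn;	/* copy what the slow call needs */
	lock_release(&s->lock);

	push_to_device(s, snapshot);	/* possibly sleeping call, lock dropped */
}
```

The design choice is to keep only the short state update inside the critical section and to hand the slow call a private copy of the data it needs.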
    • net/mlx5e: Don't delay release of hardware objects · cf5bb023
      Leon Romanovsky authored
      XFRM core provides two callbacks to release resources, one is .xdo_dev_policy_delete()
      and another is .xdo_dev_policy_free(). This separation allows delayed release so
      "ip xfrm policy free" commands won't starve. Unfortunately, mlx5 command interface
      can't run in .xdo_dev_policy_free() callbacks as the latter runs in ATOMIC context.
      
       BUG: scheduling while atomic: swapper/7/0/0x00000100
       Modules linked in: act_mirred act_tunnel_key cls_flower sch_ingress vxlan mlx5_vdpa vringh vhost_iotlb vdpa rpcrdma rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
       CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.3.0+ #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <IRQ>
        dump_stack_lvl+0x33/0x50
        __schedule_bug+0x4e/0x60
        __schedule+0x5d5/0x780
        ? __mod_timer+0x286/0x3d0
        schedule+0x50/0x90
        schedule_timeout+0x7c/0xf0
        ? __bpf_trace_tick_stop+0x10/0x10
        __wait_for_common+0x88/0x190
        ? usleep_range_state+0x90/0x90
        cmd_exec+0x42e/0xb40 [mlx5_core]
        mlx5_cmd_do+0x1e/0x40 [mlx5_core]
        mlx5_cmd_exec+0x18/0x30 [mlx5_core]
        mlx5_cmd_delete_fte+0xa8/0xd0 [mlx5_core]
        del_hw_fte+0x60/0x120 [mlx5_core]
        mlx5_del_flow_rules+0xec/0x270 [mlx5_core]
        ? default_send_IPI_single_phys+0x26/0x30
        mlx5e_accel_ipsec_fs_del_pol+0x1a/0x60 [mlx5_core]
        mlx5e_xfrm_free_policy+0x15/0x20 [mlx5_core]
        xfrm_policy_destroy+0x5a/0xb0
        xfrm4_dst_destroy+0x7b/0x100
        dst_destroy+0x37/0x120
        rcu_core+0x2d6/0x540
        __do_softirq+0xcd/0x273
        irq_exit_rcu+0x82/0xb0
        sysvec_apic_timer_interrupt+0x72/0x90
        </IRQ>
        <TASK>
        asm_sysvec_apic_timer_interrupt+0x16/0x20
       RIP: 0010:default_idle+0x13/0x20
       Code: c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 8b 05 7a 4d ee 00 85 c0 7e 07 0f 00 2d 2f 98 2e 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 40 b4 02 00
       RSP: 0018:ffff888100843ee0 EFLAGS: 00000242
       RAX: 0000000000000001 RBX: ffff888100812b00 RCX: 4000000000000000
       RDX: 0000000000000001 RSI: 0000000000000083 RDI: 000000000002d2ec
       RBP: 0000000000000007 R08: 00000021daeded59 R09: 0000000000000001
       R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        default_idle_call+0x30/0xb0
        do_idle+0x1c1/0x1d0
        cpu_startup_entry+0x19/0x20
        start_secondary+0xfe/0x120
        secondary_startup_64_no_verify+0xf3/0xfb
        </TASK>
       bad: scheduling from the idle thread!
      
      Fixes: a5b8ca94 ("net/mlx5e: Add XFRM policy offload logic")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Free IRQ rmap and notifier on kernel shutdown · 314ded53
      Saeed Mahameed authored
      The kernel IRQ system needs the IRQ affinity notifier to be cleared
      before attempting to free the IRQ; see the WARN_ON log below.
      
      On a normal driver unload we don't have this issue since we do the
      complete cleanup of the irq resources.
      
      To fix this, put the important resources cleanup in a helper function
      and use it in both normal driver unload and shutdown flows.
      
      [ 4497.498434] ------------[ cut here ]------------
      [ 4497.498726] WARNING: CPU: 0 PID: 9 at kernel/irq/manage.c:2034 free_irq+0x295/0x340
      [ 4497.499193] Modules linked in:
      [ 4497.499386] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.4.0-rc4+ #10
      [ 4497.499876] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
      [ 4497.500518] Workqueue: events do_poweroff
      [ 4497.500849] RIP: 0010:free_irq+0x295/0x340
      [ 4497.501132] Code: 85 c0 0f 84 1d ff ff ff 48 89 ef ff d0 0f 1f 00 e9 10 ff ff ff 0f 0b e9 72 ff ff ff 49 8d 7f 28 ff d0 0f 1f 00 e9 df fd ff ff <0f> 0b 48 c7 80 c0 008
      [ 4497.502269] RSP: 0018:ffffc90000053da0 EFLAGS: 00010282
      [ 4497.502589] RAX: ffff888100949600 RBX: ffff88810330b948 RCX: 0000000000000000
      [ 4497.503035] RDX: ffff888100949600 RSI: ffff888100400490 RDI: 0000000000000023
      [ 4497.503472] RBP: ffff88810330c7e0 R08: ffff8881004005d0 R09: ffffffff8273a260
      [ 4497.503923] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881009ae000
      [ 4497.504359] R13: ffff8881009ae148 R14: 0000000000000000 R15: ffff888100949600
      [ 4497.504804] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      [ 4497.505302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4497.505671] CR2: 00007fce98806298 CR3: 000000000262e005 CR4: 0000000000370ef0
      [ 4497.506104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4497.506540] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4497.507002] Call Trace:
      [ 4497.507158]  <TASK>
      [ 4497.507299]  ? free_irq+0x295/0x340
      [ 4497.507522]  ? __warn+0x7c/0x130
      [ 4497.507740]  ? free_irq+0x295/0x340
      [ 4497.507963]  ? report_bug+0x171/0x1a0
      [ 4497.508197]  ? handle_bug+0x3c/0x70
      [ 4497.508417]  ? exc_invalid_op+0x17/0x70
      [ 4497.508662]  ? asm_exc_invalid_op+0x1a/0x20
      [ 4497.508926]  ? free_irq+0x295/0x340
      [ 4497.509146]  mlx5_irq_pool_free_irqs+0x48/0x90
      [ 4497.509421]  mlx5_irq_table_free_irqs+0x38/0x50
      [ 4497.509714]  mlx5_core_eq_free_irqs+0x27/0x40
      [ 4497.509984]  shutdown+0x7b/0x100
      [ 4497.510184]  pci_device_shutdown+0x30/0x60
      [ 4497.510440]  device_shutdown+0x14d/0x240
      [ 4497.510698]  kernel_power_off+0x30/0x70
      [ 4497.510938]  process_one_work+0x1e6/0x3e0
      [ 4497.511183]  worker_thread+0x49/0x3b0
      [ 4497.511407]  ? __pfx_worker_thread+0x10/0x10
      [ 4497.511679]  kthread+0xe0/0x110
      [ 4497.511879]  ? __pfx_kthread+0x10/0x10
      [ 4497.512114]  ret_from_fork+0x29/0x50
      [ 4497.512342]  </TASK>
      
      Fixes: 9c2d0801 ("net/mlx5: Free irqs only on shutdown callback")
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
    • net/mlx5: DR, Fix wrong action data allocation in decap action · ef4c5afc
      Yevgeny Kliteynik authored
      When a TUNNEL_L3_TO_L2 decap action was created, a pointer to a local
      variable was passed as its HW action data, resulting in an attempt to
      free an invalid address:
      
        BUG: KASAN: invalid-free in mlx5dr_action_destroy+0x318/0x410 [mlx5_core]
      
      Fixes: 4781df92 ("net/mlx5: DR, Move STEv0 modify header logic")
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Alex Vesker <valex@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: DR, Support SW created encap actions for FW table · 87cd0649
      Yevgeny Kliteynik authored
      In some cases, steering might need to use a SW-created action in a
      FW table, which results in the wrong packet reformat being used:
      
        mlx5_core 0000:81:00.1: mlx5_cmd_check:756:(pid 1154):
            SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed,
            status bad resource(0x5), syndrome (0xf2ff71)
      
      This patch adds support for the use of SW-created packet reformat (encap)
      actions in FW tables, and adds a clear error flow for attempts to use
      SW-created modify header on FW tables.
      
      Fixes: 6a48faee ("net/mlx5: Add direct rule fs_cmd implementation")
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: TC, Cleanup ct resources for nic flow · fb7be476
      Chris Mi authored
      The cited commit removes special handling of CT action. But it
      removes too much. Pre ct/ct_nat tables and some other resources
      are not destroyed due to the cited commit.
      
      Fix it by adding it back.
      
      Fixes: 08fe94ec ("net/mlx5e: TC, Remove special handling of CT action")
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Reviewed-by: Paul Blakey <paulb@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: TC, Add null pointer check for hardware miss support · b100573a
      Chris Mi authored
      The cited commits add hardware miss support to the tc action. But if
      the rules can't be offloaded, the pointers are null and the system
      will panic when accessing them.
      
      Fix it by checking for null pointers.
      
      Fixes: 08fe94ec ("net/mlx5e: TC, Remove special handling of CT action")
      Fixes: 67027828 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Reviewed-by: Paul Blakey <paulb@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Fix driver load with single msix vector · 0ab999d4
      Eli Cohen authored
      When a PCI device has just one msix vector available, we want to share
      this vector between async and completion events. Current code fails to
      do that assuming it will always have at least one dedicated vector for
      completion events. Fix this by detecting when the pool contains just a
      single vector.
      
      Fixes: 3354822c ("net/mlx5: Use dynamic msix vectors allocation")
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: xsk: Set napi_id to support busy polling on XSK RQ · 62a522d3
      Maxim Mikityanskiy authored
      The cited commit missed setting napi_id on XSK RQs; it only affected
      regular RQs. Add the missing part to support socket busy polling on XSK
      RQs.
      
      Fixes: a2740f52 ("net/mlx5e: xsk: Set napi_id to support busy polling")
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: XDP, Allow growing tail for XDP multi buffer · 4e7401fc
      Maxim Mikityanskiy authored
      The cited commits missed passing frag_size to __xdp_rxq_info_reg, which
      is required by bpf_xdp_adjust_tail to support growing the tail pointer
      in fragmented packets. Pass the missing parameter when the current RQ
      mode allows XDP multi buffer.
      
      Fixes: ea5d49bd ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ")
      Fixes: 9cb9482e ("net/mlx5e: Use fragments of the same size in non-linear legacy RQ with XDP")
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Cc: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Merge branch 'check-if-fips-mode-is-enabled-when-running-selftests' · d4e06728
      Jakub Kicinski authored
      Magali Lemes says:
      
      ====================
      Check if FIPS mode is enabled when running selftests
      
      Some test cases from net/tls, net/fcnal-test and net/vrf-xfrm-tests
      that rely on cryptographic functions and use algorithms that are not
      FIPS-compliant fail in FIPS mode.
      
      In order to allow these tests to pass in a wider set of kernels,
       - for net/tls, skip the test variants that use the ChaCha20-Poly1305
      and SM4 algorithms, when FIPS mode is enabled;
       - for net/fcnal-test, skip the MD5 tests, when FIPS mode is enabled;
       - for net/vrf-xfrm-tests, replace the algorithms that are not
      FIPS-compliant with compliant ones.
      
      v1: https://lore.kernel.org/netdev/20230607174302.19542-1-magali.lemes@canonical.com/
      v2: https://lore.kernel.org/netdev/20230609164324.497813-1-magali.lemes@canonical.com/
      v3: https://lore.kernel.org/netdev/20230612125107.73795-1-magali.lemes@canonical.com/
      ====================
      
      Link: https://lore.kernel.org/r/20230613123222.631897-1-magali.lemes@canonical.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: fcnal-test: check if FIPS mode is enabled · d7a2fc14
      Magali Lemes authored
      There are some MD5 tests which fail when the kernel is in FIPS mode,
      since MD5 is not FIPS compliant. Add a check and only run those tests
      if FIPS mode is not enabled.
      
      Fixes: f0bee1eb ("fcnal-test: Add TCP MD5 tests")
      Fixes: 5cad8bce ("fcnal-test: Add TCP MD5 tests for VRF")
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: vrf-xfrm-tests: change authentication and encryption algos · cb43c60e
      Magali Lemes authored
      The vrf-xfrm-tests tests use the hmac(md5) and cbc(des3_ede)
      algorithms for performing authentication and encryption, respectively.
      This causes the tests to fail when fips=1 is set, since these algorithms
      are not allowed in FIPS mode. Therefore, switch from hmac(md5) and
      cbc(des3_ede) to hmac(sha1) and cbc(aes), which are FIPS compliant.
      
      Fixes: 3f251d74 ("selftests: Add tests for vrf and xfrms")
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: tls: check if FIPS mode is enabled · d113c395
      Magali Lemes authored
      TLS selftests use the ChaCha20-Poly1305 and SM4 algorithms, which are not
      FIPS compliant. When fips=1, this set of tests fails. Add a check and only
      run these tests if not in FIPS mode.
      
      Fixes: 4f336e88 ("selftests/tls: add CHACHA20-POLY1305 to tls selftests")
      Fixes: e506342a ("selftests/tls: add SM4 GCM/CCM to tls selftests")
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
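A FIPS-mode check of the kind described above could look roughly like the following. This is a minimal sketch, assuming the flag is read from /proc/sys/crypto/fips_enabled; the commit message does not name the file, and the helper name is hypothetical.

```c
#include <stdio.h>

/* Return 1 if the file at 'path' holds a non-zero integer, else 0.
 * A missing or unreadable file is treated as "FIPS mode off". */
static int fips_flag_from_file(const char *path)
{
	FILE *f = fopen(path, "r");
	int enabled = 0;

	if (!f)
		return 0;
	if (fscanf(f, "%d", &enabled) != 1)
		enabled = 0;	/* unparsable content counts as disabled */
	fclose(f);
	return enabled != 0;
}
```

A test setup could call `fips_flag_from_file("/proc/sys/crypto/fips_enabled")` and skip the ChaCha20-Poly1305 and SM4 variants when it returns 1. Taking the path as a parameter keeps the helper testable against an ordinary file.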
    • selftests/harness: allow tests to be skipped during setup · 372b304c
      Magali Lemes authored
      Before executing each test from a fixture, FIXTURE_SETUP is run once.
      When SKIP is used in FIXTURE_SETUP, the setup function returns early
      but the test still proceeds to run, unless another SKIP macro is used
      within the test definition, leading to some code repetition. Therefore,
      allow tests to be skipped directly from the setup function.
      
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Magali Lemes <magali.lemes@canonical.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      372b304c
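The control flow this change aims for can be modelled with a toy runner. The sketch below does not use the real kselftest harness macros; `struct test_state`, `fixture_setup`, `test_body` and `run_test` are hypothetical stand-ins that only show the idea of checking a skip flag after setup and before the test body.

```c
#include <stdbool.h>

/* Hypothetical per-test state; the real harness keeps its own metadata. */
struct test_state {
	bool skip;      /* set when setup decides the test cannot run */
	bool body_ran;  /* set by the test body, for illustration only */
};

/* Stand-in for a FIXTURE_SETUP that skips when a precondition is missing. */
static void fixture_setup(struct test_state *st, bool precondition_met)
{
	if (!precondition_met)
		st->skip = true;
}

/* Stand-in for the test definition itself. */
static void test_body(struct test_state *st)
{
	st->body_ran = true;
}

/* Runner: consult the skip flag after setup, so the body does not
 * need to repeat the same check. */
static void run_test(struct test_state *st, bool precondition_met)
{
	fixture_setup(st, precondition_met);
	if (st->skip)
		return;
	test_body(st);
}
```

With this shape, a skip decided in setup is enough on its own, which removes the repeated check in every test body.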
    • Merge tag 'net-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 40f71e7c
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from wireless, and netfilter.
      
        Selftests excluded - we have 58 patches and diff of +442/-199, which
        isn't really small but perhaps with the exception of the WiFi locking
        change it's old(ish) bugs.
      
        We have no known problems with v6.4.
      
        The selftest changes are rather large as MPTCP folks try to apply
        Greg's guidance that selftest from torvalds/linux should be able to
        run against stable kernels.
      
        Last thing I should call out is the DCCP/UDP-lite deprecation notices.
        We are fairly sure those are dead, but if we're wrong reverting them
        back in won't be fun.
      
        Current release - regressions:
      
         - wifi:
            - cfg80211: fix double lock bug in reg_wdev_chan_valid()
            - iwlwifi: mvm: spin_lock_bh() to fix lockdep regression
      
        Current release - new code bugs:
      
         - handshake: remove fput() that causes use-after-free
      
        Previous releases - regressions:
      
         - sched: cls_u32: fix reference counter leak leading to overflow
      
         - sched: cls_api: fix lockup on flushing explicitly created chain
      
        Previous releases - always broken:
      
         - nf_tables: integrate pipapo into commit protocol
      
         - nf_tables: incorrect error path handling with NFT_MSG_NEWRULE, fix
           dangling pointer on failure
      
         - ping6: fix send to link-local addresses with VRF
      
         - sched: act_pedit: parse L3 header for L4 offset, the skb may not
           have the offset saved
      
         - sched: act_ct: fix promotion of offloaded unreplied tuple
      
         - sched: refuse to destroy an ingress and clsact Qdiscs if there are
           lockless change operations in flight
      
         - wifi: mac80211: fix handful of bugs in multi-link operation
      
         - ipvlan: fix bound dev checking for IPv6 l3s mode
      
         - eth: enetc: correct the indexes of highest and 2nd highest TCs
      
         - eth: ice: fix XDP memory leak when NIC is brought up and down
      
        Misc:
      
         - add deprecation notices for UDP-lite and DCCP
      
         - selftests: mptcp: skip tests not supported by old kernels
      
         - sctp: handle invalid error codes without calling BUG()"
      
      * tag 'net-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
        dccp: Print deprecation notice.
        udplite: Print deprecation notice.
        octeon_ep: Add missing check for ioremap
        selftests/ptp: Fix timestamp printf format for PTP_SYS_OFFSET
        net: ethernet: stmicro: stmmac: fix possible memory leak in __stmmac_open
        net: tipc: resize nlattr array to correct size
        sfc: fix XDP queues mode with legacy IRQ
        net: macsec: fix double free of percpu stats
        net: lapbether: only support ethernet devices
        MAINTAINERS: add reviewers for SMC Sockets
        s390/ism: Fix trying to free already-freed IRQ by repeated ism_dev_exit()
        net: dsa: felix: fix taprio guard band overflow at 10Mbps with jumbo frames
        net/sched: cls_api: Fix lockup on flushing explicitly created chain
        ice: Fix ice module unload
        net/handshake: remove fput() that causes use-after-free
        selftests: forwarding: hw_stats_l3: Set addrgenmode in a separate step
        net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
        net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
        net/sched: act_ct: Fix promotion of offloaded unreplied tuple
        wifi: iwlwifi: mvm: spin_lock_bh() to fix lockdep regression
        ...
      40f71e7c
    • Merge tag 'loongarch-fixes-6.4-1' of... · 627d8586
      Linus Torvalds authored
      Merge tag 'loongarch-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch fixes from Huacai Chen:
       "Some trivial bug fixes for v6.4-rc7"
      
      * tag 'loongarch-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: Fix debugfs_create_dir() error checking
        LoongArch: Avoid uninitialized alignment_mask
        LoongArch: Fix perf event id calculation
        LoongArch: Fix the write_fcsr() macro
        LoongArch: Let pmd_present() return true when splitting pmd
      627d8586
    • Merge tag 'for-6.4/dm-fixes' of... · 0e306952
      Linus Torvalds authored
      Merge tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fixes from Mike Snitzer:
      
       - Fix DM thinp discard performance regression introduced during this
         merge window where DM core was splitting large discards every 128K
         (max_sectors_kb) rather than every 64M (discard_max_bytes).
      
       - Extend DM core LOCKFS fix, made during 6.4 merge, to also fix race
         between do_mount and dm's do_suspend (in addition to the earlier
         fix's do_mount race with dm's do_resume).
      
       - Fix DM thin metadata operations to first check if the thin-pool is in
         "fail_io" mode; otherwise UAF can occur.
      
       - Fix DM thinp's call to __blkdev_issue_discard to use GFP_NOIO rather
         than GFP_NOWAIT (__blkdev_issue_discard cannot handle NULL return
         from bio_alloc).
      
      * tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm: use op specific max_sectors when splitting abnormal io
        dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard
        dm thin metadata: check fail_io before using data_sm
        dm: don't lock fs when the map is NULL during suspend or resume
      0e306952
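To put the discard regression above in numbers, the sketch below counts how many pieces a discard is cut into for a given split limit. It is plain arithmetic, not device-mapper code; `split_count` is a hypothetical helper, and the 1 GiB discard size is an assumed example.

```c
#include <stdint.h>

/* Hypothetical helper: number of pieces needed to cover total_bytes
 * with chunks of at most limit_bytes (ceiling division).
 * limit_bytes must be non-zero. */
static uint64_t split_count(uint64_t total_bytes, uint64_t limit_bytes)
{
	return (total_bytes + limit_bytes - 1) / limit_bytes;
}
```

For a 1 GiB discard, `split_count(1ULL << 30, 128ULL << 10)` gives 8192 pieces at the 128K limit, while `split_count(1ULL << 30, 64ULL << 20)` gives 16 pieces at the 64M limit, which is a 512-fold difference in the number of bios issued.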
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 93fd8eb0
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "This is an unusually large bunch of bug fixes for the later rc cycle;
        rxe and mlx5 both dumped a lot of things at once. rxe continues to fix
        itself, and mlx5 is fixing a bunch of "queue counters" related bugs.
      
        There is one highly notable bug fix regarding the qkey. This small
        security check was missed in the original 2005 implementation and it
        allows some significant issues.
      
        Summary:
      
         - Two rtrs bug fixes for error unwind bugs
      
         - Several rxe bug fixes:
            * Incorrect Rx packet validation
            * Using memory without a refcount
            * Syzkaller found use before initialization
            * Regression fix for missing locking with the tasklet conversion
              from this merge window
      
         - Have bnxt report the correct link properties to userspace, this was
           a regression in v6.3
      
         - Several mlx5 bug fixes:
            * Kernel crash triggerable by userspace for the RAW ethernet
              profile
            * Defend against steering refcounting issues created by userspace
            * Incorrect change of QP port affinity parameters in some LAG
              configurations
      
         - Fix mlx5 Q counters:
            * Do not over allocate Q counters to allow userspace to use the
              full port capacity
            * Kernel crash triggered by eswitch due to mis-use of Q counters
            * Incorrect mlx5_device for Q counters in some LAG configurations
      
         - Properly implement the IBA spec restricting privileged qkeys to
           root
      
         - Always return an error when reading from a disassociated device's
           event queue
      
         - isert bug fixes:
            * Avoid a deadlock with the CM handler and CM ID destruction
            * Correct list corruption due to incorrect locking
            * Fix a use after free around connection tear down"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/rxe: Fix rxe_cq_post
        IB/isert: Fix incorrect release of isert connection
        IB/isert: Fix possible list corruption in CMA handler
        IB/isert: Fix dead lock in ib_isert
        RDMA/mlx5: Fix affinity assignment
        IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
        RDMA/uverbs: Restrict usage of privileged QKEYs
        RDMA/cma: Always set static rate to 0 for RoCE
        RDMA/mlx5: Fix Q-counters query in LAG mode
        RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
        RDMA/mlx5: Fix Q-counters per vport allocation
        RDMA/mlx5: Create an indirect flow table for steering anchor
        RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
        RDMA/rxe: Fix the use-before-initialization error of resp_pkts
        RDMA/bnxt_re: Fix reporting active_{speed,width} attributes
        RDMA/rxe: Fix ref count error in check_rkey()
        RDMA/rxe: Fix packet length checks
        RDMA/rtrs: Fix rxe_dealloc_pd warning
        RDMA/rtrs: Fix the last iu->buf leak in err path
      93fd8eb0
    • Merge tag 'spi-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · b7feaa49
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A few more driver specific fixes.
      
        The DesignWare fix is for an issue introduced by conversion to the
        chip select accessor functions and is pretty important but the other
        two are less severe"
      
      * tag 'spi-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: dw: Replace incorrect spi_get_chipselect with set
        spi: fsl-dspi: avoid SCK glitches with continuous transfers
        spi: cadence-quadspi: Add missing check for dma_set_mask
      b7feaa49
    • Merge tag 'regulator-fix-v6.4-rc6' of... · eee71c34
      Linus Torvalds authored
      Merge tag 'regulator-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
      
      Pull regulator fix from Mark Brown:
       "The set of regulators described for the Qualcomm PM8550 just seems to
        have been completely wrong and would likely not have worked at all if
        anything tried to actually configure anything except for enabling and
        disabling at runtime"
      
      * tag 'regulator-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
        regulator: qcom-rpmh: Fix regulators for PM8550
      eee71c34
    • Merge tag 'regmap-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · 231a1e31
      Linus Torvalds authored
      Pull regmap fix from Mark Brown:
       "Another fix for the maple tree cache: Takashi noticed that, unlike
        other caches, the maple tree cache didn't check for read-only
        registers before trying to sync, which would result in spurious syncs
        for read-only registers where we don't have a default.
      
        This was due to the check being open coded in the caches; we now check
        in the shared 'does this register need sync' function, so it is fixed
        for this and future caches"
      
      * tag 'regmap-fix-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
        regmap: regcache: Don't sync read-only registers
      231a1e31
    • Merge tag 'media/v6.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · c926a55f
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "A fix for dvb-core to avoid a race condition during DVB board
        registration"
      
      * tag 'media/v6.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        Revert "media: dvb-core: Fix use-after-free on race condition at dvb_frontend"
      c926a55f
  6. 15 Jun, 2023 1 commit