1. 31 Jan, 2024 3 commits
    • netfilter: conntrack: check SCTP_CID_SHUTDOWN_ACK for vtag setting in sctp_new · 6e348067
      Xin Long authored
      The comment in sctp_new() says: "If it is a shutdown ack OOTB packet, we
      expect a return shutdown complete, otherwise an ABORT Sec 8.4 (5) and (8)".
      However, the code does not check for SCTP_CID_SHUTDOWN_ACK before setting
      vtag[REPLY] in the conntrack entry (ct).

      Because of that, if the ct in the Router disappears for some reason at [1]
      in a packet sequence like the one below:
      
         Client > Server: sctp (1) [INIT] [init tag: 3201533963]
         Server > Client: sctp (1) [INIT ACK] [init tag: 972498433]
         Client > Server: sctp (1) [COOKIE ECHO]
         Server > Client: sctp (1) [COOKIE ACK]
         Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057809]
         Server > Client: sctp (1) [SACK] [cum ack 3075057809]
         Server > Client: sctp (1) [HB REQ]
         (the ct in Router disappears somehow)  <-------- [1]
         Client > Server: sctp (1) [HB ACK]
         Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057810]
         Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057810]
         Client > Server: sctp (1) [HB REQ]
         Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057810]
         Client > Server: sctp (1) [HB REQ]
         Client > Server: sctp (1) [ABORT]
      
      when the Router processes the HB ACK packet, it calls sctp_new() to
      initialize the new ct with vtag[REPLY] set to the HB ACK packet's vtag.

      Later, when the Client sends DATA, all the SACKs from the Server get
      dropped in the Router, because the SACK packet's vtag does not match
      vtag[REPLY] in the ct. Worse, the vtag in this ct is never corrected
      by subsequent packets from the Server.

      This patch fixes it by checking for SCTP_CID_SHUTDOWN_ACK before setting
      vtag[REPLY] in the ct in sctp_new(), as the comment says. With this
      fix, vtag[REPLY] in the ct is left at 0 in the case above, and the
      next HB REQ/ACK from the Server is able to set the vtag, because its
      value is 0 in nf_conntrack_sctp_packet().
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6e348067
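
The vtag handling described in the commit above can be sketched in plain C. This is a toy model, not the kernel code: `struct toy_ct`, `toy_new_ct()` and `toy_heartbeat()` are hypothetical names, and the only behaviour modelled is what the message states (vtag[REPLY] is set at creation only for SHUTDOWN ACK, and a later heartbeat may fill in a vtag that is still 0). The chunk type numbers are the ones assigned by the SCTP specification.

```c
#include <stdint.h>

/* Chunk type numbers as assigned by the SCTP specification. */
enum { CID_HEARTBEAT = 4, CID_HEARTBEAT_ACK = 5, CID_SHUTDOWN_ACK = 8 };
enum { DIR_ORIGINAL = 0, DIR_REPLY = 1 };

/* Hypothetical, stripped-down stand-in for a conntrack entry. */
struct toy_ct {
    uint32_t vtag[2];
};

/* Create a ct from a mid-stream packet: only a SHUTDOWN ACK seeds vtag[REPLY]. */
static void toy_new_ct(struct toy_ct *ct, int chunk_type, uint32_t pkt_vtag)
{
    ct->vtag[DIR_ORIGINAL] = 0;
    ct->vtag[DIR_REPLY] = 0;
    if (chunk_type == CID_SHUTDOWN_ACK)
        ct->vtag[DIR_REPLY] = pkt_vtag;
}

/*
 * Handle a later heartbeat seen in direction dir (assumed behaviour):
 * learn the vtag if it is still 0, otherwise require a match.
 * Returns 1 if the packet is accepted, 0 if it would be dropped.
 */
static int toy_heartbeat(struct toy_ct *ct, int dir, uint32_t pkt_vtag)
{
    if (ct->vtag[dir] == 0) {
        ct->vtag[dir] = pkt_vtag;
        return 1;
    }
    return ct->vtag[dir] == pkt_vtag;
}
```

With the tags from the trace, a ct created from the Client's HB ACK keeps vtag[REPLY] at 0, so the Server's next heartbeat can still set it to the correct value.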
    • netfilter: nf_tables: restrict tunnel object to NFPROTO_NETDEV · 776d4516
      Pablo Neira Ayuso authored
      Bail out when the tunnel dst template is used from a family other than
      netdev. Add the infrastructure to check for the family in objects.
      
      Fixes: af308b94 ("netfilter: nf_tables: add tunnel support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      776d4516
    • netfilter: conntrack: correct window scaling with retransmitted SYN · fb366fc7
      Ryan Schaefer authored
      Commit c7aab4f1 ("netfilter: nf_conntrack_tcp: re-init for syn packets
      only") introduced a bug where SYNs in the ORIGINAL direction on a reused
      5-tuple result in incorrect window scale negotiation. That commit merged
      the SYN re-initialization and the simultaneous open or SYN retransmit
      cases. Merging these blocks added the logic in tcp_init_sender() that
      performs window scale negotiation to the retransmitted SYN case.
      Previously, this case would only update the sender's scale and flags.
      After the merge, the additional logic improperly clears the scale in the
      ORIGINAL direction before any packets in the REPLY direction are
      received. As a result, packets are incorrectly marked invalid for being
      out-of-window.
      
      This can be reproduced with the following trace:
      
      Packet Sequence:
      > Flags [S], seq 1687765604, win 62727, options [.. wscale 7], length 0
      > Flags [S], seq 1944817196, win 62727, options [.. wscale 7], length 0
      
      In order to fix the issue, only evaluate window negotiation for packets
      in the REPLY direction. This was tested with simultaneous open, fast
      open, and the above reproduction.
      
      Fixes: c7aab4f1 ("netfilter: nf_conntrack_tcp: re-init for syn packets only")
      Signed-off-by: Ryan Schaefer <ryanschf@amazon.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      fb366fc7
  2. 25 Jan, 2024 3 commits
  3. 24 Jan, 2024 25 commits
    • Merge branch 'fix-module_description-for-net-p2' · 77be2247
      Jakub Kicinski authored
      Breno Leitao says:
      
      ====================
      Fix MODULE_DESCRIPTION() for net (p2)
      
      There are hundreds of network modules that miss MODULE_DESCRIPTION(),
      causing a warning when compiling with W=1. Example:
      
              WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com90io.o
              WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/arc-rimi.o
              WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com20020.o
      
      Part 2 of this patchset focuses on the drivers/net/ethernet drivers.
      There are still some remaining warnings in drivers/net/ethernet that will
      be fixed in an upcoming patchset.
      
      v1: https://lore.kernel.org/all/20240122184543.2501493-2-leitao@debian.org/
      ====================
      
      Link: https://lore.kernel.org/r/20240123190332.677489-1-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      77be2247
    • net: fill in MODULE_DESCRIPTION()s for rvu_mbox · bdc67341
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the Marvell RVU mbox driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240123190332.677489-11-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bdc67341
    • net: fill in MODULE_DESCRIPTION()s for litex · 07d1e0ce
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the LiteX Liteeth Ethernet device.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Acked-by: Gabriel Somlo <gsomlo@gmail.com>
      Link: https://lore.kernel.org/r/20240123190332.677489-10-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      07d1e0ce
    • net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio · 8183c470
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the Freescale PQ MDIO driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240123190332.677489-9-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8183c470
    • net: fill in MODULE_DESCRIPTION()s for fec · 2e875764
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the FEC (MPC8xx) Ethernet controller.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Reviewed-by: Wei Fang <wei.fang@nxp.com>
      Link: https://lore.kernel.org/r/20240123190332.677489-8-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2e875764
    • net: fill in MODULE_DESCRIPTION()s for enetc · 07c42d23
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the NXP ENETC Ethernet driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240123190332.677489-7-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      07c42d23
    • net: fill in MODULE_DESCRIPTION()s for nps_enet · 27881ca8
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the EZchip NPS ethernet driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240123190332.677489-6-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      27881ca8
    • net: fill in MODULE_DESCRIPTION()s for ep93xxx_eth · 53c83e2d
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the Cirrus EP93xx ethernet driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240123190332.677489-5-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      53c83e2d
    • net: fill in MODULE_DESCRIPTION()s for liquidio · bb567fbb
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the Cavium Liquidio driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240123190332.677489-4-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bb567fbb
    • net: fill in MODULE_DESCRIPTION()s for Broadcom bgmac · 39535d7f
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to the Broadcom iProc GBit driver.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://lore.kernel.org/r/20240123190332.677489-3-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      39535d7f
    • net: fill in MODULE_DESCRIPTION()s for 8390 · f5e41416
      Breno Leitao authored
      W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
      Add descriptions to all the good old 8390 modules and drivers.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      CC: geert@linux-m68k.org
      Link: https://lore.kernel.org/r/20240123190332.677489-2-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f5e41416
    • selftests: netdevsim: fix the udp_tunnel_nic test · 0879020a
      Jakub Kicinski authored
      This test is missing a whole bunch of checks for interface
      renaming and one ifup. Presumably it was only used on a system
      with renaming disabled and NetworkManager running.
      
      Fixes: 91f430b2 ("selftests: net: add a test for UDP tunnel info infra")
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240123060529.1033912-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0879020a
    • selftests: net: fix rps_default_mask with >32 CPUs · 0719b533
      Jakub Kicinski authored
      If there are more than 32 CPUs, the bitmask will start to contain
      commas, leading to:
      
      ./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected
      
      Remove the commas; bash doesn't interpret leading zeroes as octal,
      so that should be good enough. Switch to bash, since Simon reports that
      not all shells support this type of substitution.
      
      Fixes: c12e0d5f ("self-tests: introduce self-tests for RPS default mask")
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240122195815.638997-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0719b533
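
The comma problem in the commit above is easy to reproduce outside the shell. The following is only an illustrative sketch in C, not the fix itself (the commit edits a bash script): `parse_cpu_mask()` is a hypothetical helper that strips the commas from a grouped hex mask before parsing it, and it assumes the mask fits in 64 bits.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse a comma-grouped hex CPU mask such as "00000000,00000003".
 * Hypothetical helper: assumes the mask fits in 64 bits.
 * Returns 0 on success and -1 on malformed or oversized input.
 */
static int parse_cpu_mask(const char *text, unsigned long long *out)
{
    char buf[64];
    size_t n = 0;
    char *end;

    for (; *text != '\0'; text++) {
        if (*text == ',')
            continue;               /* drop the group separators */
        if (n + 1 >= sizeof(buf))
            return -1;              /* input too long for this sketch */
        buf[n++] = *text;
    }
    buf[n] = '\0';
    if (n == 0)
        return -1;

    *out = strtoull(buf, &end, 16);
    return *end == '\0' ? 0 : -1;   /* reject trailing junk */
}
```

Passing the string from the error message, "00000000,00000000", yields 0 once the comma is removed, which is the value the integer comparison in the test expects.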
    • net: mvpp2: clear BM pool before initialization · 9f538b41
      Jenishkumar Maheshbhai Patel authored
      Register values persist after booting the kernel using
      kexec, which results in a kernel panic. Thus, clear the
      BM pool registers before initialisation to fix the issue.
      
      Fixes: 3f518509 ("ethernet: Add new driver for Marvell Armada 375 network unit")
      Signed-off-by: Jenishkumar Maheshbhai Patel <jpatel2@marvell.com>
      Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Link: https://lore.kernel.org/r/20240119035914.2595665-1-jpatel2@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9f538b41
    • net: stmmac: Wait a bit for the reset to take effect · a5f5eee2
      Bernd Edlinger authored
      otherwise the synopsys_id value may be read out wrong,
      because the GMAC_VERSION register might still be in reset
      state, for at least 1 us after the reset is de-asserted.
      
      Add a wait for 10 us before continuing to be on the safe side.
      
      > From what have you got that delay value?
      
      Just trial and error. With very old Linux versions and old gcc versions
      the synopsys_id was read out correctly most of the time (but not always);
      with recent Linux versions and recent gcc versions it was read out
      wrongly most of the time, but again not always.
      I don't have access to the VHDL code in question, so I cannot
      tell why it takes so long to get the correct values. I also do not
      have more than a few hardware samples, so I cannot tell how long
      this timeout must be in the worst case.
      Experimentally, I can tell that the register is read several times
      as zero immediately after the reset is de-asserted. Adding several
      no-ops is not enough, but adding a printk is enough. udelay(1) also seems
      to be enough, but I did not try that very often, and I do not have access
      to enough hardware samples to be 100% sure about the necessary delay.
      Since the udelay here is only executed once per device instance,
      it seems acceptable to delay the boot for 10 us.
      
      BTW: my hardware's synopsys id is 0x37.
      
      Fixes: c5e4ddbd ("net: stmmac: Add support for optional reset control")
      Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
      Link: https://lore.kernel.org/r/AS8P193MB1285A810BD78C111E7F6AA34E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a5f5eee2
    • netfilter: nf_tables: validate NFPROTO_* family · d0009eff
      Pablo Neira Ayuso authored
      Several expressions explicitly refer to NF_INET_* hook definitions
      from expr->ops->validate, however, family is not validated.
      
      Bail out with EOPNOTSUPP in case they are used from unsupported
      families.
      
      Fixes: 0ca743a5 ("netfilter: nf_tables: add compatibility layer for x_tables")
      Fixes: a3c90f7a ("netfilter: nf_tables: flow offload expression")
      Fixes: 2fa84193 ("netfilter: nf_tables: introduce routing expression")
      Fixes: 554ced0a ("netfilter: nf_tables: add support for native socket matching")
      Fixes: ad49d86e ("netfilter: nf_tables: Add synproxy support")
      Fixes: 4ed8eb65 ("netfilter: nf_tables: Add native tproxy support")
      Fixes: 6c472602 ("netfilter: nf_tables: add xfrm expression")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d0009eff
    • netfilter: nf_tables: reject QUEUE/DROP verdict parameters · f342de4e
      Florian Westphal authored
      This reverts commit e0abdadc.
      
      core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP
      verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar,
      or 0.
      
      Due to the reverted commit, it's possible to provide a positive
      value, e.g. NF_ACCEPT (1), which results in use-after-free.
      
      It's not clear to me why this commit was made.
      
      NF_QUEUE is not used by nftables; "queue" rules in nftables
      will result in use of "nft_queue" expression.
      
      If we later need to allow specifying errno values from userspace
      (do not know why), this has to call NF_DROP_GETERR and check that
      "err <= 0" holds true.
      
      Fixes: e0abdadc ("netfilter: nf_tables: accept QUEUE/DROP verdict parameters")
      Cc: stable@vger.kernel.org
      Reported-by: Notselwyn <notselwyn@pwning.tech>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f342de4e
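
The "err <= 0" check mentioned at the end of the commit message above can be illustrated with a small model of the verdict encoding. This is a sketch under assumptions, not the kernel's code: all `toy_` names are hypothetical, the layout (verdict in the low 8 bits, errno in the upper 16) follows the description in the message, and the extraction relies on arithmetic right shift of negative ints, which gcc and clang provide.

```c
#include <errno.h>

#define TOY_NF_DROP       0
#define TOY_NF_ACCEPT     1
#define TOY_VERDICT_MASK  0xff
#define TOY_VERDICT_QBITS 16

/* Extract the errno carried in the upper 16 bits of a drop verdict.
 * Assumes arithmetic right shift for negative values. */
static int toy_drop_geterr(int verdict)
{
    return -(verdict >> TOY_VERDICT_QBITS);
}

/* Build a drop verdict that carries a negative errno such as -EPERM. */
static int toy_drop_err(int err)
{
    return (int)(((unsigned int)(-err) << TOY_VERDICT_QBITS) | TOY_NF_DROP);
}

/* Accept a drop verdict only if the embedded value is 0 or a negative errno. */
static int toy_drop_verdict_ok(int verdict)
{
    if ((verdict & TOY_VERDICT_MASK) != TOY_NF_DROP)
        return 0;
    return toy_drop_geterr(verdict) <= 0;
}
```

In this model a verdict of -65536 (bit pattern 0xffff0000) still has NF_DROP in its low bits, but the extracted value is +1, the same number as NF_ACCEPT. That is the kind of positive value the message warns about, and the check rejects it.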
    • netfilter: nf_tables: restrict anonymous set and map names to 16 bytes · b462579b
      Florian Westphal authored
      nftables has two types of sets/maps, one where userspace defines the
      name, and anonymous sets/maps, where userspace defines a template name.
      
      For the latter, the kernel requires the presence of exactly one "%d".
      nftables uses "__set%d" and "__map%d" for this.  The kernel will
      expand the format specifier and replace it with the smallest unused
      number.
      
      As-is, userspace could define a template name that pushes the set
      name past the 256-byte upper limit (post-expansion).
      
      I don't see how this could be a problem, but I would prefer if userspace
      cannot do this, so add a limit of 16 bytes for the '%d' template name.
      
      16 bytes is the old total upper limit for set names that existed when
      nf_tables was merged initially.
      
      Fixes: 38745490 ("netfilter: nf_tables: Allow set names of up to 255 chars")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b462579b
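
The two rules stated in the commit above (exactly one "%d", and a 16-byte limit on the template name) can be sketched as a standalone validator. `toy_anon_name_ok()` is a hypothetical function, not the kernel's implementation, and treating the 16 bytes as including the terminating NUL is an assumption made for this example.

```c
#include <string.h>

#define TOY_ANON_NAME_MAX 16  /* limit from the commit message; NUL included by assumption */

/*
 * Return 1 if name is an acceptable anonymous set/map template:
 * it fits in TOY_ANON_NAME_MAX bytes and contains exactly one "%d"
 * and no other '%' sequence. Return 0 otherwise.
 */
static int toy_anon_name_ok(const char *name)
{
    const char *p;
    int specifiers = 0;

    if (strlen(name) >= TOY_ANON_NAME_MAX)
        return 0;

    for (p = name; *p != '\0'; p++) {
        if (*p != '%')
            continue;
        if (p[1] != 'd')
            return 0;       /* "%s", "%%", trailing '%' and so on */
        specifiers++;
        p++;                /* skip the 'd' */
    }
    return specifiers == 1;
}
```

The names nftables itself uses, "__set%d" and "__map%d", pass this check, while a name without a specifier or with two of them does not.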
    • netfilter: nft_limit: reject configurations that cause integer overflow · c9d9eb9c
      Florian Westphal authored
      Reject bogus configs where the internal token counter wraps around.
      This only occurs with very very large requests, such as 17gbyte/s.
      
      It's better to reject this than to apply an incorrect ratelimit.
      
      Fixes: d2168e84 ("netfilter: nft_limit: add per-byte limiting")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c9d9eb9c
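
The commit above does not show the expression whose result wraps, so the following is only a generic sketch of rejecting a configuration instead of letting a 64-bit product overflow. It assumes, purely for illustration, that the token budget is a period in nanoseconds multiplied by a byte count; `toy_checked_mul_u64()` and `toy_limit_config_ok()` are hypothetical names.

```c
#include <stdint.h>

/* Multiply two 64-bit values; return -1 instead of wrapping around. */
static int toy_checked_mul_u64(uint64_t a, uint64_t b, uint64_t *out)
{
    if (a != 0 && b > UINT64_MAX / a)
        return -1;
    *out = a * b;
    return 0;
}

/*
 * Hypothetical configuration check: accept the limit only if
 * period_ns * bytes is representable in 64 bits.
 */
static int toy_limit_config_ok(uint64_t period_ns, uint64_t bytes)
{
    uint64_t tokens;

    return toy_checked_mul_u64(period_ns, bytes, &tokens) == 0;
}
```

Under that assumed formula, one second (10^9 ns) times 20 * 10^9 bytes exceeds 2^64 and is rejected, while 10^9 ns times 10^6 bytes is accepted.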
    • netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain · 01acb2e8
      Pablo Neira Ayuso authored
      Remove the netdevice from the inet/ingress basechain when a
      NETDEV_UNREGISTER event is reported; otherwise a stale reference to the
      netdevice remains in the hook list.
      
      Fixes: 60a3815d ("netfilter: add inet ingress support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      01acb2e8
    • netfilter: nf_tables: cleanup documentation · b253d87f
      George Guo authored
      - Correct comments for nlpid, family, udlen and udata in struct nft_table,
        and afinfo is no longer a member of enum nft_set_class.
      
      - Add comment for data in struct nft_set_elem.
      
      - Add comment for flags in struct nft_ctx.
      
      - Add comments for timeout in struct nft_set_iter, and flags is not a
        member of struct nft_set_iter, remove the comment for it.
      
      - Add comments for commit, abort, estimate and gc_init in struct
        nft_set_ops.
      
      - Add comments for pending_update, num_exprs, exprs and catchall_list
        in struct nft_set.
      
      - Add comment for ext_len in struct nft_set_ext_tmpl.
      
      - Add comment for inner_ops in struct nft_expr_type.
      
      - Add comments for clone, destroy_clone, reduce, gc, offload,
        offload_action, offload_stats in struct nft_expr_ops.
      
      - Add comments for blob_gen_0, blob_gen_1, bound, genmask, udlen, udata,
        blob_next in struct nft_chain.
      
      - Add comment for flags in struct nft_base_chain.
      
      - Add comments for udlen, udata in struct nft_object.
      
      - Add comment for type in struct nft_object_ops.
      
      - Add comment for hook_list in struct nft_flowtable, and remove comments
        for dev_name and ops which are not members of struct nft_flowtable.
      Signed-off-by: George Guo <guodongtai@kylinos.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b253d87f
    • net/sched: flower: Fix chain template offload · 32f2a0af
      Ido Schimmel authored
      When a qdisc is deleted from a net device the stack instructs the
      underlying driver to remove its flow offload callback from the
      associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack
      then continues to replay the removal of the filters in the block for
      this driver by iterating over the chains in the block and invoking the
      'reoffload' operation of the classifier being used. In turn, the
      classifier in its 'reoffload' operation prepares and emits a
      'FLOW_CLS_DESTROY' command for each filter.
      
      However, the stack does not do the same for chain templates and the
      underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when
      a qdisc is deleted. This results in a memory leak [1] which can be
      reproduced using [2].
      
      Fix by introducing a 'tmplt_reoffload' operation and have the stack
      invoke it with the appropriate arguments as part of the replay.
      Implement the operation in the sole classifier that supports chain
      templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}'
      command based on whether a flow offload callback is being bound to a
      filter block or being unbound from one.
      
      As far as I can tell, the issue has been present since the cited commit,
      which reordered tcf_block_offload_unbind() before
      tcf_block_flush_all_chains() in __tcf_block_put(). The order cannot be
      reversed as the filter block is expected to be freed after flushing all
      the chains.
      
      [1]
      unreferenced object 0xffff888107e28800 (size 2048):
        comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
        hex dump (first 32 bytes):
          b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff  ..|......[......
          01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff  ................
        backtrace:
          [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
          [<ffffffff81ab374e>] __kmalloc+0x4e/0x90
          [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0
          [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
          [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
          [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
          [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
          [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
          [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
          [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
          [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
          [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
          [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
          [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0
          [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0
          [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0
      unreferenced object 0xffff88816d2c0400 (size 1024):
        comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
        hex dump (first 32 bytes):
          40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00  @.......W.8.....
          10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff  ..,m......,m....
        backtrace:
          [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
          [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90
          [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0
          [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460
          [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0
          [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0
          [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
          [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
          [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
          [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
          [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
          [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
          [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
          [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
          [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
          [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
      
      [2]
       # tc qdisc add dev swp1 clsact
       # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32
       # tc qdisc del dev swp1 clsact
       # devlink dev reload pci/0000:06:00.0
      
      Fixes: bbf73830 ("net: sched: traverse chains in block with tcf_get_next_chain()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32f2a0af
    • selftests: fill in some missing configs for net · 04fe7c50
      Jakub Kicinski authored
      We are missing a lot of config options from net selftests,
      it seems:
      
      tun/tap:     CONFIG_TUN, CONFIG_MACVLAN, CONFIG_MACVTAP
      fib_tests:   CONFIG_NET_SCH_FQ_CODEL
      l2tp:        CONFIG_L2TP, CONFIG_L2TP_V3, CONFIG_L2TP_IP, CONFIG_L2TP_ETH
      sctp-vrf:    CONFIG_INET_DIAG
      txtimestamp: CONFIG_NET_CLS_U32
      vxlan_mdb:   CONFIG_BRIDGE_VLAN_FILTERING
      gre_gso:     CONFIG_NET_IPGRE_DEMUX, CONFIG_IP_GRE, CONFIG_IPV6_GRE
      srv6_end_dt*_l3vpn:   CONFIG_IPV6_SEG6_LWTUNNEL
      ip_local_port_range:  CONFIG_MPTCP
      fib_test:    CONFIG_NET_CLS_BASIC
      rtnetlink:   CONFIG_MACSEC, CONFIG_NET_SCH_HTB, CONFIG_XFRM_INTERFACE
                   CONFIG_NET_IPGRE, CONFIG_BONDING
      fib_nexthops: CONFIG_MPLS, CONFIG_MPLS_ROUTING
      vxlan_mdb:   CONFIG_NET_ACT_GACT
      tls:         CONFIG_TLS, CONFIG_CRYPTO_CHACHA20POLY1305
      psample:     CONFIG_PSAMPLE
      fcnal:       CONFIG_TCP_MD5SIG
      
      Try to add them in a semi-alphabetical order.
      
      Fixes: 62199e3f ("selftests: net: Add VXLAN MDB test")
      Fixes: c12e0d5f ("self-tests: introduce self-tests for RPS default mask")
      Fixes: 122db5e3 ("selftests/net: add MPTCP coverage for IP_LOCAL_PORT_RANGE")
      Link: https://lore.kernel.org/r/20240122203528.672004-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04fe7c50
    • hv_netvsc: Calculate correct ring size when PAGE_SIZE is not 4 Kbytes · 6941f67a
      Michael Kelley authored
      Current code in netvsc_drv_init() incorrectly assumes that PAGE_SIZE
      is 4 Kbytes, which is wrong on ARM64 with 16K or 64K page size. As a
      result, the default VMBus ring buffer size on ARM64 with 64K page size
      is 8 Mbytes instead of the expected 512 Kbytes. While this doesn't break
      anything, a typical VM with 8 vCPUs and 8 netvsc channels wastes 120
      Mbytes (8 channels * 2 ring buffers/channel * 7.5 Mbytes/ring buffer).
      
      Unfortunately, the module parameter specifying the ring buffer size
      is in units of 4 Kbyte pages. Ideally, it should be in units that
      are independent of PAGE_SIZE, but backwards compatibility prevents
      changing that now.
      
      Fix this by having netvsc_drv_init() hardcode 4096 instead of using
      PAGE_SIZE when calculating the ring buffer size in bytes. Also
      use the VMBUS_RING_SIZE macro to ensure proper alignment when running
      with page size larger than 4K.
      
      Cc: <stable@vger.kernel.org> # 5.15.x
      Fixes: 7aff79e2 ("Drivers: hv: Enable Hyper-V code to be built on ARM64")
      Signed-off-by: Michael Kelley <mhklinux@outlook.com>
      Link: https://lore.kernel.org/r/20240122162028.348885-1-mhklinux@outlook.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6941f67a
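
The figures in the commit above can be checked with a few lines of arithmetic. This sketch is not the driver code: the helper names are hypothetical, and the count of 128 units is inferred from the message (512 Kbytes divided by 4 Kbytes) rather than quoted from the source.

```c
#include <stdint.h>

#define TOY_RING_UNIT 4096u  /* the module parameter counts 4 Kbyte units */

/* Ring size in bytes for a given unit count and unit size. */
static uint64_t toy_ring_bytes(uint64_t units, uint64_t unit_size)
{
    return units * unit_size;
}

/*
 * Bytes wasted when the unit count is multiplied by page_size instead of
 * 4096, with two ring buffers (send and receive) per channel.
 */
static uint64_t toy_wasted_bytes(uint64_t units, uint64_t page_size,
                                 uint64_t channels)
{
    uint64_t per_ring = toy_ring_bytes(units, page_size) -
                        toy_ring_bytes(units, TOY_RING_UNIT);

    return channels * 2 * per_ring;
}
```

With 128 units, a 4 Kbyte unit gives 512 Kbytes per ring and a 64 Kbyte page gives 8 Mbytes, so 8 channels waste 8 * 2 * 7.5 Mbytes = 120 Mbytes, matching the message.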
    • Revert "net: macsec: use skb_ensure_writable_head_tail to expand the skb" · 3222bc99
      Rahul Rameshbabu authored
      This reverts commit b34ab352.
      
      Using skb_ensure_writable_head_tail without a call to skb_unshare causes
      the MACsec stack to operate on the original skb rather than a copy in the
      macsec_encrypt path. This causes the buffer to be exceeded in space, and
      leads to warnings generated by skb_put operations. Opting to revert this
      change since skb_copy_expand is more efficient than
      skb_ensure_writable_head_tail followed by a call to skb_unshare.
      
      Log:
        ------------[ cut here ]------------
        kernel BUG at net/core/skbuff.c:2464!
        invalid opcode: 0000 [#1] SMP KASAN
        CPU: 21 PID: 61997 Comm: iperf3 Not tainted 6.7.0-rc8_for_upstream_debug_2024_01_07_17_05 #1
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:skb_put+0x113/0x190
        Code: 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 70 3b 9d bc 00 00 00 77 0e 48 83 c4 08 4c 89 e8 5b 5d 41 5d c3 <0f> 0b 4c 8b 6c 24 20 89 74 24 04 e8 6d b7 f0 fe 8b 74 24 04 48 c7
        RSP: 0018:ffff8882694e7278 EFLAGS: 00010202
        RAX: 0000000000000025 RBX: 0000000000000100 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff88816ae0bad4
        RBP: ffff88816ae0ba60 R08: 0000000000000004 R09: 0000000000000004
        R10: 0000000000000001 R11: 0000000000000001 R12: ffff88811ba5abfa
        R13: ffff8882bdecc100 R14: ffff88816ae0ba60 R15: ffff8882bdecc0ae
        FS:  00007fe54df02740(0000) GS:ffff88881f080000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fe54d92e320 CR3: 000000010a345003 CR4: 0000000000370eb0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <TASK>
         ? die+0x33/0x90
         ? skb_put+0x113/0x190
         ? do_trap+0x1b4/0x3b0
         ? skb_put+0x113/0x190
         ? do_error_trap+0xb6/0x180
         ? skb_put+0x113/0x190
         ? handle_invalid_op+0x2c/0x30
         ? skb_put+0x113/0x190
         ? exc_invalid_op+0x2b/0x40
         ? asm_exc_invalid_op+0x16/0x20
         ? skb_put+0x113/0x190
         ? macsec_start_xmit+0x4e9/0x21d0
         macsec_start_xmit+0x830/0x21d0
         ? get_txsa_from_nl+0x400/0x400
         ? lock_downgrade+0x690/0x690
         ? dev_queue_xmit_nit+0x78b/0xae0
         dev_hard_start_xmit+0x151/0x560
         __dev_queue_xmit+0x1580/0x28f0
         ? check_chain_key+0x1c5/0x490
         ? netdev_core_pick_tx+0x2d0/0x2d0
         ? __ip_queue_xmit+0x798/0x1e00
         ? lock_downgrade+0x690/0x690
         ? mark_held_locks+0x9f/0xe0
         ip_finish_output2+0x11e4/0x2050
         ? ip_mc_finish_output+0x520/0x520
         ? ip_fragment.constprop.0+0x230/0x230
         ? __ip_queue_xmit+0x798/0x1e00
         __ip_queue_xmit+0x798/0x1e00
         ? __skb_clone+0x57a/0x760
         __tcp_transmit_skb+0x169d/0x3490
         ? lock_downgrade+0x690/0x690
         ? __tcp_select_window+0x1320/0x1320
         ? mark_held_locks+0x9f/0xe0
         ? lockdep_hardirqs_on_prepare+0x286/0x400
         ? tcp_small_queue_check.isra.0+0x120/0x3d0
         tcp_write_xmit+0x12b6/0x7100
         ? skb_page_frag_refill+0x1e8/0x460
         __tcp_push_pending_frames+0x92/0x320
         tcp_sendmsg_locked+0x1ed4/0x3190
         ? tcp_sendmsg_fastopen+0x650/0x650
         ? tcp_sendmsg+0x1a/0x40
         ? mark_held_locks+0x9f/0xe0
         ? lockdep_hardirqs_on_prepare+0x286/0x400
         tcp_sendmsg+0x28/0x40
         ? inet_send_prepare+0x1b0/0x1b0
         __sock_sendmsg+0xc5/0x190
         sock_write_iter+0x222/0x380
         ? __sock_sendmsg+0x190/0x190
         ? kfree+0x96/0x130
         vfs_write+0x842/0xbd0
         ? kernel_write+0x530/0x530
         ? __fget_light+0x51/0x220
         ? __fget_light+0x51/0x220
         ksys_write+0x172/0x1d0
         ? update_socket_protocol+0x10/0x10
         ? __x64_sys_read+0xb0/0xb0
         ? lockdep_hardirqs_on_prepare+0x286/0x400
         do_syscall_64+0x40/0xe0
         entry_SYSCALL_64_after_hwframe+0x46/0x4e
        RIP: 0033:0x7fe54d9018b7
        Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
        RSP: 002b:00007ffdbd4191d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000025 RCX: 00007fe54d9018b7
        RDX: 0000000000000025 RSI: 0000000000d9859c RDI: 0000000000000004
        RBP: 0000000000d9859c R08: 0000000000000004 R09: 0000000000000000
        R10: 00007fe54d80afe0 R11: 0000000000000246 R12: 0000000000000004
        R13: 0000000000000025 R14: 00007fe54e00ec00 R15: 0000000000d982a0
         </TASK>
        Modules linked in: 8021q garp mrp iptable_raw bonding vfio_pci rdma_ucm ib_umad mlx5_vfio_pci mlx5_ib vfio_pci_core vfio_iommu_type1 ib_uverbs vfio mlx5_core ip_gre nf_tables ipip tunnel4 ib_ipoib ip6_gre gre ip6_tunnel tunnel6 geneve openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core zram zsmalloc fuse [last unloaded: ib_uverbs]
        ---[ end trace 0000000000000000 ]---
      
      Cc: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Link: https://lore.kernel.org/r/20240118191811.50271-1-rrameshbabu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3222bc99
  4. 23 Jan, 2024 5 commits
    • Jakub Kicinski's avatar
      Merge tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · 1347775d
      Jakub Kicinski authored
      Kalle Valo says:
      
      ====================
      wireless fixes for v6.8-rc2
      
      The most visible fix here is for an ath11k crash that was introduced
      in v6.7. We also have a fix for an iwlwifi memory corruption and a few
      smaller fixes in the stack.
      
      * tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
        wifi: mac80211: fix race condition on enabling fast-xmit
        wifi: iwlwifi: fix a memory corruption
        wifi: mac80211: fix potential sta-link leak
        wifi: cfg80211/mac80211: remove dependency on non-existing option
        wifi: cfg80211: fix missing interfaces when dumping
        wifi: ath11k: rely on mac80211 debugfs handling for vif
        wifi: p54: fix GCC format truncation warning with wiphy->fw_version
      ====================
      
      Link: https://lore.kernel.org/r/20240122153434.E0254C433C7@smtp.kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1347775d
    • Zhengchao Shao's avatar
      ipv6: init the accept_queue's spinlocks in inet6_create · 435e202d
      Zhengchao Shao authored
      In commit 198bc90e ("tcp: make sure init the accept_queue's spinlocks
      once"), the spinlocks of accept_queue are initialized only when a socket is
      created in the inet4 scenario. The locks are not initialized when a socket
      is created in the inet6 scenario. The kernel reports the following error:
      INFO: trying to register non-static key.
      The code is fine but needs lockdep annotation, or maybe
      you didn't initialize this object before use?
      turning off the locking correctness validator.
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      Call Trace:
      <TASK>
      	dump_stack_lvl (lib/dump_stack.c:107)
      	register_lock_class (kernel/locking/lockdep.c:1289)
      	__lock_acquire (kernel/locking/lockdep.c:5015)
      	lock_acquire.part.0 (kernel/locking/lockdep.c:5756)
      	_raw_spin_lock_bh (kernel/locking/spinlock.c:178)
      	inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386)
      	tcp_disconnect (net/ipv4/tcp.c:2981)
      	inet_shutdown (net/ipv4/af_inet.c:935)
      	__sys_shutdown (./include/linux/file.h:32 net/socket.c:2438)
      	__x64_sys_shutdown (net/socket.c:2445)
      	do_syscall_64 (arch/x86/entry/common.c:52)
      	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f52ecd05a3d
      Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
      RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030
      RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d
      RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640
      R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0
      
      Fixes: 198bc90e ("tcp: make sure init the accept_queue's spinlocks once")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240122102001.2851701-1-shaozhengchao@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      435e202d
    • Zhengchao Shao's avatar
      netlink: fix potential sleeping issue in mqueue_flush_file · 234ec0b6
      Zhengchao Shao authored
      The potential sleeping issue arises from the following interleaving of two threads:
      Thread A                                Thread B
      ...                                     netlink_create  //ref = 1
      do_mq_notify                            ...
        sock = netlink_getsockbyfilp          ...     //ref = 2
        info->notify_sock = sock;             ...
      ...                                     netlink_sendmsg
      ...                                       skb = netlink_alloc_large_skb  //skb->head is vmalloced
      ...                                       netlink_unicast
      ...                                         sk = netlink_getsockbyportid //ref = 3
      ...                                         netlink_sendskb
      ...                                           __netlink_sendskb
      ...                                             skb_queue_tail //put skb to sk_receive_queue
      ...                                         sock_put //ref = 2
      ...                                     ...
      ...                                     netlink_release
      ...                                       deferred_put_nlk_sk //ref = 1
      mqueue_flush_file
        spin_lock
        remove_notification
          netlink_sendskb
            sock_put  //ref = 0
              sk_free
                ...
                __sk_destruct
                  netlink_sock_destruct
                    skb_queue_purge  //get skb from sk_receive_queue
                      ...
                      __skb_queue_purge_reason
                        kfree_skb_reason
                          __kfree_skb
                          ...
                          skb_release_all
                            skb_release_head_state
                              netlink_skb_destructor
                                vfree(skb->head)  //sleeping while holding spinlock
      
      In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
      vmalloc and the skb is put on the sk_receive_queue queue without being
      freed, the sleeping bug occurs when the mqueue executes flush, because
      vfree() is then called while a spinlock is held. Use
      vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.
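
      As a rough illustration of why a deferred free avoids the problem, the
      userspace sketch below models the idea only: a free requested in a context
      that must not sleep is queued, and the actual release happens later from a
      context that may block. The names free_deferred() and drain_deferred() are
      hypothetical and are not kernel APIs; this is not the kernel's
      implementation of vfree_atomic().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: each queued buffer doubles as its own list node,
 * so queueing needs no allocation and cannot block. */
struct deferred_node {
    struct deferred_node *next;
};

static struct deferred_node *deferred_head;

/* Safe in a (modeled) atomic context: only links the buffer into a list.
 * The buffer must be at least sizeof(struct deferred_node) bytes. */
static void free_deferred(void *buf)
{
    struct deferred_node *node = buf;

    assert(buf != NULL);
    node->next = deferred_head;
    deferred_head = node;
}

/* Runs later from a context that is allowed to sleep; returns how many
 * buffers were actually released. */
static size_t drain_deferred(void)
{
    size_t released = 0;

    while (deferred_head) {
        struct deferred_node *next = deferred_head->next;

        free(deferred_head);
        deferred_head = next;
        released++;
    }
    return released;
}
```

      In this model, the destructor path would call free_deferred() while the
      lock is held, and drain_deferred() would run after the lock is dropped.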
      
      Fixes: c05cdb1b ("netlink: allow large data transfers from user-space")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20240122011807.2110357-1-shaozhengchao@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      234ec0b6
    • Kuniyuki Iwashima's avatar
      selftest: Don't reuse port for SO_INCOMING_CPU test. · 97de5a15
      Kuniyuki Iwashima authored
      Jakub reported that ASSERT_EQ(cpu, i) in so_incoming_cpu.c seems to
      fire somewhat randomly.
      
        # #  RUN           so_incoming_cpu.before_reuseport.test3 ...
        # # so_incoming_cpu.c:191:test3:Expected cpu (32) == i (0)
        # # test3: Test terminated by assertion
        # #          FAIL  so_incoming_cpu.before_reuseport.test3
        # not ok 3 so_incoming_cpu.before_reuseport.test3
      
      When the test failed, not-yet-accepted CLOSE_WAIT sockets received
      SYN with a "challenging" SEQ number, which was sent from an unexpected
      CPU that did not create the receiver.
      
      The test basically does:
      
        1. for each cpu:
          1-1. create a server
          1-2. set SO_INCOMING_CPU
      
        2. for each cpu:
          2-1. set cpu affinity
          2-2. create some clients
          2-3. let clients connect() to the server on the same cpu
          2-4. close() clients
      
        3. for each server:
          3-1. accept() all child sockets
          3-2. check if all children have the same SO_INCOMING_CPU with the server
      
      The root cause was the close() in 2-4. and net.ipv4.tcp_tw_reuse.
      
      In a loop of 2., close() changed the client state to FIN_WAIT_2, and
      the peer transitioned to CLOSE_WAIT.
      
      In another loop of 2., connect() happened to select the same port of
      the FIN_WAIT_2 socket, and it was reused as the default value of
      net.ipv4.tcp_tw_reuse is 2.
      
      As a result, the new client sent SYN to the CLOSE_WAIT socket from
      a different CPU, and the receiver's sk_incoming_cpu was overwritten
      with an unexpected CPU ID.
      
      Also, the SYN had a different SEQ number, so the CLOSE_WAIT socket
      responded with Challenge ACK.  The new client properly returned RST
      and effectively killed the CLOSE_WAIT socket.
      
      This way, all clients were created successfully, but the error was
      detected later by 3-2., ASSERT_EQ(cpu, i).
      
      To avoid the failure, let's make sure that (i) the number of clients
      is less than the number of available ports and (ii) such reuse never
      happens.
      
      Fixes: 6df96146 ("selftest: Add test for SO_INCOMING_CPU.")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Tested-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20240120031642.67014-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      97de5a15
    • Salvatore Dipietro's avatar
      tcp: Add memory barrier to tcp_push() · 7267e8dc
      Salvatore Dipietro authored
      On CPUs with weak memory models, reads and updates performed by tcp_push
      to the sk variables can get reordered leaving the socket throttled when
      it should not. The tasklet running tcp_wfree() may also not observe the
      memory updates in time and will skip flushing any packets throttled by
      tcp_push(), delaying the sending. This can pathologically cause 40ms
      extra latency due to bad interactions with delayed acks.
      
      Adding a memory barrier in tcp_push removes the bug, similarly to the
      previous commit bf06200e ("tcp: tsq: fix nonagle handling").
      smp_mb__after_atomic() is used so that no unnecessary overhead is
      incurred on x86, which is not affected.
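
      The ordering requirement can be sketched in userspace with C11 atomics.
      This is a simplified model, not the kernel code: tsq_flags and wmem_alloc
      stand in for the socket fields, THROTTLED for the throttled bit, and the
      sequentially consistent fence plays the role that smp_mb__after_atomic()
      plays in the patch. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define THROTTLED 1u

static atomic_uint tsq_flags;   /* models the socket's throttle flags */
static atomic_int wmem_alloc;   /* models packets still owned by the TX path */

/* Sender side: publish the throttled flag, then re-check for in-flight
 * packets. Without the fence, a weakly ordered CPU could perform the load
 * before the flag store becomes visible to the completion side. */
static bool should_defer_push(void)
{
    atomic_fetch_or_explicit(&tsq_flags, THROTTLED, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&wmem_alloc, memory_order_relaxed) > 0;
}

/* Completion side: release one packet, then consume the flag. Returns true
 * when the sender had throttled and a flush is needed. */
static bool tx_complete_should_flush(void)
{
    int prev = atomic_fetch_sub_explicit(&wmem_alloc, 1, memory_order_seq_cst);
    unsigned int old;

    assert(prev > 0);
    old = atomic_fetch_and_explicit(&tsq_flags, ~THROTTLED,
                                    memory_order_seq_cst);
    return (old & THROTTLED) != 0;
}
```

      With the fence in place, either the sender sees that no packet is in
      flight and pushes immediately, or the completion side sees the flag and
      flushes; the throttled data is not left waiting for an unrelated event.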
      
      Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
      22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
      
      import java.io.IOException;
      import java.io.OutputStreamWriter;
      import java.io.PrintWriter;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;
      
      public class HelloWorldServlet extends HttpServlet {
          @Override
          protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
              response.setContentType("text/html;charset=utf-8");
              OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
              String s = "a".repeat(3096);
              osw.write(s,0,s.length());
              osw.flush();
          }
      }
      
      Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
      c6i.8xlarge instance. Before the patch, an additional 40ms of latency is
      observed at P99.99 and above; with the patch, the extra latency disappears.
      
      No patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.91ms
       75.000%    1.13ms
       90.000%    1.46ms
       99.000%    1.74ms
       99.900%    1.89ms
       99.990%   41.95ms  <<< 40+ ms extra latency
       99.999%   48.32ms
      100.000%   48.96ms
      
      With patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.90ms
       75.000%    1.13ms
       90.000%    1.45ms
       99.000%    1.72ms
       99.900%    1.83ms
       99.990%    2.11ms  <<< no 40+ ms extra latency
       99.999%    2.53ms
      100.000%    2.62ms
      
      The patch has also been tested on x86 (m7i.2xlarge instance), which is
      not affected by this issue, and the patch doesn't introduce any additional
      delay.
      
      Fixes: 7aa5470c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
      Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7267e8dc
  5. 22 Jan, 2024 4 commits
    • Sharath Srinivasan's avatar
      net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv · 13e788de
      Sharath Srinivasan authored
      A syzkaller UBSAN crash occurs in rds_cmsg_recv(),
      which reads inc->i_rx_lat_trace[j + 1] with index 4 (3 + 1),
      while the array size is only 4 (RDS_RX_MAX_TRACES).
      Here 'j' is assigned from rs->rs_rx_trace[i], which in turn comes from
      trace.rx_trace_pos[i] in rds_recv_track_latency(),
      with both arrays sized 3 (RDS_MSG_RX_DGRAM_TRACE_MAX). So fix the
      off-by-one bounds check in rds_recv_track_latency() to prevent
      a potential crash in rds_cmsg_recv().
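
      The arithmetic can be checked with a small standalone sketch. The constants
      mirror the sizes quoted above (3 and 4), but the helper names and the exact
      comparison operators are assumptions made for illustration; the commit
      message does not show the diff itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRACE_POS_MAX 3 /* stands in for RDS_MSG_RX_DGRAM_TRACE_MAX */
#define RX_MAX_TRACES 4 /* stands in for RDS_RX_MAX_TRACES */

/* Assumed shape of an off-by-one check: it still accepts pos == 3. */
static bool pos_accepted_buggy(unsigned int pos)
{
    return !(pos > TRACE_POS_MAX);
}

/* Assumed shape of the corrected check: only 0..2 are accepted. */
static bool pos_accepted_fixed(unsigned int pos)
{
    return !(pos >= TRACE_POS_MAX);
}

/* The reader later indexes a 4-element array with pos + 1. */
static bool read_in_bounds(unsigned int pos)
{
    return (size_t)pos + 1 < RX_MAX_TRACES;
}
```

      With pos == 3, the first check passes although index 4 lies outside a
      4-element array; the second check rejects that value, so every accepted
      position yields an in-bounds read.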
      
      Found by syzkaller:
      =================================================================
      UBSAN: array-index-out-of-bounds in net/rds/recv.c:585:39
      index 4 is out of range for type 'u64 [4]'
      CPU: 1 PID: 8058 Comm: syz-executor228 Not tainted 6.6.0-gd2f51b35 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.15.0-1 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
       ubsan_epilogue lib/ubsan.c:217 [inline]
       __ubsan_handle_out_of_bounds+0xd5/0x130 lib/ubsan.c:348
       rds_cmsg_recv+0x60d/0x700 net/rds/recv.c:585
       rds_recvmsg+0x3fb/0x1610 net/rds/recv.c:716
       sock_recvmsg_nosec net/socket.c:1044 [inline]
       sock_recvmsg+0xe2/0x160 net/socket.c:1066
       __sys_recvfrom+0x1b6/0x2f0 net/socket.c:2246
       __do_sys_recvfrom net/socket.c:2264 [inline]
       __se_sys_recvfrom net/socket.c:2260 [inline]
       __x64_sys_recvfrom+0xe0/0x1b0 net/socket.c:2260
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      ==================================================================
      
      Fixes: 3289025a ("RDS: add receive message trace used by application")
      Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
      Closes: https://lore.kernel.org/linux-rdma/CALGdzuoVdq-wtQ4Az9iottBqC5cv9ZhcE5q8N7LfYFvkRsOVcw@mail.gmail.com/
      Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13e788de
    • Horatiu Vultur's avatar
      net: micrel: Fix PTP frame parsing for lan8814 · aaf632f7
      Horatiu Vultur authored
      The HW has the capability to check, for each frame, whether it is a PTP
      frame, which domain it belongs to, which PTP frame type it is, and
      different IP addresses in the frame. If one of these checks fails, the
      frame is not timestamped. Most of these checks were disabled, except the
      check of the field minorVersionPTP inside the PTP header. This means that
      once a partner sends a frame compliant with 802.1AS, which has
      minorVersionPTP set to 1, the frame is not timestamped, because the HW
      expects by default a value of 0 in minorVersionPTP. This is exactly the
      same issue as on lan8841.
      Fix this issue by removing this check so that userspace can decide on this.
      
      Fixes: ece19502 ("net: phy: micrel: 1588 support for LAN8814 phy")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Reviewed-by: Divya Koppera <divya.koppera@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aaf632f7
    • David S. Miller's avatar
      Merge branch 'dpll-fixes' · 94fa82b0
      David S. Miller authored
      Arkadiusz Kubalewski says:
      
      ====================
      dpll: fix unordered unbind/bind registerer issues
      
      Fix issues when performing unordered unbind/bind of kernel modules
      which are using a dpll device with DPLL_PIN_TYPE_MUX pins.
      Currently only serialized bind/unbind of such a use case works; fix
      the issues and allow for an unserialized kernel module bind order.
      
      The issues are observed on the ice driver, i.e.,
      
      $ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind
      $ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind
      
      results in:
      
      ice 0000:af:00.0: Removed PTP clock
      BUG: kernel NULL pointer dereference, address: 0000000000000010
      PF: supervisor read access in kernel mode
      PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-rc5_next-queue_19th-Oct-2023-01625-g039e5d15e451 #1
      Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
      RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
      Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c 8b 66 08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6 5b 10 41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b
      RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90
      RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000
      R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0
      R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000
      FS:  00007fdc7dae0740(0000) GS:ffff888c105c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die+0x20/0x70
       ? page_fault_oops+0x76/0x170
       ? exc_page_fault+0x65/0x150
       ? asm_exc_page_fault+0x22/0x30
       ? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
       ? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice]
       dpll_msg_add_pin_parents+0x142/0x1d0
       dpll_pin_event_send+0x7d/0x150
       dpll_pin_on_pin_unregister+0x3f/0x100
       ice_dpll_deinit_pins+0xa1/0x230 [ice]
       ice_dpll_deinit+0x29/0xe0 [ice]
       ice_remove+0xcd/0x200 [ice]
       pci_device_remove+0x33/0xa0
       device_release_driver_internal+0x193/0x200
       unbind_store+0x9d/0xb0
       kernfs_fop_write_iter+0x128/0x1c0
       vfs_write+0x2bb/0x3e0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x59/0x90
       ? filp_close+0x1b/0x30
       ? do_dup2+0x7d/0xd0
       ? syscall_exit_work+0x103/0x130
       ? syscall_exit_to_user_mode+0x22/0x40
       ? do_syscall_64+0x69/0x90
       ? syscall_exit_work+0x103/0x130
       ? syscall_exit_to_user_mode+0x22/0x40
       ? do_syscall_64+0x69/0x90
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      RIP: 0033:0x7fdc7d93eb97
      Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97
      RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001
      RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0
      R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d
      R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0
       </TASK>
      Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1 vfio irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs libcrc32c rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr i2c_i801 ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper ice crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e libahci ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded: iavf]
      CR2: 0000000000000010
      
      v6:
      - fix memory corruption on error path in patch [v5 2/4]
      ====================
      Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94fa82b0
    • Arkadiusz Kubalewski's avatar
      dpll: fix register pin with unregistered parent pin · 7dc5b18f
      Arkadiusz Kubalewski authored
      In case of multiple kernel module instances using the same dpll device:
      if only one registers the dpll device, then only that one can register
      directly connected pins with the dpll device. When an unregistered parent
      is responsible for determining if the muxed pin can be registered with it
      or not, the drivers need to be loaded in serialized order to work
      correctly: first the driver instance which registers the direct pins
      needs to be loaded, then the other instances can register muxed type
      pins.

      Allow registration of a pin with a parent even if the parent was not
      yet registered, thus allowing an unserialized driver instance
      load order.
      Do not WARN_ON a notification for an unregistered pin, which can be
      invoked in the described case; instead just return an error.
      
      Fixes: 9431063a ("dpll: core: Add DPLL framework base functions")
      Fixes: 9d71b54b ("dpll: netlink: Add DPLL framework base functions")
      Reviewed-by: Jan Glaza <jan.glaza@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7dc5b18f