1. 09 Mar, 2023 3 commits
    • af_unix: fix struct pid leaks in OOB support · 2aab4b96
      Eric Dumazet authored
      syzbot reported struct pid leak [1].
      
      Issue is that queue_oob() calls maybe_add_creds() which potentially
      holds a reference on a pid.
      
      But skb->destructor is not set (either directly or by calling
      unix_scm_to_skb()).
      
      This means that subsequent kfree_skb() or consume_skb() would leak
      this reference.
      
      In this fix, I chose to fully support scm even for the OOB message.
      
      [1]
      BUG: memory leak
      unreferenced object 0xffff8881053e7f80 (size 128):
      comm "syz-executor242", pid 5066, jiffies 4294946079 (age 13.220s)
      hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<ffffffff812ae26a>] alloc_pid+0x6a/0x560 kernel/pid.c:180
      [<ffffffff812718df>] copy_process+0x169f/0x26c0 kernel/fork.c:2285
      [<ffffffff81272b37>] kernel_clone+0xf7/0x610 kernel/fork.c:2684
      [<ffffffff812730cc>] __do_sys_clone+0x7c/0xb0 kernel/fork.c:2825
      [<ffffffff849ad699>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      [<ffffffff849ad699>] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
      [<ffffffff84a0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Reported-by: syzbot+7699d9e5635c10253a27@syzkaller.appspotmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Rao Shoaib <rao.shoaib@oracle.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230307164530.771896-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
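
Illustration (not part of the commit): a minimal userspace sketch of the leak described above, assuming a buffer type whose destructor is the only place the pid reference is dropped. `struct pid_ref`, `struct buf` and every function name here are hypothetical stand-ins invented for this sketch; they are not kernel APIs and do not reproduce the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted struct pid. */
struct pid_ref {
    int count;
};

/* Hypothetical stand-in for a socket buffer that may pin a pid. */
struct buf {
    struct pid_ref *pid;
    void (*destructor)(struct buf *b);
};

/* Take a reference on the pid, as attaching credentials would. */
static void attach_creds(struct buf *b, struct pid_ref *pid)
{
    pid->count++;
    b->pid = pid;
}

/* Destructor that drops the reference taken by attach_creds(). */
static void release_creds(struct buf *b)
{
    if (b->pid) {
        assert(b->pid->count > 0);
        b->pid->count--;
        b->pid = NULL;
    }
}

/* Free the buffer, running the destructor only if one was installed. */
static void free_buf(struct buf *b)
{
    if (b->destructor)
        b->destructor(b);
    free(b);
}

/* Returns the pid refcount left after attaching creds and freeing one buffer. */
int refs_after_free(int set_destructor)
{
    struct pid_ref pid = { .count = 1 };
    struct buf *b = calloc(1, sizeof(*b));

    if (!b)
        return -1;
    attach_creds(b, &pid);
    if (set_destructor)
        b->destructor = release_creds;
    free_buf(b);
    return pid.count;
}
```

In this model `refs_after_free(0)` leaves the count at 2, because nothing drops the extra reference, while `refs_after_free(1)` returns it to 1.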
    • eth: fealnx: bring back this old driver · 8f148208
      Jakub Kicinski authored
      This reverts commit d5e2d038.
      
      We have a report of this chip being used on a
      
        SURECOM EP-320X-S 100/10M Ethernet PCI Adapter
      
      which could still have been purchased in some parts
      of the world 3 years ago.
      
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217151
      Fixes: d5e2d038 ("eth: fealnx: delete the driver for Myson MTD-800")
      Link: https://lore.kernel.org/r/20230307171930.4008454-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mt7530: permit port 5 to work without port 6 on MT7621 SoC · c8b8a3c6
      Vladimir Oltean authored
      The MT7530 switch from the MT7621 SoC has 2 ports which can be set up as
      internal: port 5 and 6. Arınç reports that the GMAC1 attached to port 5
      receives corrupted frames, unless port 6 (attached to GMAC0) has been
      brought up by the driver. This is true regardless of whether port 5 is
      used as a user port or as a CPU port (carrying DSA tags).
      
      Offline debugging (blind for me) which began in the linked thread showed
      experimentally that the configuration done by the driver for port 6
      contains a step which is needed by port 5 as well - the write to
      CORE_GSWPLL_GRP2 (note that I've no idea as to what it does, apart from
      the comment "Set core clock into 500Mhz"). Prints put by Arınç show that
      the reset value of CORE_GSWPLL_GRP2 is RG_GSWPLL_POSDIV_500M(1) |
      RG_GSWPLL_FBKDIV_500M(40) (0x128), both on the MCM MT7530 from the
      MT7621 SoC, as well as on the standalone MT7530 from MT7623NI Bananapi
      BPI-R2. Apparently, port 5 on the standalone MT7530 can work under both
      values of the register, while on the MT7621 SoC it cannot.
      
      The call path that triggers the register write is:
      
      mt753x_phylink_mac_config() for port 6
      -> mt753x_pad_setup()
         -> mt7530_pad_clk_setup()
      
      so this fully explains the behavior noticed by Arınç, that bringing port
      6 up is necessary.
      
      The simplest fix for the problem is to extract the register writes which
      are needed for both port 5 and 6 into a common mt7530_pll_setup()
      function, which is called at mt7530_setup() time, immediately after
      switch reset. We can argue that this mirrors the code layout introduced
      in mt7531_setup() by commit 42bc4faf ("net: mt7531: only do PLL once
      after the reset"), in that the PLL setup has the exact same positioning,
      and further work to consolidate the separate setup() functions is not
      hindered.
      
      Testing confirms that:
      
      - the slight reordering of writes to MT7530_P6ECR and to
        CORE_GSWPLL_GRP1 / CORE_GSWPLL_GRP2 introduced by this change does not
        appear to cause problems for the operation of port 6 on MT7621 and on
        MT7623 (where port 5 also always worked)
      
      - packets sent through port 5 are not corrupted anymore, regardless of
        whether port 6 is enabled by phylink or not (or even present in the
        device tree)
      
      My algorithm for determining the Fixes: tag is as follows. Testing shows
      that some logic from mt7530_pad_clk_setup() is needed even for port 5.
      Prior to commit ca366d6c ("net: dsa: mt7530: Convert to PHYLINK
      API"), a call did exist for all phy_is_pseudo_fixed_link() ports - so
      port 5 included. That commit replaced it with a temporary "Port 5 is not
      supported!" comment, and the following commit 38f790a8 ("net: dsa:
      mt7530: Add support for port 5") replaced that comment with a
      configuration procedure in mt7530_setup_port5() which was insufficient
      for port 5 to work. I'm laying the blame on the patch that claimed
      support for port 5, although one would have also needed the change from
      commit c3b8e079 ("net: dsa: mt7530: setup core clock even in TRGMII
      mode") for the write to be performed completely independently from port
      6's configuration.
      
      Thanks go to Arınç for describing the problem, for debugging and for
      testing.
      Reported-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Link: https://lore.kernel.org/netdev/f297c2c4-6e7c-57ac-2394-f6025d309b9d@arinc9.com/
      Fixes: 38f790a8 ("net: dsa: mt7530: Add support for port 5")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230307155411.868573-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 08 Mar, 2023 3 commits
  3. 07 Mar, 2023 9 commits
  4. 06 Mar, 2023 15 commits
    • net: tls: fix device-offloaded sendpage straddling records · e539a105
      Jakub Kicinski authored
      Adrien reports that incorrect data is transmitted when a single
      page straddles multiple records. We would transmit the same
      data in all iterations of the loop.
      Reported-by: Adrien Moulin <amoulin@corp.free.fr>
      Link: https://lore.kernel.org/all/61481278.42813558.1677845235112.JavaMail.zimbra@corp.free.fr
      Fixes: c1318b39 ("tls: Add opt-in zerocopy mode of sendfile()")
      Tested-by: Adrien Moulin <amoulin@corp.free.fr>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Link: https://lore.kernel.org/r/20230304192610.3818098-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
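
Illustration (not part of the commit): a simplified userspace model of splitting one page across several records. It assumes the failure mode is a read position that does not advance between loop iterations, which is how the message above describes the symptom. The function name, its parameters and the flat output buffer are hypothetical and are not taken from the TLS device-offload code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `len` bytes from `page` into `out` in chunks of at most `rec_max`
 * bytes, one chunk per record. Returns the number of records used.
 */
size_t send_in_records(const char *page, size_t len, size_t rec_max, char *out)
{
    size_t offset = 0;
    size_t records = 0;

    assert(rec_max > 0);

    while (len > 0) {
        size_t copy = len < rec_max ? len : rec_max;

        /*
         * Read from page + offset. Reading from `page` alone would put
         * the first chunk into every record, which is the symptom the
         * commit message describes.
         */
        memcpy(out + offset, page + offset, copy);
        offset += copy;
        len -= copy;
        records++;
    }
    return records;
}
```

With a 10-byte page and a 4-byte record limit, the loop runs three times (4, 4 and 2 bytes) and the output matches the input only because the source offset advances together with the destination offset.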
    • net: ethernet: mtk_eth_soc: fix RX data corruption issue · 193250ac
      Daniel Golle authored
      Fix data corruption issue with SerDes connected PHYs operating at 1.25
      Gbps speed where we could previously observe about 30% packet loss while
      the bad packet counter was increasing.
      
      As almost all boards with MediaTek MT7622 or MT7986 use either the MT7531
      switch IC operating at 3.125Gbps SerDes rate or single-port PHYs using
      rate-adaptation to 2500Base-X mode, this issue only got exposed now when
      we started trying to use SFP modules operating with 1.25 Gbps with the
      BananaPi R3 board.
      
      The fix is to set bit 12, which disables the RX FIFO clear function,
      when setting up MAC MCR. The MediaTek SDK made the same change, stating:
      "If without this patch, kernel might receive invalid packets that are
      corrupted by GMAC."[1]
      
      [1]: https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/+/d8a2975939a12686c4a95c40db21efdc3f821f63
      
      Fixes: 42c03844 ("net-next: mediatek: add support for MediaTek MT7622 SoC")
      Tested-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: Daniel Golle <daniel@makrotopia.org>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/138da2735f92c8b6f8578ec2e5a794ee515b665f.1677937317.git.daniel@makrotopia.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: smsc: fix link up detection in forced irq mode · 58aac3a2
      Heiner Kallweit authored
      Currently, link up can't be detected in forced mode if polling
      isn't used. The only link-up interrupt source we have is aneg
      complete, which isn't applicable in forced mode. Therefore we
      have to use energy-on as the link-up indicator.
      
      Fixes: 73654945 ("net: phy: smsc: skip ENERGYON interrupt if disabled")
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'fix resolving VAR after DATASEC' · 32dfc59e
      Martin KaFai Lau authored
      Lorenz Bauer says:
      
      ====================
      
      See the first patch for a detailed explanation.
      
      v2:
      - Move RESOLVE_TBD assignment out of the loop (Martin)
      ====================
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: check that modifier resolves after pointer · dfdd608c
      Lorenz Bauer authored
      Add a regression test that ensures that a VAR pointing at a
      modifier which follows a PTR (or STRUCT or ARRAY) is resolved
      correctly by the datasec validator.
      Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
      Link: https://lore.kernel.org/r/20230306112138.155352-3-lmb@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR · 9b459804
      Lorenz Bauer authored
      btf_datasec_resolve contains a bug that causes the following BTF
      to fail loading:
      
          [1] DATASEC a size=2 vlen=2
              type_id=4 offset=0 size=1
              type_id=7 offset=1 size=1
          [2] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
          [3] PTR (anon) type_id=2
          [4] VAR a type_id=3 linkage=0
          [5] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
          [6] TYPEDEF td type_id=5
          [7] VAR b type_id=6 linkage=0
      
      This error message is printed during btf_check_all_types:
      
          [1] DATASEC a size=2 vlen=2
              type_id=7 offset=1 size=1 Invalid type
      
      By tracing btf_*_resolve we can pinpoint the problem:
      
          btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_TBD) = 0
              btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_TBD) = 0
                  btf_ptr_resolve(depth: 3, type_id: 3, mode: RESOLVE_PTR) = 0
              btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_PTR) = 0
          btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_PTR) = -22
      
      The last invocation of btf_datasec_resolve should invoke btf_var_resolve
      by means of env_stack_push, instead it returns EINVAL. The reason is that
      env_stack_push is never executed for the second VAR.
      
          if (!env_type_is_resolve_sink(env, var_type) &&
              !env_type_is_resolved(env, var_type_id)) {
              env_stack_set_next_member(env, i + 1);
              return env_stack_push(env, var_type, var_type_id);
          }
      
      env_type_is_resolve_sink() changes its behaviour based on resolve_mode.
      For RESOLVE_PTR, we can simplify the if condition to the following:
      
          (btf_type_is_modifier() || btf_type_is_ptr()) && !env_type_is_resolved()
      
      Since we're dealing with a VAR the clause evaluates to false. This is
      not sufficient to trigger the bug however. The log output and EINVAL
      are only generated if btf_type_id_size() fails.
      
          if (!btf_type_id_size(btf, &type_id, &type_size)) {
              btf_verifier_log_vsi(env, v->t, vsi, "Invalid type");
              return -EINVAL;
          }
      
      Most types are sized, so for example a VAR referring to an INT is not a
      problem. The bug is only triggered if a VAR points at a modifier. Since
      we skipped btf_var_resolve that modifier was also never resolved, which
      means that btf_resolved_type_id returns 0 aka VOID for the modifier.
      This in turn causes btf_type_id_size to return NULL, triggering EINVAL.
      
      To summarise, the following conditions are necessary:
      
      - VAR pointing at PTR, STRUCT, UNION or ARRAY
      - Followed by a VAR pointing at TYPEDEF, VOLATILE, CONST, RESTRICT or
        TYPE_TAG
      
      The fix is to reset resolve_mode to RESOLVE_TBD before attempting to
      resolve a VAR from a DATASEC.
      
      Fixes: 1dc92851 ("bpf: kernel side support for BTF Var and DataSec")
      Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
      Link: https://lore.kernel.org/r/20230306112138.155352-2-lmb@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES · 294635a8
      Alexander Lobakin authored
      &xdp_buff and &xdp_frame are bound in a way that
      
      xdp_buff->data_hard_start == xdp_frame
      
      It's always the case and e.g. xdp_convert_buff_to_frame() relies on
      this.
      IOW, the following:
      
      	for (u32 i = 0; i < 0xdead; i++) {
      		xdpf = xdp_convert_buff_to_frame(&xdp);
      		xdp_convert_frame_to_buff(xdpf, &xdp);
      	}
      
      shouldn't ever modify @xdpf's contents or the pointer itself.
      However, "live packet" code wrongly treats &xdp_frame as part of its
      context placed *before* the data_hard_start. With such flow,
      data_hard_start is sizeof(*xdpf) off to the right and no longer points
      to the XDP frame.
      
      Instead of replacing `sizeof(ctx)` with `offsetof(ctx, xdpf)` in several
      places and praying that there are no more miscalcs left somewhere in the
      code, unionize ::frm with ::data in a flex array, so that both start
      pointing to the actual data_hard_start and the XDP frame actually becomes
      a part of it, i.e. a part of the headroom, not the context.
      A nice side effect is that the maximum frame size for this mode gets
      increased by 40 bytes, as xdp_buff::frame_sz includes everything from
      data_hard_start (-> includes xdpf already) to the end of XDP/skb shared
      info.
      Also update %MAX_PKT_SIZE accordingly in the selftests code. Leave it
      hardcoded for 64 bit && 4k pages, it can be made more flexible later on.
      
      Minor: align `&head->data` with how `head->frm` is assigned for
      consistency.
      Minor #2: rename 'frm' to 'frame' in &xdp_page_head while at it for
      clarity.
      
      (was found while testing XDP traffic generator on ice, which calls
       xdp_convert_frame_to_buff() for each XDP frame)
      
      Fixes: b530e9e1 ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Link: https://lore.kernel.org/r/20230224163607.2994755-1-aleksander.lobakin@intel.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
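
Illustration (not part of the commit): a small layout sketch of the two arrangements the message contrasts, assuming made-up structures. `struct frame` is a hypothetical stand-in for the XDP frame and does not match the real `struct xdp_frame`; a fixed 64-byte array replaces the flex array so the example stays standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the XDP frame; the real layout differs. */
struct frame {
    void *data;
    unsigned int len;
    unsigned int headroom;
};

/* Old-style layout: the frame sits in the context, before the data area. */
struct head_ctx_frame {
    long ctx;
    struct frame frm;
    unsigned char data[64];
};

/* New-style layout: the frame overlays the start of the data area. */
struct head_union_frame {
    long ctx;
    union {
        struct frame frm;
        unsigned char data[64];
    } u;
};

/* Bytes between the frame and the start of the data area (old layout). */
size_t gap_ctx_frame(void)
{
    return offsetof(struct head_ctx_frame, data) -
           offsetof(struct head_ctx_frame, frm);
}

/* Bytes between the frame and the start of the data area (union layout). */
size_t gap_union_frame(void)
{
    return offsetof(struct head_union_frame, u.data) -
           offsetof(struct head_union_frame, u.frm);
}

/* The old layout is off by one frame; the union layout is not off at all. */
void check_layouts(void)
{
    assert(gap_ctx_frame() == sizeof(struct frame));
    assert(gap_union_frame() == 0);
}
```

In the first layout the data area begins `sizeof(struct frame)` bytes after the frame, which mirrors the "off to the right" offset in the message. In the union layout both members share one address, so the frame lives inside the data area.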
    • bpf, doc: Link to submitting-patches.rst for general patch submission info · b7abcd9c
      Bagas Sanjaya authored
      The link for patch submission information in general refers to index
      page for "Working with the kernel development community" section of
      kernel docs, whereas the link should have been
      Documentation/process/submitting-patches.rst instead.
      
      Fix it by replacing the index target with the appropriate doc.
      
      Fixes: 54222838 ("bpf, doc: convert bpf_devel_QA.rst to use RST formatting")
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230228074523.11493-3-bagasdotme@gmail.com
    • bpf, doc: Do not link to docs.kernel.org for kselftest link · 32db18d6
      Bagas Sanjaya authored
      The question on how to run BPF selftests has a reference link to the kernel
      selftest documentation (Documentation/dev-tools/kselftest.rst). However,
      it uses an external link to the documentation at kernel.org/docs (aka
      docs.kernel.org) instead, which requires Internet access.
      
      Fix this and replace the link with internal linking, by using :doc: directive
      while keeping the anchor text.
      
      Fixes: b7a27c3a ("bpf, doc: howto use/run the BPF selftests")
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230228074523.11493-2-bagasdotme@gmail.com
    • netfilter: tproxy: fix deadlock due to missing BH disable · 4a024267
      Florian Westphal authored
      The xtables packet traverser performs an unconditional local_bh_disable(),
      but the nf_tables evaluation loop does not.
      
      Functions that are called from either xtables or nftables must assume
      that they can be called in process context.
      
      inet_twsk_deschedule_put() assumes that no softirq interrupt can occur.
      If tproxy is used from nf_tables, it's possible that we'll deadlock
      trying to acquire a lock already held in process context.
      
      Add a small helper that takes care of this and use it.
      
      Link: https://lore.kernel.org/netfilter-devel/401bd6ed-314a-a196-1cdc-e13c720cc8f2@balasys.hu/
      Fixes: 4ed8eb65 ("netfilter: nf_tables: Add native tproxy support")
      Reported-and-tested-by: Major Dávid <major.david@balasys.hu>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ctnetlink: revert to dumping mark regardless of event type · 9f7dd42f
      Ivan Delalande authored
      It seems that change was unintentional, we have userspace code that
      needs the mark while listening for events like REPLY, DESTROY, etc.
      Also include 0-marks in requested dumps, as they were before that fix.
      
      Fixes: 1feeae07 ("netfilter: ctnetlink: fix compilation warning after data race fixes in ct mark")
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • bnxt_en: Fix the double free during device removal · 89b59a84
      Selvin Xavier authored
      The following warning is reported by KASAN during driver unload:
      
      ==================================================================
      BUG: KASAN: double-free in bnxt_remove_one+0x103/0x200 [bnxt_en]
      Free of addr ffff88814e8dd4c0 by task rmmod/17469
      CPU: 47 PID: 17469 Comm: rmmod Kdump: loaded Tainted: G S                 6.2.0-rc7+ #2
      Hardware name: Dell Inc. PowerEdge R740/01YM03, BIOS 2.3.10 08/15/2019
      Call Trace:
       <TASK>
       dump_stack_lvl+0x33/0x46
       print_report+0x17b/0x4b3
       ? __call_rcu_common.constprop.79+0x27e/0x8c0
       ? __pfx_free_object_rcu+0x10/0x10
       ? __virt_addr_valid+0xe3/0x160
       ? bnxt_remove_one+0x103/0x200 [bnxt_en]
       kasan_report_invalid_free+0x64/0xd0
       ? bnxt_remove_one+0x103/0x200 [bnxt_en]
       ? bnxt_remove_one+0x103/0x200 [bnxt_en]
       __kasan_slab_free+0x179/0x1c0
       ? bnxt_remove_one+0x103/0x200 [bnxt_en]
       __kmem_cache_free+0x194/0x350
       bnxt_remove_one+0x103/0x200 [bnxt_en]
       pci_device_remove+0x62/0x110
       device_release_driver_internal+0xf6/0x1c0
       driver_detach+0x76/0xe0
       bus_remove_driver+0x89/0x160
       pci_unregister_driver+0x26/0x110
       ? strncpy_from_user+0x188/0x1c0
       bnxt_exit+0xc/0x24 [bnxt_en]
       __x64_sys_delete_module+0x21f/0x390
       ? __pfx___x64_sys_delete_module+0x10/0x10
       ? __pfx_mem_cgroup_handle_over_high+0x10/0x10
       ? _raw_spin_lock+0x87/0xe0
       ? __pfx__raw_spin_lock+0x10/0x10
       ? __audit_syscall_entry+0x185/0x210
       ? ktime_get_coarse_real_ts64+0x51/0x80
       ? syscall_trace_enter.isra.18+0x126/0x1a0
       do_syscall_64+0x37/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      RIP: 0033:0x7effcb6fd71b
      Code: 73 01 c3 48 8b 0d 6d 17 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d 17 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffeada270b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
      RAX: ffffffffffffffda RBX: 00005623660e0750 RCX: 00007effcb6fd71b
      RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005623660e07b8
      RBP: 0000000000000000 R08: 00007ffeada26031 R09: 0000000000000000
      R10: 00007effcb771280 R11: 0000000000000206 R12: 00007ffeada272e0
      R13: 00007ffeada28bc4 R14: 00005623660e02a0 R15: 00005623660e0750
       </TASK>
      
      Auxiliary device structures are freed in bnxt_aux_dev_release. So avoid
      calling kfree from bnxt_remove_one.
      
      Also, set bp->edev to NULL before freeing the auxiliary private structure.
      
      Fixes: d80d88b0 ("bnxt_en: Add auxiliary driver support")
      Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
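
Illustration (not part of the commit): a userspace sketch of the ownership rule stated above, in which a release callback is the only code that frees the private structure and the remove path never frees it itself. `struct drv`, `struct aux_priv` and the function names are hypothetical. In the kernel the release callback runs when the last device reference is dropped, whereas this model simply calls it from the remove path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical private data owned by the auxiliary device. */
struct aux_priv {
    int id;
};

/* Hypothetical driver state. */
struct drv {
    struct aux_priv *aux_priv;
    int releases;   /* how many times the release callback freed memory */
};

/* Release callback: the single owner of the free. */
static void aux_release(struct drv *d)
{
    struct aux_priv *priv = d->aux_priv;

    assert(priv != NULL);
    d->aux_priv = NULL;     /* clear the pointer before freeing */
    free(priv);
    d->releases++;
}

/* Remove path: triggers the release but does not free anything itself. */
static void drv_remove(struct drv *d)
{
    if (d->aux_priv)
        aux_release(d);
    /* No free() here: the release callback already owns that job. */
}

/* Runs the remove path twice and reports how many frees happened. */
int demo_remove_twice(void)
{
    struct drv d = { NULL, 0 };

    d.aux_priv = malloc(sizeof(*d.aux_priv));
    if (!d.aux_priv)
        return -1;
    drv_remove(&d);
    drv_remove(&d);     /* no-op: the pointer was cleared on release */
    return d.releases;
}
```

`demo_remove_twice()` reports a single free even though the remove path ran twice, because the pointer is cleared before the memory is released.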
    • bnxt_en: Avoid order-5 memory allocation for TPA data · accd7e23
      Michael Chan authored
      The driver needs to keep track of all the possible concurrent TPA (GRO/LRO)
      completions on the aggregation ring.  On P5 chips, the maximum number
      of concurrent TPA is 256 and the amount of memory we allocate is order-5
      on systems using 4K pages.  Memory allocation failure has been reported:
      
      NetworkManager: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0-1
      CPU: 15 PID: 2995 Comm: NetworkManager Kdump: loaded Not tainted 5.10.156 #1
      Hardware name: Dell Inc. PowerEdge R660/0M1CC5, BIOS 0.2.25 08/12/2022
      Call Trace:
       dump_stack+0x57/0x6e
       warn_alloc.cold.120+0x7b/0xdd
       ? _cond_resched+0x15/0x30
       ? __alloc_pages_direct_compact+0x15f/0x170
       __alloc_pages_slowpath.constprop.108+0xc58/0xc70
       __alloc_pages_nodemask+0x2d0/0x300
       kmalloc_order+0x24/0xe0
       kmalloc_order_trace+0x19/0x80
       bnxt_alloc_mem+0x1150/0x15c0 [bnxt_en]
       ? bnxt_get_func_stat_ctxs+0x13/0x60 [bnxt_en]
       __bnxt_open_nic+0x12e/0x780 [bnxt_en]
       bnxt_open+0x10b/0x240 [bnxt_en]
       __dev_open+0xe9/0x180
       __dev_change_flags+0x1af/0x220
       dev_change_flags+0x21/0x60
       do_setlink+0x35c/0x1100
      
      Instead of allocating this big chunk of memory and dividing it up for the
      concurrent TPA instances, allocate each small chunk separately for each
      TPA instance.  This will reduce it to order-0 allocations.
      
      Fixes: 79632e9b ("bnxt_en: Expand bnxt_tpa_info struct to support 57500 chips.")
      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
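
Illustration (not part of the commit): a worked example of the page-order arithmetic behind the message above, assuming 4K pages. The helper mimics the rounding done by the kernel's get_order() but is written from scratch here. The 512-byte per-instance size used in the note below is a placeholder chosen so that 256 instances land on order 5; the commit does not state the real structure size.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u   /* assumption: 4K pages, as in the message */

/* Smallest order such that (PAGE_SIZE_BYTES << order) >= size. */
unsigned int alloc_order(size_t size)
{
    unsigned int order = 0;
    size_t span = PAGE_SIZE_BYTES;

    assert(size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under the placeholder size, one combined allocation needs `alloc_order(256 * 512)`, which is order 5 (32 contiguous pages), while each separate instance needs `alloc_order(512)`, which is order 0 (a single page at most).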
    • net: phylib: get rid of unnecessary locking · f4b47a2e
      Russell King (Oracle) authored
      The locking in phy_probe() and phy_remove() does very little to prevent
      any races with e.g. phy_attach_direct(), but instead causes lockdep ABBA
      warnings. Remove it.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.2.0-dirty #1108 Tainted: G        W   E
      ------------------------------------------------------
      ip/415 is trying to acquire lock:
      ffff5c268f81ef50 (&dev->lock){+.+.}-{3:3}, at: phy_attach_direct+0x17c/0x3a0 [libphy]
      
      but task is already holding lock:
      ffffaef6496cb518 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x154/0x560
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (rtnl_mutex){+.+.}-{3:3}:
             __lock_acquire+0x35c/0x6c0
             lock_acquire.part.0+0xcc/0x220
             lock_acquire+0x68/0x84
             __mutex_lock+0x8c/0x414
             mutex_lock_nested+0x34/0x40
             rtnl_lock+0x24/0x30
             sfp_bus_add_upstream+0x34/0x150
             phy_sfp_probe+0x4c/0x94 [libphy]
             mv3310_probe+0x148/0x184 [marvell10g]
             phy_probe+0x8c/0x200 [libphy]
             call_driver_probe+0xbc/0x15c
             really_probe+0xc0/0x320
             __driver_probe_device+0x84/0x120
             driver_probe_device+0x44/0x120
             __device_attach_driver+0xc4/0x160
             bus_for_each_drv+0x80/0xe0
             __device_attach+0xb0/0x1f0
             device_initial_probe+0x1c/0x2c
             bus_probe_device+0xa4/0xb0
             device_add+0x360/0x53c
             phy_device_register+0x60/0xa4 [libphy]
             fwnode_mdiobus_phy_device_register+0xc0/0x190 [fwnode_mdio]
             fwnode_mdiobus_register_phy+0x160/0xd80 [fwnode_mdio]
             of_mdiobus_register+0x140/0x340 [of_mdio]
             orion_mdio_probe+0x298/0x3c0 [mvmdio]
             platform_probe+0x70/0xe0
             call_driver_probe+0x34/0x15c
             really_probe+0xc0/0x320
             __driver_probe_device+0x84/0x120
             driver_probe_device+0x44/0x120
             __driver_attach+0x104/0x210
             bus_for_each_dev+0x78/0xdc
             driver_attach+0x2c/0x3c
             bus_add_driver+0x184/0x240
             driver_register+0x80/0x13c
             __platform_driver_register+0x30/0x3c
             xt_compat_calc_jump+0x28/0xa4 [x_tables]
             do_one_initcall+0x50/0x1b0
             do_init_module+0x50/0x1fc
             load_module+0x684/0x744
             __do_sys_finit_module+0xc4/0x140
             __arm64_sys_finit_module+0x28/0x34
             invoke_syscall+0x50/0x120
             el0_svc_common.constprop.0+0x6c/0x1b0
             do_el0_svc+0x34/0x44
             el0_svc+0x48/0xf0
             el0t_64_sync_handler+0xb8/0xc0
             el0t_64_sync+0x1a0/0x1a4
      
      -> #0 (&dev->lock){+.+.}-{3:3}:
             check_prev_add+0xb4/0xc80
             validate_chain+0x414/0x47c
             __lock_acquire+0x35c/0x6c0
             lock_acquire.part.0+0xcc/0x220
             lock_acquire+0x68/0x84
             __mutex_lock+0x8c/0x414
             mutex_lock_nested+0x34/0x40
             phy_attach_direct+0x17c/0x3a0 [libphy]
             phylink_fwnode_phy_connect.part.0+0x70/0xe4 [phylink]
             phylink_fwnode_phy_connect+0x48/0x60 [phylink]
             mvpp2_open+0xec/0x2e0 [mvpp2]
             __dev_open+0x104/0x214
             __dev_change_flags+0x1d4/0x254
             dev_change_flags+0x2c/0x7c
             do_setlink+0x254/0xa50
             __rtnl_newlink+0x430/0x514
             rtnl_newlink+0x58/0x8c
             rtnetlink_rcv_msg+0x17c/0x560
             netlink_rcv_skb+0x64/0x150
             rtnetlink_rcv+0x20/0x30
             netlink_unicast+0x1d4/0x2b4
             netlink_sendmsg+0x1a4/0x400
             ____sys_sendmsg+0x228/0x290
             ___sys_sendmsg+0x88/0xec
             __sys_sendmsg+0x70/0xd0
             __arm64_sys_sendmsg+0x2c/0x40
             invoke_syscall+0x50/0x120
             el0_svc_common.constprop.0+0x6c/0x1b0
             do_el0_svc+0x34/0x44
             el0_svc+0x48/0xf0
             el0t_64_sync_handler+0xb8/0xc0
             el0t_64_sync+0x1a0/0x1a4
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(rtnl_mutex);
                                     lock(&dev->lock);
                                     lock(rtnl_mutex);
        lock(&dev->lock);
      
       *** DEADLOCK ***
      
      Fixes: 298e54fa ("net: phy: add core phylib sfp support")
      Reported-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: add to set device wake up flag when stmmac init phy · a9334b70
      Rongguang Wei authored
      When the MAC does not support PMT, the driver checks the PHY's WoL
      capability and sets the device wakeup capability in stmmac_init_phy().
      If WoL is then enabled through ethtool, the driver sets the device
      wakeup flag and device_may_wakeup() returns true.

      But if the PHY's WoL capability is enabled directly by some other means,
      for example by the BIOS, the driver does not know about it and does not
      set the device wakeup flag. phy_suspend() may then fail like this:
      
      [   32.409063] PM: dpm_run_callback(): mdio_bus_phy_suspend+0x0/0x50 returns -16
      [   32.409065] PM: Device stmmac-1:00 failed to suspend: error -16
      [   32.409067] PM: Some devices failed to suspend, or early wake event detected
      
      Fix the error in this scenario by setting the device wakeup enable flag
      according to the result of the PHY's get_wol function.
      
      v2: add a Fixes tag.
      
      Fixes: 1d8e5b0f ("net: stmmac: Support WOL with phy")
      Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9334b70
  5. 03 Mar, 2023 10 commits
    • Liu Jian's avatar
      bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser() · d900f3d2
      Liu Jian authored
      When the buffer length of the recvmsg system call is 0, we get the
      following soft lockup:
      
      watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149]
      CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:remove_wait_queue+0xb/0xc0
      Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20
      RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246
      RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768
      RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040
      RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7
      R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800
      R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0
      FS:  00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_msg_wait_data+0x279/0x2f0
       tcp_bpf_recvmsg_parser+0x3c6/0x490
       inet_recvmsg+0x280/0x290
       sock_recvmsg+0xfc/0x120
       ____sys_recvmsg+0x160/0x3d0
       ___sys_recvmsg+0xf0/0x180
       __sys_recvmsg+0xea/0x1a0
       do_syscall_64+0x3f/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      The logic in tcp_bpf_recvmsg_parser is as follows:
      
      msg_bytes_ready:
      	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
      	if (!copied) {
      		wait data;
      		goto msg_bytes_ready;
      	}
      
      In this case, "copied" is always 0, so an infinite loop occurs.
      
      According to the Linux system call man page, 0 should be returned in this
      case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, directly
      return. Also modify several other functions with the same problem.
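
The loop described above can be modelled in user space. This is only a minimal sketch, not the kernel code: `fake_recvmsg()`, `recv_loop()`, the `guard_zero_len` switch and the retry budget are all hypothetical names introduced for illustration. The retry budget stands in for the soft lockup, since the real loop has no such limit.

```c
#include <stddef.h>

/* Hypothetical stand-in for sk_msg_recvmsg(): copies at most len bytes. */
static size_t fake_recvmsg(size_t available, size_t len)
{
	return available < len ? available : len;
}

/*
 * Model of the "msg_bytes_ready" retry loop.
 * Returns the number of bytes copied, or -1 when the retry budget is
 * exhausted (the model's substitute for spinning forever).
 */
static long recv_loop(size_t available, size_t len, int guard_zero_len,
		      int max_retries)
{
	size_t copied;
	int tries = 0;

	/* The idea of the fix: a zero-length request returns 0 at once. */
	if (guard_zero_len && len == 0)
		return 0;

	for (;;) {
		copied = fake_recvmsg(available, len);
		if (copied)
			return (long)copied;
		/* "wait data" would go here; with len == 0 it cannot help. */
		if (++tries >= max_retries)
			return -1;
	}
}
```

With `len == 0` and the guard disabled, `recv_loop()` never sees a non-zero `copied` and runs until the budget is used up; with the guard enabled it returns 0 immediately, which is the behaviour the commit message asks for.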
      
      Fixes: 1f5be6b3 ("udp: Implement udp_bpf_recvmsg() for sockmap")
      Fixes: 9825d866 ("af_unix: Implement unix_dgram_bpf_recvmsg()")
      Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230303080946.1146638-1-liujian56@huawei.com
      d900f3d2
    • David S. Miller's avatar
      Merge branch 'nfp-ipsec-csum' · 52812526
      David S. Miller authored
      Simon Horman says:
      
      ====================
      nfp: fix incorrect IPsec checksum handling
      
      This short series resolves two problems with IPsec checksum handling
      in the nfp driver.
      
      * Patch 1/3, 2/3: Correct setting of checksum flags.
        One patch for each of the nfd3 and nfdk datapaths.
      
      * Patch 3/3: Correct configuration of NETIF_F_CSUM_MASK
        so that the stack does not unnecessarily calculate csums for
        IPsec offload packets.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52812526
    • Huanhuan Wang's avatar
      nfp: fix esp-tx-csum-offload doesn't take effect · 1cf78d4c
      Huanhuan Wang authored
      When esp-tx-csum-offload is set to on, the protocol stack shouldn't
      calculate the IPsec offload packet's csum, but it does, because the
      `.ndo_features_check` callback incorrectly masked the
      NETIF_F_CSUM_MASK bits.
      
      Fixes: 57f273ad ("nfp: add framework to support ipsec offloading")
      Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1cf78d4c
    • Huanhuan Wang's avatar
      nfp: fix incorrectly set csum flag for nfdk path · 8b46168c
      Huanhuan Wang authored
      The csum flag of an IPsec packet is set repeatedly. Therefore, csum
      flag setting for IPsec and non-IPsec packets needs to be distinguished.
      
      As the IPv6 header does not have a csum field, the l3-csum flag does
      not need to be set in the IPv6 case.
      
      Fixes: 436396f2 ("nfp: support IPsec offloading for NFP3800")
      Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b46168c
    • Huanhuan Wang's avatar
      nfp: fix incorrectly set csum flag for nfd3 path · 3e04419c
      Huanhuan Wang authored
      The csum flag of an IPsec packet is set repeatedly. Therefore, csum
      flag setting for IPsec and non-IPsec packets needs to be distinguished.
      
      As the IPv6 header does not have a csum field, the l3-csum flag does
      not need to be set in the IPv6 case.
      
      The l4-csum flag covers both the TCP csum flag and the UDP csum flag.
      We shouldn't set both for one packet; the l4-csum flag should be set
      according to whether the transport layer is TCP or UDP.
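
The flag-selection rules described in this message can be sketched as a small pure function. This is an illustration only: the `SK_CSUM_*` bits, `enum sk_l4` and `sk_csum_flags()` are hypothetical names, not the nfp driver's descriptor flags or functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the driver's real descriptor flags differ. */
#define SK_CSUM_L3  (1u << 0)
#define SK_CSUM_TCP (1u << 1)
#define SK_CSUM_UDP (1u << 2)

enum sk_l4 { SK_L4_TCP, SK_L4_UDP, SK_L4_OTHER };

/* Pick checksum-offload flags for one packet. */
static uint32_t sk_csum_flags(bool is_ipv6, enum sk_l4 l4)
{
	uint32_t flags = 0;

	/* The IPv6 header has no checksum field, so l3-csum is IPv4 only. */
	if (!is_ipv6)
		flags |= SK_CSUM_L3;

	/* Set at most one l4 flag, chosen by the transport protocol. */
	if (l4 == SK_L4_TCP)
		flags |= SK_CSUM_TCP;
	else if (l4 == SK_L4_UDP)
		flags |= SK_CSUM_UDP;

	return flags;
}
```

For example, `sk_csum_flags(true, SK_L4_UDP)` yields only the UDP bit, and no call can return both l4 bits at once.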
      
      Fixes: 57f273ad ("nfp: add framework to support ipsec offloading")
      Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e04419c
    • Petr Oros's avatar
      ice: copy last block omitted in ice_get_module_eeprom() · 84cba184
      Petr Oros authored
      ice_get_module_eeprom() has been broken since commit e9c9692c ("ice:
      Reimplement module reads used by ethtool"). In that refactor,
      ice_get_module_eeprom() reads the EEPROM in blocks of size 8.
      But the condition that should protect against buffer overflow
      ignores the last block, so the last block always contains zeros.
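
A user-space model of this kind of off-by-one is sketched below. The guard in `copy_blocks_strict()` is an assumed form chosen because it leaves the last block zeroed; it is not quoted from the driver. `copy_blocks_clamped()` shows one possible way to copy the tail safely and is likewise hypothetical, as are all names here.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8

/*
 * Assumed buggy variant: a block is copied only when it ends strictly
 * before len, so the final block is always skipped and stays zero.
 */
static void copy_blocks_strict(unsigned char *dst, const unsigned char *src,
			       size_t len)
{
	size_t i;

	memset(dst, 0, len);
	for (i = 0; i < len; i += BLOCK_SIZE)
		if (i + BLOCK_SIZE < len)
			memcpy(dst + i, src + i, BLOCK_SIZE);
}

/* One possible repair: clamp the copy size to the bytes that remain. */
static void copy_blocks_clamped(unsigned char *dst, const unsigned char *src,
				size_t len)
{
	size_t i, n;

	memset(dst, 0, len);
	for (i = 0; i < len; i += BLOCK_SIZE) {
		n = len - i < BLOCK_SIZE ? len - i : BLOCK_SIZE;
		memcpy(dst + i, src + i, n);
	}
}
```

With a request of length 8 the strict variant copies nothing, and with length 12 it copies the first 8 bytes and leaves the last 4 as zeros, while the clamped variant returns every requested byte and never writes past `len`.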
      
      The bug was uncovered by ethtool upstream commit 9538f384b535
      ("netlink: eeprom: Defer page requests to individual parsers").
      After that commit, ethtool reads a block with length = 1
      to read the SFF-8024 identifier value.
      
      unpatched driver:
      $ ethtool -m enp65s0f0np0 offset 0x90 length 8
      Offset          Values
      ------          ------
      0x0090:         00 00 00 00 00 00 00 00
      $ ethtool -m enp65s0f0np0 offset 0x90 length 12
      Offset          Values
      ------          ------
      0x0090:         00 00 01 a0 4d 65 6c 6c 00 00 00 00
      $
      
      $ ethtool -m enp65s0f0np0
      Offset          Values
      ------          ------
      0x0000:         11 06 06 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0010:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0020:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0030:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0040:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0050:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0060:         00 00 00 00 00 00 00 00 00 00 00 00 00 01 08 00
      0x0070:         00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      patched driver:
      $ ethtool -m enp65s0f0np0 offset 0x90 length 8
      Offset          Values
      ------          ------
      0x0090:         00 00 01 a0 4d 65 6c 6c
      $ ethtool -m enp65s0f0np0 offset 0x90 length 12
      Offset          Values
      ------          ------
      0x0090:         00 00 01 a0 4d 65 6c 6c 61 6e 6f 78
      $ ethtool -m enp65s0f0np0
          Identifier                                : 0x11 (QSFP28)
          Extended identifier                       : 0x00
          Extended identifier description           : 1.5W max. Power consumption
          Extended identifier description           : No CDR in TX, No CDR in RX
          Extended identifier description           : High Power Class (> 3.5 W) not enabled
          Connector                                 : 0x23 (No separable connector)
          Transceiver codes                         : 0x88 0x00 0x00 0x00 0x00 0x00 0x00 0x00
          Transceiver type                          : 40G Ethernet: 40G Base-CR4
          Transceiver type                          : 25G Ethernet: 25G Base-CR CA-N
          Encoding                                  : 0x05 (64B/66B)
          BR, Nominal                               : 25500Mbps
          Rate identifier                           : 0x00
          Length (SMF,km)                           : 0km
          Length (OM3 50um)                         : 0m
          Length (OM2 50um)                         : 0m
          Length (OM1 62.5um)                       : 0m
          Length (Copper or Active cable)           : 1m
          Transmitter technology                    : 0xa0 (Copper cable unequalized)
          Attenuation at 2.5GHz                     : 4db
          Attenuation at 5.0GHz                     : 5db
          Attenuation at 7.0GHz                     : 7db
          Attenuation at 12.9GHz                    : 10db
          ........
          ....
      
      Fixes: e9c9692c ("ice: Reimplement module reads used by ethtool")
      Signed-off-by: Petr Oros <poros@redhat.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84cba184
    • David S. Miller's avatar
      Merge branch 'net-tools-ynl-fixes' · 8f632a0a
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      tools: ynl: fix subset use and change default value for attrs/ops
      
      Fix a problem in subsetting, which will become apparent when
      the devlink family comes after the merge window. Even though none
      of the existing families need this, we don't want someone to
      get "inspired" by the current, incorrect code when using specs
      in other languages.
      
      Change the default value for the first attr/op. This is a slight
      behavior change so needs to go in now. The diffstat of the last
      patch should serve as the clearest justification there.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f632a0a
    • Jakub Kicinski's avatar
      netlink: specs: update for codegen enumerating from 1 · bcec7171
      Jakub Kicinski authored
      Now that the codegen rules have been changed we can update
      the specs to reflect the new default.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bcec7171
    • Jakub Kicinski's avatar
      tools: ynl: use 1 as the default for first entry in attrs/ops · ad4fafcd
      Jakub Kicinski authored
      Pretty much all families use value: 1 or reserve as unspec
      the first entry in attribute set and the first operation.
      Make this the default. Update documentation (the doc for
      values of operations just refers back to doc for attrs
      so updating only attrs).
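
The new default can be illustrated with a small numbering routine. This is a sketch, not the ynl generator: `struct sk_attr` and `sk_assign_values()` are hypothetical, and it is assumed here that an entry without an explicit value takes the previous entry's value plus one, with 1 for the first entry.

```c
#include <stddef.h>

/* Hypothetical model of one attribute (or operation) in a spec. */
struct sk_attr {
	const char *name;
	int has_value;       /* non-zero if the spec gives an explicit value */
	unsigned int value;  /* explicit value, or filled in by the routine */
};

/* Fill in missing values: the first entry defaults to 1, not 0. */
static void sk_assign_values(struct sk_attr *attrs, size_t n)
{
	unsigned int next = 1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!attrs[i].has_value)
			attrs[i].value = next;
		/* Assumption: numbering continues from the last value used. */
		next = attrs[i].value + 1;
	}
}
```

A spec that still wants an "unspec" entry at 0 would give that first entry an explicit value, and the entries after it would continue from 1.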
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad4fafcd
    • Jakub Kicinski's avatar
      tools: ynl: fully inherit attrs in subsets · 7cf93538
      Jakub Kicinski authored
      To avoid having to repeat the entire definition of an attribute
      (including the value) use the Attr object from the original set.
      In fact this is already the documented expectation.
      
      Fixes: be5bea1c ("net: add basic C code generators for Netlink")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7cf93538