1. 19 Jan, 2022 10 commits
    • net: axienet: Wait for PhyRstCmplt after core reset · b400c2f4
      Robert Hancock authored
      When resetting the device, wait for the PhyRstCmplt bit to be set
      in the interrupt status register before continuing initialization, to
      ensure that the core is actually ready. When using an external PHY, this
      also ensures we do not start trying to access the PHY while it is still
      in reset. The PHY reset is initiated by the core reset which is
      triggered just above, but remains asserted for 5ms after the core is
      reset according to the documentation.
      
      The MgtRdy bit could also be waited for, but unfortunately when using
      7-series devices, the bit does not appear to work as documented (it
      seems to behave as some sort of link state indication and not just an
      indication the transceiver is ready) so it can't really be relied on for
      this purpose.
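
The wait described above can be sketched as a bounded polling loop. This is a minimal user-space illustration, not the driver code: the simulated core, the bit position `PHY_RST_CMPLT_MASK`, and the function names are assumptions made for the example, and the sleep is replaced by advancing a simulated clock.

```c
#include <stdint.h>

/* Hypothetical bit position; the real mask is defined by the driver headers. */
#define PHY_RST_CMPLT_MASK (1u << 7)

/* Simulated core: the status bit becomes set once now_us >= ready_after_us. */
struct fake_core {
    uint32_t now_us;
    uint32_t ready_after_us;
};

static uint32_t read_irq_status(const struct fake_core *c)
{
    return c->now_us >= c->ready_after_us ? PHY_RST_CMPLT_MASK : 0;
}

/*
 * Poll every step_us until the bit is set or timeout_us has elapsed.
 * Returns 0 on success and -1 on timeout. step_us must be non-zero.
 */
static int wait_phy_reset_done(struct fake_core *c, uint32_t step_us,
                               uint32_t timeout_us)
{
    uint32_t waited = 0;

    for (;;) {
        if (read_irq_status(c) & PHY_RST_CMPLT_MASK)
            return 0;
        if (waited >= timeout_us)
            return -1;
        c->now_us += step_us;   /* stands in for a sleep between reads */
        waited += step_us;
    }
}
```

With a 5 ms PHY reset hold time, a timeout comfortably above 5 ms succeeds, while a 1 ms timeout gives up before the bit is set.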
      
      Fixes: 8a3b7a25 ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: axienet: increase reset timeout · 2e5644b1
      Robert Hancock authored
      The previous timeout of 1ms was too short to handle some cases where the
      core is reset just after the input clocks were started, which will
      be introduced in an upcoming patch. Increase the timeout to 50ms. Also
      simplify the reset timeout checking to use read_poll_timeout.
      
      Fixes: 8a3b7a25 ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 99845220
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-01-19
      
      We've added 12 non-merge commits during the last 8 day(s) which contain
      a total of 12 files changed, 262 insertions(+), 64 deletions(-).
      
      The main changes are:
      
      1) Various verifier fixes mainly around register offset handling when
         passed to helper functions, from Daniel Borkmann.
      
      2) Fix XDP BPF link handling to assert program type,
         from Toke Høiland-Jørgensen.
      
      3) Fix regression in mount parameter handling for BPF fs,
         from Yafang Shao.
      
      4) Fix incorrect integer literal when marking scratched stack slots
         in verifier, from Christy Lee.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf, selftests: Add ringbuf memory type confusion test
        bpf, selftests: Add various ringbuf tests with invalid offset
        bpf: Fix ringbuf memory type confusion when passing to helpers
        bpf: Fix out of bounds access for ringbuf helpers
        bpf: Generally fix helper register offset check
        bpf: Mark PTR_TO_FUNC register initially with zero offset
        bpf: Generalize check_ctx_reg for reuse with other types
        bpf: Fix incorrect integer literal used for marking scratched stack.
        bpf/selftests: Add check for updating XDP bpf_link with wrong program type
        bpf/selftests: convert xdp_link test to ASSERT_* macros
        xdp: check prog type before updating BPF link
        bpf: Fix mount source show for bpffs
      ====================
      
      Link: https://lore.kernel.org/r/20220119011825.9082-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bpf, selftests: Add ringbuf memory type confusion test · 37c8d480
      Daniel Borkmann authored
      Add two tests: one which asserts that ring buffer memory can be passed to
      other helpers for populating its entry area, and another where the verifier
      rejects a different type of memory passed to bpf_ringbuf_submit().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, selftests: Add various ringbuf tests with invalid offset · 722e4db3
      Daniel Borkmann authored
      Assert that the verifier is rejecting invalid offsets on the ringbuf entries:
      
        # ./test_verifier | grep ring
        #947/u ringbuf: invalid reservation offset 1 OK
        #947/p ringbuf: invalid reservation offset 1 OK
        #948/u ringbuf: invalid reservation offset 2 OK
        #948/p ringbuf: invalid reservation offset 2 OK
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix ringbuf memory type confusion when passing to helpers · a672b2e3
      Daniel Borkmann authored
      The bpf_ringbuf_submit() and bpf_ringbuf_discard() have ARG_PTR_TO_ALLOC_MEM
      in their bpf_func_proto definition as their first argument, and thus both expect
      the result from a prior bpf_ringbuf_reserve() call which has a return type of
      RET_PTR_TO_ALLOC_MEM_OR_NULL.
      
      While the non-NULL memory from bpf_ringbuf_reserve() can be passed to other
      helpers, the two sinks (bpf_ringbuf_submit(), bpf_ringbuf_discard()) right now
      only enforce a register type of PTR_TO_MEM.
      
      This can lead to potential type confusion since it would allow other PTR_TO_MEM
      memory to be passed into the two sinks which did not come from bpf_ringbuf_reserve().
      
      Add a new MEM_ALLOC composable type attribute for PTR_TO_MEM, and enforce that:
      
       - bpf_ringbuf_reserve() returns NULL or PTR_TO_MEM | MEM_ALLOC
       - bpf_ringbuf_submit() and bpf_ringbuf_discard() only take PTR_TO_MEM | MEM_ALLOC
         but not plain PTR_TO_MEM arguments via ARG_PTR_TO_ALLOC_MEM
       - however, other helpers might treat PTR_TO_MEM | MEM_ALLOC as plain PTR_TO_MEM
         to populate the memory area when they use ARG_PTR_TO_{UNINIT_,}MEM in their
         func proto description
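
The three rules above can be illustrated with a small flag-based type check. This is a sketch under assumptions: the numeric encoding, the enum values, and the two checker functions are made up for illustration and do not reproduce the verifier's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative encoding: base type in the low bits, attribute flags above. */
enum base_type { SCALAR_VALUE = 1, PTR_TO_MEM = 2 };
#define MEM_ALLOC (1u << 8)

/* Sinks such as submit/discard accept only PTR_TO_MEM | MEM_ALLOC. */
static bool ok_for_alloc_mem_arg(uint32_t reg_type)
{
    return reg_type == (PTR_TO_MEM | MEM_ALLOC);
}

/* Helpers taking plain memory treat MEM_ALLOC memory as PTR_TO_MEM. */
static bool ok_for_mem_arg(uint32_t reg_type)
{
    return (reg_type & ~MEM_ALLOC) == PTR_TO_MEM;
}
```

Making the attribute a separate bit is what lets the sinks demand it while every other consumer of PTR_TO_MEM can mask it off.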
      
      Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix out of bounds access for ringbuf helpers · 64620e0a
      Daniel Borkmann authored
      Both bpf_ringbuf_submit() and bpf_ringbuf_discard() have ARG_PTR_TO_ALLOC_MEM
      in their bpf_func_proto definition as their first argument. They both expect
      the result from a prior bpf_ringbuf_reserve() call which has a return type of
      RET_PTR_TO_ALLOC_MEM_OR_NULL.
      
      Meaning, after a NULL check in the code, the verifier will promote the register
      type in the non-NULL branch to a PTR_TO_MEM and in the NULL branch to a known
      zero scalar. Generally, pointer arithmetic on PTR_TO_MEM is allowed, so the
      PTR_TO_MEM register could carry an offset.
      
      The ARG_PTR_TO_ALLOC_MEM expects a PTR_TO_MEM register type. However, the
      non-zero result from bpf_ringbuf_reserve() must be fed into either
      bpf_ringbuf_submit() or bpf_ringbuf_discard() with its original offset, given
      that the helper will then read out the struct bpf_ringbuf_hdr mapping.
      
      The verifier failed to enforce a zero offset, so an out-of-bounds access
      can be triggered, which could be used to escalate privileges if unprivileged
      BPF is enabled (it is disabled by default in the kernel).
      
      Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: <tr3e.wang@gmail.com> (SecCoder Security Lab)
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Generally fix helper register offset check · 6788ab23
      Daniel Borkmann authored
      Right now the assertion on check_ptr_off_reg() is only enforced for register
      types PTR_TO_CTX (and open coded also for PTR_TO_BTF_ID), however, this is
      insufficient since many other PTR_TO_* register types such as PTR_TO_FUNC do
      not handle/expect register offsets when passed to helper functions.
      
      Given this can easily slip through when adding new types, make this an explicit
      allow-list and reject all other current and future types by default.
      
      Also, extend check_ptr_off_reg() to handle PTR_TO_BTF_ID as well instead of
      duplicating it. For PTR_TO_BTF_ID, reg->off is used for BTF to match expected
      BTF ids if struct offset is used. This part still needs to be allowed, but the
      dynamic off from the tnum must be rejected.
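
The allow-list idea can be sketched as a switch whose default branch enforces a zero offset. The type names, the `reg_state` layout, and the set of allowed types below are illustrative assumptions, not the verifier's real list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative subset of register types; PTR_TO_FUTURE_TYPE is hypothetical. */
enum reg_type {
    PTR_TO_CTX,
    PTR_TO_FUNC,
    PTR_TO_STACK,
    PTR_TO_MAP_VALUE,
    PTR_TO_FUTURE_TYPE,
};

struct reg_state {
    enum reg_type type;
    int32_t off;            /* fixed offset */
    bool var_off_is_zero;   /* true if the variable part is known to be zero */
};

/* Only types listed here may carry an offset into a helper call. */
static bool offset_allowed(enum reg_type t)
{
    switch (t) {
    case PTR_TO_STACK:
    case PTR_TO_MAP_VALUE:
        return true;
    default:
        return false;   /* current and future types are rejected by default */
    }
}

/* Returns 0 if the register may be passed to a helper, -1 otherwise. */
static int check_helper_arg_offset(const struct reg_state *r)
{
    if (offset_allowed(r->type))
        return 0;
    if (r->off != 0 || !r->var_off_is_zero)
        return -1;
    return 0;
}
```

Because rejection is the default, a newly added type that nobody remembered to audit fails closed rather than open.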
      
      Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
      Fixes: eaa6bcb7 ("bpf: Introduce bpf_per_cpu_ptr()")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Mark PTR_TO_FUNC register initially with zero offset · d400a6cf
      Daniel Borkmann authored
      As with other pointer types where we use ldimm64, clear the register
      content to zero first, and then populate the PTR_TO_FUNC type and subprogno
      number. Currently this is not done, which leads to reuse of stale register
      tracking data.
      
      Given that we always clear the register offset for the special ldimm64 cases,
      make it common for all cases so it won't be forgotten in the future.
      
      Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Generalize check_ctx_reg for reuse with other types · be80a1d3
      Daniel Borkmann authored
      Generalize the check_ctx_reg() helper function into a more generically named
      one so that it can be reused for other register types as well to check whether
      their offset is non-zero. No functional change.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  2. 18 Jan, 2022 1 commit
    • netns: add schedule point in ops_exit_list() · 2836615a
      Eric Dumazet authored
      When under stress, cleanup_net() may have to dismantle
      a large number of netns. ops_exit_list() currently calls
      many helpers [1] that have no schedule point, and we can
      end up with soft lockups, particularly on hosts
      with many cpus.
      
      Even for a moderate number of netns processed by cleanup_net(),
      this patch avoids latency spikes.
      
      [1] Some of these helpers, like fib_sync_up() and fib_sync_down_dev(),
      are very slow because net/ipv4/fib_semantics.c uses host-wide hash tables,
      and ifindex is used as the only input of two hash functions.
      ifindexes tend to be the same for all netns (lo.ifindex==1 per instance).
      This will be fixed in a separate patch.
      
      Fixes: 72ad937a ("net: Add support for batching network namespace cleanups")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 Jan, 2022 5 commits
  4. 16 Jan, 2022 3 commits
    • bonding: Fix extraction of ports from the packet headers · 429e3d12
      Moshe Tal authored
      A wrong hash sends a single stream to multiple output interfaces.
      
      The offset calculation was relative to skb->head, fix it to be relative
      to skb->data.
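
The difference between the two base pointers can be shown with a toy buffer. This is a minimal sketch: `struct fake_skb`, the headroom value, and the helper names are assumptions for illustration and only mimic the head/data distinction of a real socket buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy buffer: head is the start of the allocation, data the start of the packet. */
struct fake_skb {
    uint8_t buf[64];
    uint8_t *head;
    uint8_t *data;
};

static void fake_skb_init(struct fake_skb *skb, size_t headroom)
{
    memset(skb->buf, 0, sizeof(skb->buf));
    skb->head = skb->buf;
    skb->data = skb->buf + headroom;
}

/* Read a 16-bit big-endian value at off bytes from base. */
static uint16_t read_be16(const uint8_t *base, size_t off)
{
    return (uint16_t)((base[off] << 8) | base[off + 1]);
}

/* Correct extraction: the L4 offset is relative to the packet start (data). */
static uint16_t src_port(const struct fake_skb *skb, size_t l4_off)
{
    return read_be16(skb->data, l4_off);
}
```

Whenever the headroom is non-zero, applying the same offset to `head` lands on unrelated bytes, so the hash input no longer identifies the flow.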
      
      Fixes: a815bde5 ("net, bonding: Refactor bond_xmit_hash for use with
      xdp_buff")
      Reviewed-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Moshe Tal <moshet@nvidia.com>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Fix hung_task when removing SMC-R devices · 56d99e81
      Wen Gu authored
      A hung_task is observed when removing SMC-R devices. Suppose that
      a link group has two active links (lnk_A, lnk_B) associated with two
      different SMC-R devices (dev_A, dev_B). When dev_A is removed, the
      link group will be removed from smc_lgr_list and added to
      lgr_linkdown_list. lnk_A will be cleared and smcibdev(A)->lnk_cnt
      will reach zero. However, when dev_B is subsequently removed, the link
      group can't be found in smc_lgr_list and lnk_B won't be cleared,
      so smcibdev->lnk_cnt never reaches zero, which causes a hung_task.
      
      This patch fixes this issue by restoring the implementation of
      smc_smcr_terminate_all() to what it was before commit 349d4312
      ("net/smc: fix kernel panic caused by race of smc_sock"). The original
      implementation also satisfies the intention of making sure the QP is
      destroyed earlier than the CQ, because we always wait for smcibdev->lnk_cnt
      to reach zero, which guarantees that the QP has been destroyed.
      
      Fixes: 349d4312 ("net/smc: fix kernel panic caused by race of smc_sock")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: update fib_info_cnt under spinlock protection · 0a6e6b3c
      Eric Dumazet authored
      In the past, free_fib_info() was supposed to be called
      under RTNL protection.
      
      This eventually was no longer the case.
      
      Instead of enforcing RTNL it seems we simply can
      move fib_info_cnt changes to occur when fib_info_lock
      is held.
      
      v2: David Laight suggested to update fib_info_cnt
      only when an entry is added/deleted to/from the hash table,
      as fib_info_cnt is used to make sure hash table size
      is optimal.
      
      BUG: KCSAN: data-race in fib_create_info / free_fib_info
      
      write to 0xffffffff86e243a0 of 4 bytes by task 26429 on cpu 0:
       fib_create_info+0xe78/0x3440 net/ipv4/fib_semantics.c:1428
       fib_table_insert+0x148/0x10c0 net/ipv4/fib_trie.c:1224
       fib_magic+0x195/0x1e0 net/ipv4/fib_frontend.c:1087
       fib_add_ifaddr+0xd0/0x2e0 net/ipv4/fib_frontend.c:1109
       fib_netdev_event+0x178/0x510 net/ipv4/fib_frontend.c:1466
       notifier_call_chain kernel/notifier.c:83 [inline]
       raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:391
       __dev_notify_flags+0x1d3/0x3b0
       dev_change_flags+0xa2/0xc0 net/core/dev.c:8872
       do_setlink+0x810/0x2410 net/core/rtnetlink.c:2719
       rtnl_group_changelink net/core/rtnetlink.c:3242 [inline]
       __rtnl_newlink net/core/rtnetlink.c:3396 [inline]
       rtnl_newlink+0xb10/0x13b0 net/core/rtnetlink.c:3506
       rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5571
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2496
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5589
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x5fc/0x6c0 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x726/0x840 net/netlink/af_netlink.c:1921
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2409
       ___sys_sendmsg net/socket.c:2463 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2492
       __do_sys_sendmsg net/socket.c:2501 [inline]
       __se_sys_sendmsg net/socket.c:2499 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2499
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffffffff86e243a0 of 4 bytes by task 31505 on cpu 1:
       free_fib_info+0x35/0x80 net/ipv4/fib_semantics.c:252
       fib_info_put include/net/ip_fib.h:575 [inline]
       nsim_fib4_rt_destroy drivers/net/netdevsim/fib.c:294 [inline]
       nsim_fib4_rt_replace drivers/net/netdevsim/fib.c:403 [inline]
       nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:431 [inline]
       nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
       nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
       nsim_fib_event_work+0x15ca/0x2cf0 drivers/net/netdevsim/fib.c:1477
       process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
       process_scheduled_works kernel/workqueue.c:2361 [inline]
       worker_thread+0x7df/0xa70 kernel/workqueue.c:2447
       kthread+0x2c7/0x2e0 kernel/kthread.c:327
       ret_from_fork+0x1f/0x30
      
      value changed: 0x00000d2d -> 0x00000d2e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 31505 Comm: kworker/1:21 Not tainted 5.16.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events nsim_fib_event_work
      
      Fixes: 48bb9eb4 ("netdevsim: fib: Add dummy implementation for FIB offload")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Ido Schimmel <idosch@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 15 Jan, 2022 7 commits
    • net/smc: Remove unused function declaration · 9404bc1e
      Wen Gu authored
      The declaration of smc_wr_tx_dismiss_slots() is unused.
      So remove it.
      
      Fixes: 349d4312 ("net/smc: fix kernel panic caused by race of smc_sock")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: wwan: Fix MRU mismatch issue which may lead to data connection lost · f542cdfa
      Slark Xiao authored
      In pci_generic.c there is an 'mru_default' field in struct mhi_pci_dev_info.
      When a specific product sets this value, it should be used throughout mhi.
      However, mhi_net_rx_refill_work() still uses the hard-coded value MHI_DEFAULT_MRU.
      'mru_default' should take priority over MHI_DEFAULT_MRU.
      After checking, this change could help fix a data connection loss issue.
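
The priority rule amounts to a simple fallback. The sketch below is an illustration only: `DEFAULT_MRU` and its value are placeholders, and `effective_mru()` is a hypothetical helper, not a function from the driver.

```c
#include <stdint.h>

/* Placeholder fallback used when the device table gives no MRU. */
#define DEFAULT_MRU 3500u

/* A per-product MRU, when non-zero, wins over the compiled-in default. */
static uint32_t effective_mru(uint32_t configured_mru)
{
    return configured_mru ? configured_mru : DEFAULT_MRU;
}
```

Every buffer-sizing path would need to use the same selection, otherwise receive buffers can end up smaller than what the device sends.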
      
      Fixes: 5c2c8531 ("bus: mhi: pci-generic: configurable network interface MRU")
      Signed-off-by: Shujun Wang <wsj20369@163.com>
      Signed-off-by: Slark Xiao <slark_xiao@163.com>
      Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell: add Marvell specific PHY loopback · 020a45af
      Mohammad Athari Bin Ismail authored
      The existing genphy_loopback() is not applicable to Marvell PHYs. Besides
      configuring bit-6 and bit-13 in Page 0 Register 0 (Copper Control
      Register), it is also required to configure the same bits in Page 2
      Register 21 (MAC Specific Control Register 2) according to the speed at
      which the loopback is operating.
      
      Tested working on Marvell88E1510 PHY for all speeds (1000/100/10Mbps).
      
      FIXME: Based on trial-and-error testing, it seems 1G needs a delay between
      soft reset and loopback enablement.
      
      Fixes: 014068dc ("net: phy: genphy_loopback: add link speed configuration")
      Cc: <stable@vger.kernel.org> # 5.15.x
      Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: sun4i-emac: Fix an error handling path in emac_probe() · 9a9acdcc
      Christophe JAILLET authored
      A dma_request_chan() call is hidden in emac_configure_dma().
      It must be released in the probe if an error occurs, as already done in
      the remove function.
      
      Add the corresponding dma_release_channel() call.
      
      Fixes: 47869e82 ("sun4i-emac.c: add dma support")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: mtk_eth_soc: fix error checking in mtk_mac_config() · 214b3369
      Tom Rix authored
      Clang static analysis reports this problem
      mtk_eth_soc.c:394:7: warning: Branch condition evaluates
        to a garbage value
                      if (err)
                          ^~~
      
      err is not initialized and only conditionally set.
      So initialize err.
      
      Fixes: 7e538372 ("net: ethernet: mediatek: Re-add support SGMII")
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: don't dereference NULL pointers with shared tc filters · 80f15f3b
      Vladimir Oltean authored
      The following command sequence:
      
      tc qdisc del dev swp0 clsact
      tc qdisc add dev swp0 ingress_block 1 clsact
      tc qdisc add dev swp1 ingress_block 1 clsact
      tc filter add block 1 flower action drop
      tc qdisc del dev swp0 clsact
      
      produces the following NPD:
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000014
      pc : vcap_entry_set+0x14/0x70
      lr : ocelot_vcap_filter_del+0x198/0x234
      Call trace:
       vcap_entry_set+0x14/0x70
       ocelot_vcap_filter_del+0x198/0x234
       ocelot_cls_flower_destroy+0x94/0xe4
       felix_cls_flower_del+0x70/0x84
       dsa_slave_setup_tc_block_cb+0x13c/0x60c
       dsa_slave_setup_tc_block_cb_ig+0x20/0x30
       tc_setup_cb_reoffload+0x44/0x120
       fl_reoffload+0x280/0x320
       tcf_block_playback_offloads+0x6c/0x184
       tcf_block_unbind+0x80/0xe0
       tcf_block_setup+0x174/0x214
       tcf_block_offload_cmd.isra.0+0x100/0x13c
       tcf_block_offload_unbind+0x5c/0xa0
       __tcf_block_put+0x54/0x174
       tcf_block_put_ext+0x5c/0x74
       clsact_destroy+0x40/0x60
       qdisc_destroy+0x4c/0x150
       qdisc_put+0x70/0x90
       qdisc_graft+0x3f0/0x4c0
       tc_get_qdisc+0x1cc/0x364
       rtnetlink_rcv_msg+0x124/0x340
      
      The reason is that the driver isn't prepared to receive two tc filters
      with the same cookie. It unconditionally creates a new struct
      ocelot_vcap_filter for each tc filter, and it adds all filters with the
      same identifier (cookie) to the ocelot_vcap_block.
      
      The problem is here, in ocelot_vcap_filter_del():
      
      	/* Gets index of the filter */
      	index = ocelot_vcap_block_get_filter_index(block, filter);
      	if (index < 0)
      		return index;
      
      	/* Delete filter */
      	ocelot_vcap_block_remove_filter(ocelot, block, filter);
      
      	/* Move up all the blocks over the deleted filter */
      	for (i = index; i < block->count; i++) {
      		struct ocelot_vcap_filter *tmp;
      
      		tmp = ocelot_vcap_block_find_filter_by_index(block, i);
      		vcap_entry_set(ocelot, i, tmp);
      	}
      
      What will happen is that ocelot_vcap_block_get_filter_index() will return
      the index (@index) of the first filter found with that cookie. This is _not_
      the index of _this_ filter, but the other one with the same cookie,
      because ocelot_vcap_filter_equal() gets fooled.
      
      Then later, ocelot_vcap_block_remove_filter() is coded to remove all
      filters that are ocelot_vcap_filter_equal() with the passed @filter.
      So unexpectedly, both filters get deleted from the list.
      
      Then ocelot_vcap_filter_del() will attempt to move all the other filters
      up, again finding them by index (@i). The block count is 2, @index was 0,
      so it will attempt to move up filter @i=0 and @i=1. It assigns tmp =
      ocelot_vcap_block_find_filter_by_index(block, i), which is now a NULL
      pointer because ocelot_vcap_block_remove_filter() has removed more than
      one filter.
      
      As far as I can see, this problem has been there since the introduction
      of tc offload support, however I cannot test beyond the blamed commit
      due to hardware availability. In any case, any fix cannot be backported
      that far, due to lots of changes to the code base.
      
      Therefore, let's go for the correct solution, which is to not call
      ocelot_vcap_filter_add() and ocelot_vcap_filter_del(), unless the filter
      is actually unique and not shared. For the shared filters, we should
      just modify the ingress port mask and call ocelot_vcap_filter_replace(),
      a function introduced by commit 95706be1 ("net: mscc: ocelot: create
      a function that replaces an existing VCAP filter"). This way,
      block->rules will only contain filters with unique cookies, by design.
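
The approach of keeping one filter per cookie and only adjusting its ingress port mask can be sketched as follows. The struct and function names are hypothetical and the hardware programming is reduced to an `installed` flag; this is not the driver's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry per cookie; ports sharing the filter are bits in the mask. */
struct shared_filter {
    unsigned long cookie;
    uint32_t ingress_port_mask;
    bool installed;
};

/*
 * Bind a port (0..31) to the filter. Returns true when this is the first
 * user, i.e. the case where a real "add" would be needed; otherwise only
 * the port mask changes (the "replace" case).
 */
static bool filter_bind_port(struct shared_filter *f, unsigned int port)
{
    bool first = (f->ingress_port_mask == 0);

    f->ingress_port_mask |= 1u << port;
    f->installed = true;
    return first;
}

/*
 * Unbind a port. Returns true when the last user is gone, i.e. the only
 * case where a real "delete" would be issued.
 */
static bool filter_unbind_port(struct shared_filter *f, unsigned int port)
{
    f->ingress_port_mask &= ~(1u << port);
    if (f->ingress_port_mask == 0) {
        f->installed = false;
        return true;
    }
    return false;
}
```

With this shape, removing the qdisc from swp0 in the command sequence above only clears one bit, and the filter stays installed for swp1.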
      
      Fixes: 07d985ee ("net: dsa: felix: Wire up the ocelot cls_flower methods")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress · 9d6d7f1c
      Eric Dumazet authored
      wait_for_unix_gc() reads unix_tot_inflight & gc_in_progress
      without synchronization.
      
      Adds READ_ONCE()/WRITE_ONCE() and their associated comments
      to better document the intent.
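
The annotation pattern can be sketched in user space as below. The macros are simplified stand-ins that assume a GCC or Clang compiler (`__typeof__`); the kernel's own definitions are more involved. The variable and function names are hypothetical, and the writer is assumed to be serialized by a lock that is not shown.

```c
/* Simplified stand-ins for the kernel macros (GCC/Clang __typeof__ assumed). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static unsigned int tot_inflight;   /* written under a lock (not shown) */
static int gc_running;

/* Writer side: marked so the store is not torn or duplicated. */
static void inflight_inc(void)
{
    WRITE_ONCE(tot_inflight, tot_inflight + 1);
}

/* Lockless reader: each value is loaded exactly once per evaluation. */
static int should_start_gc(unsigned int threshold)
{
    return READ_ONCE(tot_inflight) > threshold && !READ_ONCE(gc_running);
}
```

The annotations do not add ordering or mutual exclusion; they mark the accesses as intentionally racy so that tools such as KCSAN and human readers can tell them apart from bugs.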
      
      BUG: KCSAN: data-race in unix_inflight / wait_for_unix_gc
      
      write to 0xffffffff86e2b7c0 of 4 bytes by task 9380 on cpu 0:
       unix_inflight+0x1e8/0x260 net/unix/scm.c:63
       unix_attach_fds+0x10c/0x1e0 net/unix/scm.c:121
       unix_scm_to_skb net/unix/af_unix.c:1674 [inline]
       unix_dgram_sendmsg+0x679/0x16b0 net/unix/af_unix.c:1817
       unix_seqpacket_sendmsg+0xcc/0x110 net/unix/af_unix.c:2258
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2409
       ___sys_sendmsg net/socket.c:2463 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2549
       __do_sys_sendmmsg net/socket.c:2578 [inline]
       __se_sys_sendmmsg net/socket.c:2575 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2575
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffffffff86e2b7c0 of 4 bytes by task 9375 on cpu 1:
       wait_for_unix_gc+0x24/0x160 net/unix/garbage.c:196
       unix_dgram_sendmsg+0x8e/0x16b0 net/unix/af_unix.c:1772
       unix_seqpacket_sendmsg+0xcc/0x110 net/unix/af_unix.c:2258
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2409
       ___sys_sendmsg net/socket.c:2463 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2549
       __do_sys_sendmmsg net/socket.c:2578 [inline]
       __se_sys_sendmmsg net/socket.c:2575 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2575
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000002 -> 0x00000004
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 9375 Comm: syz-executor.1 Not tainted 5.16.0-rc7-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 9915672d ("af_unix: limit unix_tot_inflight")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220114164328.2038499-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 14 Jan, 2022 7 commits
  7. 13 Jan, 2022 7 commits
    • net_sched: restore "mpu xxx" handling · fb80445c
      Kevin Bracey authored
      commit 56b765b7 ("htb: improved accuracy at high rates") broke
      "overhead X", "linklayer atm" and "mpu X" attributes.
      
      "overhead X" and "linklayer atm" have already been fixed. This restores
      the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping:
      
          tc class add ... htb rate X overhead 4 mpu 64
      
      The code being fixed is used by htb, tbf and act_police. Cake has its
      own mpu handling. qdisc_calculate_pkt_len still uses the size table
      containing values adjusted for mpu by user space.
      
      iproute2 tc has always passed mpu into the kernel via a tc_ratespec
      structure, but the kernel never directly acted on it, merely stored it
      so that it could be read back by `tc class show`.
      
      Rather, tc would generate length-to-time tables that included the mpu
      (and linklayer) in their construction, and the kernel used those tables.
      
      Since v3.7, the tables were no longer used. Along with "mpu", this also
      broke "overhead" and "linklayer" which were fixed in 01cb71d2
      ("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84
      ("net_sched: restore "linklayer atm" handling", v3.11).
      
      "overhead" was fixed by simply restoring use of tc_ratespec::overhead -
      this had originally been used by the kernel but was initially omitted
      from the new non-table-based calculations.
      
      "linklayer" had been handled in the table like "mpu", but the mode was
      not originally passed in tc_ratespec. The new implementation was made to
      handle it by getting new versions of tc to pass the mode in an extended
      tc_ratespec, and for older versions of tc the table contents were analysed
      at load time to deduce linklayer.
      
      As "mpu" has always been given to the kernel in tc_ratespec,
      accompanying the mpu-based table, we can restore system functionality
      with no userspace change by making the kernel act on the tc_ratespec
      value.
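As a rough illustration of what "acting on the tc_ratespec value" could look like, here is a minimal user-space sketch. The struct and function names are hypothetical and are not the kernel's; it assumes the rate is expressed in bytes per second and ignores linklayer handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's rate configuration. */
struct demo_rate_cfg {
	uint64_t rate_bytes_ps; /* shaping rate in bytes per second */
	uint16_t overhead;      /* per-packet overhead in bytes ("overhead X") */
	uint16_t mpu;           /* minimum packet unit in bytes ("mpu X") */
};

/* Number of bytes the shaper charges for a packet of pkt_len bytes. */
static uint32_t demo_charged_len(const struct demo_rate_cfg *r, uint32_t pkt_len)
{
	uint32_t len = pkt_len + r->overhead;

	/* Packets shorter than the mpu are charged as if they were mpu bytes. */
	if (len < r->mpu)
		len = r->mpu;
	return len;
}

/* Transmission time in nanoseconds at the configured rate. */
static uint64_t demo_len_to_ns(const struct demo_rate_cfg *r, uint32_t pkt_len)
{
	assert(r->rate_bytes_ps != 0); /* avoid division by zero */
	return (uint64_t)demo_charged_len(r, pkt_len) * 1000000000ULL /
	       r->rate_bytes_ps;
}
```

With overhead 4 and mpu 64, as in the tc example above, a 40-byte packet is charged as 64 bytes and a 100-byte packet as 104 bytes.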
      
      Fixes: 56b765b7 ("htb: improved accuracy at high rates")
      Signed-off-by: Kevin Bracey <kevin@bracey.fi>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: Vimalkumar <j.vimal@gmail.com>
      Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fb80445c
    • Kyoungkyu Park's avatar
      net: qmi_wwan: Add Hucom Wireless HM-211S/K · a6fadfd7
      Kyoungkyu Park authored
      The Hucom Wireless HM-211S/K is an LTE module based on Qualcomm MDM9207.
      This module supports LTE Band 1, 3, 5, 7, 8 and WCDMA Band 1.
      
      Manual testing showed that only interface
      number two replies to QMI messages.
      
      T:  Bus=01 Lev=02 Prnt=02 Port=01 Cnt=01 Dev#=  3 Spd=480  MxCh= 0
      D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
      P:  Vendor=22de ProdID=9051 Rev= 3.18
      S:  Manufacturer=Android
      S:  Product=Android
      S:  SerialNumber=0123456789ABCDEF
      C:* #Ifs= 4 Cfg#= 1 Atr=80 MxPwr=500mA
      I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
      E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      I:* If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=(none)
      E:  Ad=83(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
      E:  Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
      E:  Ad=85(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
      E:  Ad=84(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none)
      E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      E:  Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
      Signed-off-by: Kyoungkyu Park <choryu.park@choryu.space>
      Acked-by: Bjørn Mork <bjorn@mork.no>
      Link: https://lore.kernel.org/r/Yd+nxAA6KorDpQFv@choryu-tfx5470h
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a6fadfd7
    • Wen Gu's avatar
      net/smc: Resolve the race between SMC-R link access and clear · 20c9398d
      Wen Gu authored
      We encountered some crashes caused by a race between SMC-R
      link access and link clearing, triggered by abnormal link
      group termination such as a port error.
      
      Here is an example of this kind of crash:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       Workqueue: smc_hs_wq smc_listen_work [smc]
       RIP: 0010:smc_llc_flow_initiate+0x44/0x190 [smc]
       Call Trace:
        <TASK>
        ? __smc_buf_create+0x75a/0x950 [smc]
        smcr_lgr_reg_rmbs+0x2a/0xbf [smc]
        smc_listen_work+0xf72/0x1230 [smc]
        ? process_one_work+0x25c/0x600
        process_one_work+0x25c/0x600
        worker_thread+0x4f/0x3a0
        ? process_one_work+0x600/0x600
        kthread+0x15d/0x1a0
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x1f/0x30
        </TASK>
      
      smc_listen_work()                     __smc_lgr_terminate()
      ---------------------------------------------------------------
                                          | smc_lgr_free()
                                          |  |- smcr_link_clear()
                                          |      |- memset(lnk, 0)
      smc_listen_rdma_reg()               |
       |- smcr_lgr_reg_rmbs()             |
           |- smc_llc_flow_initiate()     |
               |- access lnk->lgr (panic) |
      
      These crashes are all caused by clearing SMC-R link
      resources while some functions are still accessing them.
      This patch tries to fix the issue by introducing a reference
      count for SMC-R links and ensuring that the sensitive resources
      of a link are not cleared until its reference count reaches zero.
      
      The operations on the SMC-R link reference count can be summarized
      as follows:
      
      object          [hold or initialized as 1]         [put]
      --------------------------------------------------------------------
      links           smcr_link_init()                   smcr_link_clear()
      connections     smc_conn_create()                  smc_conn_free()
      
      This way, an SMC-R link is cleared only after all the smc
      connections above it have been freed, which avoids unsafe
      references to SMC-R links.
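The hold/put scheme in the table above can be sketched roughly as follows. This is a single-threaded user-space illustration with hypothetical names; the kernel would use an atomic refcount_t, and the real link structure holds far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an SMC-R link; not the kernel's struct. */
struct demo_link {
	int refcnt;   /* plain int for the sketch; not safe across threads */
	bool cleared; /* set once the sensitive state has been wiped */
	void *lgr;    /* the pointer that racing readers dereference */
};

static void demo_link_init(struct demo_link *lnk, void *lgr)
{
	memset(lnk, 0, sizeof(*lnk));
	lnk->lgr = lgr;
	lnk->refcnt = 1; /* initial reference, dropped by demo_link_clear() */
}

static void demo_link_hold(struct demo_link *lnk)
{
	assert(lnk->refcnt > 0);
	lnk->refcnt++;
}

/* Wipe the sensitive state only when the last reference goes away. */
static void demo_link_put(struct demo_link *lnk)
{
	assert(lnk->refcnt > 0);
	if (--lnk->refcnt == 0) {
		lnk->lgr = NULL;
		lnk->cleared = true;
	}
}

/* Termination path: drop the initial reference instead of wiping at once. */
static void demo_link_clear(struct demo_link *lnk)
{
	demo_link_put(lnk);
}
```

In this sketch a connection would call demo_link_hold() when it is created and demo_link_put() when it is freed, so a termination that runs in between leaves lnk->lgr intact until the connection is gone.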
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20c9398d
    • Wen Gu's avatar
      net/smc: Introduce a new conn->lgr validity check helper · ea89c6c0
      Wen Gu authored
      Checking whether conn->lgr is NULL is no longer a suitable way
      to tell whether a smc connection is registered in a link group,
      because conn->lgr is not reset even when the connection is
      unregistered from a link group.
      
      So this patch introduces a new helper, smc_conn_lgr_valid(), and
      replaces all checks of conn->lgr in the original implementation
      with the new helper to judge whether conn->lgr is valid to use.
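A minimal sketch of the idea, under stated assumptions: the commit text does not say which criterion smc_conn_lgr_valid() uses, so the `registered` flag and all names below are hypothetical and only illustrate why a NULL check alone is no longer enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_lgr {
	int id;
};

struct demo_conn {
	struct demo_lgr *lgr; /* stays set after unregistration */
	bool registered;      /* hypothetical flag, cleared on unregister */
};

/* Valid only if the pointer is set AND the connection is still registered. */
static bool demo_conn_lgr_valid(const struct demo_conn *conn)
{
	assert(conn != NULL);
	return conn->lgr != NULL && conn->registered;
}

static void demo_unregister_conn(struct demo_conn *conn)
{
	assert(conn != NULL);
	conn->registered = false; /* conn->lgr is intentionally left alone */
}
```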
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea89c6c0
    • Eric Dumazet's avatar
      inet: frags: annotate races around fqdir->dead and fqdir->high_thresh · 91341fa0
      Eric Dumazet authored
      Both fields can be read/written without synchronization;
      add proper accessors and documentation.
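The commit text does not name the accessors; presumably they are in the style of the kernel's READ_ONCE()/WRITE_ONCE(). The following user-space sketch approximates that pattern with volatile casts. The macros, struct, and helper names are hypothetical, and `__typeof__` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Rough user-space approximations of READ_ONCE()/WRITE_ONCE(). */
#define DEMO_READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in holding the two annotated fields. */
struct demo_fqdir {
	long high_thresh; /* may be changed while readers run locklessly */
	bool dead;        /* set once at teardown, read without locks */
};

static long demo_fqdir_high_thresh(struct demo_fqdir *f)
{
	assert(f != NULL);
	return DEMO_READ_ONCE(f->high_thresh);
}

static void demo_fqdir_set_high_thresh(struct demo_fqdir *f, long v)
{
	assert(f != NULL);
	DEMO_WRITE_ONCE(f->high_thresh, v);
}

static bool demo_fqdir_is_dead(struct demo_fqdir *f)
{
	assert(f != NULL);
	return DEMO_READ_ONCE(f->dead);
}

static void demo_fqdir_mark_dead(struct demo_fqdir *f)
{
	assert(f != NULL);
	DEMO_WRITE_ONCE(f->dead, true);
}
```

The volatile access forces the compiler to perform each marked load and store exactly once; it documents the intentional race but does not add any ordering or locking.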
      
      Fixes: d5dd8879 ("inet: fix various use-after-free in defrags units")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91341fa0
    • David S. Miller's avatar
      Merge branch 'smc-race-fixes' · 3ba8c625
      David S. Miller authored
      Wen Gu says:
      
      ====================
      net/smc: Fixes for race in smc link group termination
      
      We encountered some crashes recently, caused by a race between
      the access and the freeing of links/link groups during abnormal
      smc link group termination. The crashes can be reproduced by
      frequently triggering abnormal link group termination, for example
      by setting RNICs up/down.
      
      This set of patches tries to fix this by extending the life cycle
      of links/link groups to ensure that they won't be referred to after
      being cleared or freed.
      
      v1 -> v2:
      - Improve some comments.
      
      - Move the code that wakes up the lgrs_deleted wait queue from
        smc_lgr_free() to __smc_lgr_free().
      
      - Move the code that wakes up the links_deleted wait queue from
        smcr_link_clear() to __smcr_link_clear().
      
      - Move the smc_ibdev_cnt_dec() and put_device() calls from
        smcr_link_clear() to __smcr_link_clear().
      
      - Move smc_lgr_put() to the end of __smcr_link_clear().
      
      - Call smc_lgr_put() after 'out' tag in smcr_link_init() when link
        initialization fails.
      
      - Modify the location where smc connection holds the lgr or link.
      
          before:
            * hold lgr in smc_lgr_register_conn().
            * hold link in smcr_lgr_conn_assign_link().
          after:
            * hold both lgr and link in smc_conn_create().
      
        This makes the location symmetrical with the place where smc
        connections put the lgr or link, which is smc_conn_free().
      
      - Initialize conn->freed as zero in smc_conn_create().
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ba8c625
    • Wen Gu's avatar
      net/smc: Resolve the race between link group access and termination · 61f434b0
      Wen Gu authored
      We encountered some crashes caused by a race between the access
      and the termination of link groups.
      
      Here are some of the panic stacks we hit:
      
      1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()
      
       BUG: kernel NULL pointer dereference, address: 00000000000002f0
       Workqueue: smc_hs_wq smc_listen_work [smc]
       RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
       Call Trace:
        <TASK>
        ? smc_clc_send_accept+0x45/0xa0 [smc]
        ? smc_clc_send_accept+0x45/0xa0 [smc]
        smc_listen_work+0x783/0x1220 [smc]
        ? finish_task_switch+0xc4/0x2e0
        ? process_one_work+0x1ad/0x3c0
        process_one_work+0x1ad/0x3c0
        worker_thread+0x4c/0x390
        ? rescuer_thread+0x320/0x320
        kthread+0x149/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x1f/0x30
        </TASK>
      
      smc_listen_work()                abnormal case like port error
      ---------------------------------------------------------------
                                      | __smc_lgr_terminate()
                                      |  |- smc_conn_kill()
                                      |      |- smc_lgr_unregister_conn()
                                      |          |- set conn->lgr = NULL
      smc_clc_wait_msg()              |
       |- access conn->lgr (panic)    |
      
      2) Race between smc_setsockopt() and __smc_lgr_terminate()
      
       BUG: kernel NULL pointer dereference, address: 00000000000002e8
       RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        </TASK>
      
      smc_setsockopt()                 abnormal case like port error
      --------------------------------------------------------------
                                      | __smc_lgr_terminate()
                                      |  |- smc_conn_kill()
                                      |      |- smc_lgr_unregister_conn()
                                      |          |- set conn->lgr = NULL
      mod_delayed_work()              |
       |- access conn->lgr (panic)    |
      
      There are other panic sites with a similar cause to those
      described above: accessing a link group after its termination
      and thus getting a NULL pointer or an invalid resource.
      
      Currently, there seems to be no synchronization between link
      group access and a sudden termination of the link group. This
      patch tries to fix this by introducing a reference count for the
      link group and not freeing it until the reference count is zero.
      
      A link group might be referred to by links or smc connections. So
      the operations on the link group reference count can be summarized
      as follows:
      
      object          [hold or initialized as 1]       [put]
      -------------------------------------------------------------------
      link group      smc_lgr_create()                 smc_lgr_free()
      connections     smc_conn_create()                smc_conn_free()
      links           smcr_link_init()                 smcr_link_clear()
      
      This way, we extend the life cycle of the link group and
      ensure it is longer than the life cycle of the connections and
      links above it, which avoids invalid access to the link group
      after its termination.
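The lifetime relationship in the table above can be sketched as follows, using a connection as the example holder. This is a single-threaded illustration with hypothetical names; the kernel would use an atomic refcount_t and actually free the memory rather than set a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's structures. */
struct demo_lgr {
	int refcnt; /* plain int for the sketch; not safe across threads */
	bool freed; /* stands in for actually releasing the memory */
};

struct demo_conn {
	struct demo_lgr *lgr;
};

static void demo_lgr_create(struct demo_lgr *lgr)
{
	lgr->refcnt = 1; /* creation reference, dropped on termination */
	lgr->freed = false;
}

static void demo_lgr_hold(struct demo_lgr *lgr)
{
	assert(lgr->refcnt > 0);
	lgr->refcnt++;
}

static void demo_lgr_put(struct demo_lgr *lgr)
{
	assert(lgr->refcnt > 0);
	if (--lgr->refcnt == 0)
		lgr->freed = true;
}

/* A connection pins its link group for its whole lifetime. */
static void demo_conn_create(struct demo_conn *conn, struct demo_lgr *lgr)
{
	demo_lgr_hold(lgr);
	conn->lgr = lgr;
}

static void demo_conn_free(struct demo_conn *conn)
{
	demo_lgr_put(conn->lgr);
}

/* Termination only drops the creation reference. */
static void demo_lgr_terminate(struct demo_lgr *lgr)
{
	demo_lgr_put(lgr);
}
```

If termination runs while a connection still exists, the link group survives until demo_conn_free() drops the last reference, so the connection never sees a freed group.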
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61f434b0