1. 30 Apr, 2021 15 commits
  2. 29 Apr, 2021 25 commits
    • net: dsa: ksz: ksz8863_smi_probe: set proper return value for ksz_switch_alloc() · d4eecfb2
      Oleksij Rempel authored
      ksz_switch_alloc() returns NULL only if the allocation fails, so the
      proper return value is -ENOMEM.
      
      Fixes: 60a36476 ("net: dsa: microchip: Add Microchip KSZ8863 SMI based driver support")
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: ksz: ksz8795_spi_probe: fix possible NULL pointer dereference · ba46b576
      Oleksij Rempel authored
      Fix a possible NULL pointer dereference in case devm_kzalloc() fails to
      allocate memory.
      
      Fixes: cc13e52c ("net: dsa: microchip: Add Microchip KSZ8863 SPI based driver support")
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: ksz: ksz8863_smi_probe: fix possible NULL pointer dereference · d27f0201
      Oleksij Rempel authored
      Fix possible NULL pointer dereference in case devm_kzalloc() failed to
      allocate memory.
      
      Fixes: 60a36476 ("net: dsa: microchip: Add Microchip KSZ8863 SMI based driver support")
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
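      The three ksz fixes above share one pattern: check every allocation
      before the first dereference, and report an allocation failure as
      -ENOMEM. The following is a minimal user-space sketch of that pattern,
      not the driver code. All sketch_* names and the simulate_* parameters
      are hypothetical, and calloc() stands in for devm_kzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's private data and switch object. */
struct sketch_priv { int bus_id; };
struct sketch_switch { struct sketch_priv *priv; };

/* Returns NULL only when memory cannot be allocated. */
static struct sketch_switch *sketch_switch_alloc(struct sketch_priv *priv,
						 int simulate_oom)
{
	struct sketch_switch *sw;

	if (simulate_oom)
		return NULL;
	sw = calloc(1, sizeof(*sw));
	if (!sw)
		return NULL;
	sw->priv = priv;
	return sw;
}

static int sketch_probe(int simulate_priv_oom, int simulate_switch_oom)
{
	struct sketch_priv *priv;
	struct sketch_switch *sw;

	priv = simulate_priv_oom ? NULL : calloc(1, sizeof(*priv));
	if (!priv)
		return -ENOMEM;	/* check before the first dereference */
	priv->bus_id = 0;

	sw = sketch_switch_alloc(priv, simulate_switch_oom);
	if (!sw) {
		free(priv);
		return -ENOMEM;	/* NULL here can only mean "out of memory" */
	}

	free(sw);
	free(priv);
	return 0;
}

/* Usage: both failure paths report -ENOMEM instead of crashing. */
static void sketch_probe_demo(void)
{
	assert(sketch_probe(0, 0) == 0);
	assert(sketch_probe(1, 0) == -ENOMEM);
	assert(sketch_probe(0, 1) == -ENOMEM);
}
```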
    • bnx2x: Remove redundant assignment to err · 8343b1f8
      Yang Li authored
      Variable 'err' is set to -EIO but this value is never read as it is
      overwritten with a new value later on, hence it is a redundant
      assignment and can be removed.
      
      Clean up the following clang-analyzer warning:
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1195:2: warning: Value
      stored to 'err' is never read [clang-analyzer-deadcode.DeadStores]
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: macb: Remove redundant assignment to queue · bbf6acea
      Jiapeng Chong authored
      Variable queue is set to bp->queues but this value is not used as it
      is overwritten later on, hence the redundant assignment can be removed.
      
      Cleans up the following clang-analyzer warning:
      
      drivers/net/ethernet/cadence/macb_main.c:4919:21: warning: Value stored
      to 'queue' during its initialization is never read
      [clang-analyzer-deadcode.DeadStores].
      
      drivers/net/ethernet/cadence/macb_main.c:4832:21: warning: Value stored
      to 'queue' during its initialization is never read
      [clang-analyzer-deadcode.DeadStores].
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
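      The DeadStores warning quoted above fires when a variable is given a
      value that is replaced before anyone reads it. A small sketch of the
      before/after shape, assuming a loop that re-assigns the pointer in its
      init clause; the struct and function names are hypothetical and are not
      taken from macb_main.c.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_queue { int pending; };

/* Before: the initializer is a dead store, the loop assigns 'queue' again. */
static int sketch_total_before(struct sketch_queue *queues, size_t n)
{
	struct sketch_queue *queue = queues;	/* value is never read */
	size_t q;
	int total = 0;

	for (q = 0, queue = queues; q < n; ++q, ++queue)
		total += queue->pending;
	return total;
}

/* After: declare without an initializer, behavior is unchanged. */
static int sketch_total_after(struct sketch_queue *queues, size_t n)
{
	struct sketch_queue *queue;
	size_t q;
	int total = 0;

	for (q = 0, queue = queues; q < n; ++q, ++queue)
		total += queue->pending;
	return total;
}

/* Usage: both variants compute the same sum. */
static void sketch_dead_store_demo(void)
{
	struct sketch_queue qs[3] = { { 1 }, { 2 }, { 3 } };

	assert(sketch_total_before(qs, 3) == 6);
	assert(sketch_total_after(qs, 3) == 6);
}
```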
    • MAINTAINERS: move Murali Karicheri to credits · 57e1d820
      Michael Walle authored
      His email bounces with permanent error "550 Invalid recipient". His last
      email was from 2020-09-09 on the LKML and he seems to have left TI.
      Signed-off-by: Michael Walle <michael@walle.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MAINTAINERS: remove Wingman Kwok · 1c7600b7
      Michael Walle authored
      His email bounces with permanent error "550 Invalid recipient". His last
      email on the LKML was from 2015-10-22.
      Signed-off-by: Michael Walle <michael@walle.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hns3-fixes' · 2ce960f8
      David S. Miller authored
      Huazhong Tan says:
      
      ====================
      net: hns3: add some fixes for -net
      
      This series adds some fixes for the HNS3 ethernet driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: add check for HNS3_NIC_STATE_INITED in hns3_reset_notify_up_enet() · b4047aac
      Jian Shen authored
      In some cases, the device is not initialized because reset failed.
      If another task calls hns3_reset_notify_up_enet() before the reset
      is retried, it will cause an error due to an uninitialized pointer access.
      So add a check for HNS3_NIC_STATE_INITED before calling
      hns3_nic_net_open() in hns3_reset_notify_up_enet().
      
      Fixes: bb6b94a8 ("net: hns3: Add reset interface implementation in client")
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
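      The guard described above can be sketched in user space as follows.
      This is an illustration under stated assumptions, not the hns3 code:
      the sketch_* names are hypothetical, a plain bit mask stands in for
      test_bit(), and the error code returned on the guarded path is an
      assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_STATE_INITED (1UL << 0)	/* hypothetical state bit */

struct sketch_nic {
	unsigned long state;
	int *rings;	/* only valid once the device is initialized */
};

/* Touches per-device resources, so it must not run on an uninitialized NIC. */
static int sketch_nic_open(struct sketch_nic *nic)
{
	nic->rings[0] = 1;
	return 0;
}

static int sketch_reset_notify_up(struct sketch_nic *nic)
{
	/* Reset may have failed earlier; refuse to open in that case. */
	if (!(nic->state & SKETCH_STATE_INITED))
		return -EFAULT;	/* assumed error code */

	return sketch_nic_open(nic);
}

/* Usage: the uninitialized device is rejected instead of dereferencing NULL. */
static void sketch_reset_demo(void)
{
	int ring = 0;
	struct sketch_nic broken = { 0, NULL };
	struct sketch_nic ready = { SKETCH_STATE_INITED, &ring };

	assert(sketch_reset_notify_up(&broken) == -EFAULT);
	assert(sketch_reset_notify_up(&ready) == 0);
	assert(ring == 1);
}
```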
    • net: hns3: initialize the message content in hclge_get_link_mode() · 568a54bd
      Yufeng Mo authored
      The message sent to the VF should be initialized, otherwise random
      values in some of its fields may cause improper processing by the target.
      So add an initialization of the message in hclge_get_link_mode().
      
      Fixes: 9194d18b ("net: hns3: fix the problem that the supported port is empty")
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
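      A sketch of why the initialization matters: a message built on the
      stack keeps whatever bytes were there before unless it is cleared
      first. The message layout and the sketch_* names below are
      hypothetical and do not reflect the real mailbox format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MSG_DATA_LEN 8

/* Hypothetical mailbox message; only some fields are set per request. */
struct sketch_mbx_msg {
	uint16_t code;
	uint8_t data[SKETCH_MSG_DATA_LEN];
};

static void sketch_build_link_mode_msg(struct sketch_mbx_msg *msg, uint8_t mode)
{
	/* Clear everything first so unset fields reach the receiver as zero. */
	memset(msg, 0, sizeof(*msg));
	msg->code = 1;		/* hypothetical message code */
	msg->data[0] = mode;
}

/* Usage: stale bytes in the buffer do not leak into the message. */
static void sketch_msg_demo(void)
{
	struct sketch_mbx_msg msg;
	size_t i;

	memset(&msg, 0xAA, sizeof(msg));	/* simulate stale stack content */
	sketch_build_link_mode_msg(&msg, 3);

	assert(msg.code == 1);
	assert(msg.data[0] == 3);
	for (i = 1; i < SKETCH_MSG_DATA_LEN; i++)
		assert(msg.data[i] == 0);
}
```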
    • net: hns3: fix incorrect configuration for igu_egu_hw_err · 2867298d
      Yufeng Mo authored
      According to the UM, the type and enable status of igu_egu_hw_err
      should be configured separately. Currently, the type field is
      incorrect when this error is disabled. So fix it by configuring these
      two fields separately.
      
      Fixes: bf1faf94 ("net: hns3: Add enable and process hw errors from IGU, EGU and NCSI")
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
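      One way to read "configure the two fields separately" is sketched
      below. The bit masks and function names are hypothetical, and the
      "before" variant is only an assumed shape of the old behavior inferred
      from the commit text: the type bits were written only together with
      the enable bits, so disabling the error also cleared the type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ERR_INT_TYPE 0x00000660u	/* hypothetical type field */
#define SKETCH_ERR_INT_EN   0x0000000Fu	/* hypothetical enable field */

/* Assumed old shape: type and enable are written as one combined value. */
static uint32_t sketch_cfg_before(bool en)
{
	return en ? (SKETCH_ERR_INT_TYPE | SKETCH_ERR_INT_EN) : 0;
}

/* Fixed shape: the type is always set, the enable bits depend on 'en'. */
static uint32_t sketch_cfg_after(bool en)
{
	uint32_t val = SKETCH_ERR_INT_TYPE;

	if (en)
		val |= SKETCH_ERR_INT_EN;
	return val;
}

/* Usage: after the change, disabling keeps the type field intact. */
static void sketch_cfg_demo(void)
{
	assert(sketch_cfg_before(false) == 0);
	assert((sketch_cfg_after(false) & SKETCH_ERR_INT_TYPE) == SKETCH_ERR_INT_TYPE);
	assert((sketch_cfg_after(false) & SKETCH_ERR_INT_EN) == 0);
	assert(sketch_cfg_after(true) == sketch_cfg_before(true));
}
```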
    • net: Remove redundant assignment to err · 1a70f659
      Yang Li authored
      Variable 'err' is set to -ENOMEM but this value is never read as it is
      overwritten with a new value later on, hence the 'if' statements and
      assignments are redundant and can be removed.
      
      Cleans up the following clang-analyzer warning:
      
      net/ipv6/seg6.c:126:4: warning: Value stored to 'err' is never read
      [clang-analyzer-deadcode.DeadStores]
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: Fix possible races between assigning rx_handler_data and setting IFF_BRIDGE_PORT bit · 59259ff7
      Zhang Zhengming authored
      There is a crash in the function br_get_link_af_size_filtered(),
      as br_port_exists(dev) is true while the rx_handler_data of dev is NULL.
      However, the rx_handler_data of dev is saved correctly in the vmcore.
      
      The oops looks something like:
       ...
       pc : br_get_link_af_size_filtered+0x28/0x1c8 [bridge]
       ...
       Call trace:
        br_get_link_af_size_filtered+0x28/0x1c8 [bridge]
        if_nlmsg_size+0x180/0x1b0
        rtnl_calcit.isra.12+0xf8/0x148
        rtnetlink_rcv_msg+0x334/0x370
        netlink_rcv_skb+0x64/0x130
        rtnetlink_rcv+0x28/0x38
        netlink_unicast+0x1f0/0x250
        netlink_sendmsg+0x310/0x378
        sock_sendmsg+0x4c/0x70
        __sys_sendto+0x120/0x150
        __arm64_sys_sendto+0x30/0x40
        el0_svc_common+0x78/0x130
        el0_svc_handler+0x38/0x78
        el0_svc+0x8/0xc
      
      In br_add_if(), we found there is no guarantee that
      assigning rx_handler_data to dev->rx_handler_data
      happens before setting the IFF_BRIDGE_PORT bit of priv_flags.
      So there is a possible data race:
      
      CPU 0:                                                        CPU 1:
      (RCU read lock)                                               (RTNL lock)
      rtnl_calcit()                                                 br_add_slave()
        if_nlmsg_size()                                               br_add_if()
          br_get_link_af_size_filtered()                              -> netdev_rx_handler_register
                                                                          ...
                                                                          // The order is not guaranteed
            ...                                                           -> dev->priv_flags |= IFF_BRIDGE_PORT;
            // The IFF_BRIDGE_PORT bit of priv_flags has been set
            -> if (br_port_exists(dev)) {
              // The dev->rx_handler_data has NOT been assigned
              -> p = br_port_get_rcu(dev);
              ....
                                                                          -> rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
                                                                           ...
      
      Fix it in br_get_link_af_size_filtered, using br_port_get_check_rcu() and checking the return value.
      Signed-off-by: Zhang Zhengming <zhangzhengming@huawei.com>
      Reviewed-by: Zhao Lei <zhaolei69@huawei.com>
      Reviewed-by: Wang Xiaogang <wangxiaogang3@huawei.com>
      Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
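      The reader-side fix can be sketched as follows: fetch the port pointer
      through a checking helper and test the returned pointer, instead of
      trusting the flag alone. This is a single-threaded illustration with
      hypothetical sketch_* names; it deliberately omits RCU and memory
      ordering and only models the state where the flag is already set but
      the pointer is not yet published.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IFF_BRIDGE_PORT 0x1u	/* hypothetical flag value */

struct sketch_port { size_t vlan_info_size; };

struct sketch_dev {
	unsigned int priv_flags;
	struct sketch_port *rx_handler_data;
};

/* Returns the port, or NULL if the device is not (yet fully) a bridge port. */
static struct sketch_port *sketch_port_get_check(const struct sketch_dev *dev)
{
	if (!(dev->priv_flags & SKETCH_IFF_BRIDGE_PORT))
		return NULL;
	return dev->rx_handler_data;	/* may still be NULL mid-registration */
}

static size_t sketch_link_af_size(const struct sketch_dev *dev)
{
	const struct sketch_port *p = sketch_port_get_check(dev);

	if (!p)		/* check the pointer, not just the flag */
		return 0;
	return p->vlan_info_size;
}

/* Usage: the half-registered device no longer causes a NULL dereference. */
static void sketch_bridge_demo(void)
{
	struct sketch_port port = { 64 };
	struct sketch_dev plain = { 0, NULL };
	struct sketch_dev racing = { SKETCH_IFF_BRIDGE_PORT, NULL };
	struct sketch_dev ready = { SKETCH_IFF_BRIDGE_PORT, &port };

	assert(sketch_link_af_size(&plain) == 0);
	assert(sketch_link_af_size(&racing) == 0);
	assert(sketch_link_af_size(&ready) == 64);
}
```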
    • Merge branch 'fragment-stack-oob-read' · 0ab1fa1c
      David S. Miller authored
      Davide Caratti says:
      
      ====================
      fix stack OOB read while fragmenting IPv4 packets
      
      - patch 1/2 fixes openvswitch IPv4 fragmentation, that does a stack OOB
      read after commit d52e5a7e ("ipv4: lock mtu in fnhe when received
      PMTU < net.ipv4.route.min_pmt")
      - patch 2/2 fixes the same issue in TC 'sch_frag' code
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: sch_frag: fix stack OOB read while fragmenting IPv4 packets · 31fe34a0
      Davide Caratti authored
      when 'act_mirred' tries to fragment IPv4 packets that had been previously
      re-assembled using 'act_ct', splats like the following can be observed on
      kernels built with KASAN:
      
       BUG: KASAN: stack-out-of-bounds in ip_do_fragment+0x1b03/0x1f60
       Read of size 1 at addr ffff888147009574 by task ping/947
      
       CPU: 0 PID: 947 Comm: ping Not tainted 5.12.0-rc6+ #418
       Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
       Call Trace:
        <IRQ>
        dump_stack+0x92/0xc1
        print_address_description.constprop.7+0x1a/0x150
        kasan_report.cold.13+0x7f/0x111
        ip_do_fragment+0x1b03/0x1f60
        sch_fragment+0x4bf/0xe40
        tcf_mirred_act+0xc3d/0x11a0 [act_mirred]
        tcf_action_exec+0x104/0x3e0
        fl_classify+0x49a/0x5e0 [cls_flower]
        tcf_classify_ingress+0x18a/0x820
        __netif_receive_skb_core+0xae7/0x3340
        __netif_receive_skb_one_core+0xb6/0x1b0
        process_backlog+0x1ef/0x6c0
        __napi_poll+0xaa/0x500
        net_rx_action+0x702/0xac0
        __do_softirq+0x1e4/0x97f
        do_softirq+0x71/0x90
        </IRQ>
        __local_bh_enable_ip+0xdb/0xf0
        ip_finish_output2+0x760/0x2120
        ip_do_fragment+0x15a5/0x1f60
        __ip_finish_output+0x4c2/0xea0
        ip_output+0x1ca/0x4d0
        ip_send_skb+0x37/0xa0
        raw_sendmsg+0x1c4b/0x2d00
        sock_sendmsg+0xdb/0x110
        __sys_sendto+0x1d7/0x2b0
        __x64_sys_sendto+0xdd/0x1b0
        do_syscall_64+0x33/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f82e13853eb
       Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 42 2c 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
       RSP: 002b:00007ffe01fad888 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 00005571aac13700 RCX: 00007f82e13853eb
       RDX: 0000000000002330 RSI: 00005571aac13700 RDI: 0000000000000003
       RBP: 0000000000002330 R08: 00005571aac10500 R09: 0000000000000010
       R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe01faefb0
       R13: 00007ffe01fad890 R14: 00007ffe01fad980 R15: 00005571aac0f0a0
      
       The buggy address belongs to the page:
       page:000000001dff2e03 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x147009
       flags: 0x17ffffc0001000(reserved)
       raw: 0017ffffc0001000 ffffea00051c0248 ffffea00051c0248 0000000000000000
       raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff888147009400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff888147009480: f1 f1 f1 f1 04 f2 f2 f2 f2 f2 f2 f2 00 00 00 00
       >ffff888147009500: 00 00 00 00 00 00 00 00 00 00 f2 f2 f2 f2 f2 f2
                                                                    ^
        ffff888147009580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff888147009600: 00 00 00 00 00 00 00 00 00 00 00 00 00 f2 f2 f2
      
      for IPv4 packets, sch_fragment() uses a temporary struct dst_entry. Then,
      in the following call graph:
      
        ip_do_fragment()
          ip_skb_dst_mtu()
            ip_dst_mtu_maybe_forward()
              ip_mtu_locked()
      
      the pointer to struct dst_entry is used as pointer to struct rtable: this
      turns the access to struct members like rt_mtu_locked into an OOB read in
      the stack. Fix this by changing the temporary variable used for IPv4 packets
      in sch_fragment(), similarly to what is done for IPv6 a few lines below.
      
      Fixes: c129412f ("net/sched: sch_frag: add generic packet fragment support.")
      Cc: <stable@vger.kernel.org> # 5.11
      Reported-by: Shuang Li <shuali@redhat.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: fix stack OOB read while fragmenting IPv4 packets · 7c0ea593
      Davide Caratti authored
      running openvswitch on kernels built with KASAN, it's possible to see the
      following splat while testing fragmentation of IPv4 packets:
      
       BUG: KASAN: stack-out-of-bounds in ip_do_fragment+0x1b03/0x1f60
       Read of size 1 at addr ffff888112fc713c by task handler2/1367
      
       CPU: 0 PID: 1367 Comm: handler2 Not tainted 5.12.0-rc6+ #418
       Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
       Call Trace:
        dump_stack+0x92/0xc1
        print_address_description.constprop.7+0x1a/0x150
        kasan_report.cold.13+0x7f/0x111
        ip_do_fragment+0x1b03/0x1f60
        ovs_fragment+0x5bf/0x840 [openvswitch]
        do_execute_actions+0x1bd5/0x2400 [openvswitch]
        ovs_execute_actions+0xc8/0x3d0 [openvswitch]
        ovs_packet_cmd_execute+0xa39/0x1150 [openvswitch]
        genl_family_rcv_msg_doit.isra.15+0x227/0x2d0
        genl_rcv_msg+0x287/0x490
        netlink_rcv_skb+0x120/0x380
        genl_rcv+0x24/0x40
        netlink_unicast+0x439/0x630
        netlink_sendmsg+0x719/0xbf0
        sock_sendmsg+0xe2/0x110
        ____sys_sendmsg+0x5ba/0x890
        ___sys_sendmsg+0xe9/0x160
        __sys_sendmsg+0xd3/0x170
        do_syscall_64+0x33/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f957079db07
       Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 eb ec ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 24 ed ff ff 48
       RSP: 002b:00007f956ce35a50 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 0000000000000019 RCX: 00007f957079db07
       RDX: 0000000000000000 RSI: 00007f956ce35ae0 RDI: 0000000000000019
       RBP: 00007f956ce35ae0 R08: 0000000000000000 R09: 00007f9558006730
       R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
       R13: 00007f956ce37308 R14: 00007f956ce35f80 R15: 00007f956ce35ae0
      
       The buggy address belongs to the page:
       page:00000000af2a1d93 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x112fc7
       flags: 0x17ffffc0000000()
       raw: 0017ffffc0000000 0000000000000000 dead000000000122 0000000000000000
       raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       addr ffff888112fc713c is located in stack of task handler2/1367 at offset 180 in frame:
        ovs_fragment+0x0/0x840 [openvswitch]
      
       this frame has 2 objects:
        [32, 144) 'ovs_dst'
        [192, 424) 'ovs_rt'
      
       Memory state around the buggy address:
        ffff888112fc7000: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff888112fc7080: 00 f1 f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00
       >ffff888112fc7100: 00 00 00 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00
                                               ^
        ffff888112fc7180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff888112fc7200: 00 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00
      
      for IPv4 packets, ovs_fragment() uses a temporary struct dst_entry. Then,
      in the following call graph:
      
        ip_do_fragment()
          ip_skb_dst_mtu()
            ip_dst_mtu_maybe_forward()
              ip_mtu_locked()
      
      the pointer to struct dst_entry is used as pointer to struct rtable: this
      turns the access to struct members like rt_mtu_locked into an OOB read in
      the stack. Fix this by changing the temporary variable used for IPv4 packets
      in ovs_fragment(), similarly to what is done for IPv6 a few lines below.
      
      Fixes: d52e5a7e ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmt")
      Cc: <stable@vger.kernel.org>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
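      Both fragmentation fixes above address the same layout problem. The
      consumer casts the dst pointer to the larger routing structure that
      embeds it, so the temporary on the stack must be the larger structure.
      A simplified sketch with hypothetical sketch_* types, where the real
      structures have many more members:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins: the routing entry embeds the dst as first member. */
struct sketch_dst { unsigned int mtu; };

struct sketch_rtable {
	struct sketch_dst dst;		/* must stay the first member */
	unsigned char rt_mtu_locked;
};

/* Consumer assumes every IPv4 dst pointer starts a struct sketch_rtable. */
static unsigned int sketch_mtu_locked(const struct sketch_dst *dst)
{
	const struct sketch_rtable *rt = (const struct sketch_rtable *)dst;

	return rt->rt_mtu_locked;
}

/*
 * Fixed shape: the temporary is a full sketch_rtable, so the cast above
 * stays inside the object. Passing the address of a bare struct sketch_dst
 * instead would make rt_mtu_locked a read past the end of the temporary.
 */
static unsigned int sketch_fragment_fixed(unsigned int mtu)
{
	struct sketch_rtable rt;

	memset(&rt, 0, sizeof(rt));
	rt.dst.mtu = mtu;
	return sketch_mtu_locked(&rt.dst);
}

/* Usage: the zeroed temporary reports "MTU not locked". */
static void sketch_fragment_demo(void)
{
	assert(sketch_fragment_fixed(1400) == 0);
}
```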
    • seg6: add counters support for SRv6 Behaviors · 94604548
      Andrea Mayer authored
      This patch provides counters for SRv6 Behaviors as defined in [1],
      section 6. For each SRv6 Behavior instance, counters defined in [1] are:
      
       - the total number of packets that have been correctly processed;
       - the total amount of traffic in bytes of all packets that have been
         correctly processed;
      
      In addition, this patch introduces a new counter that counts the number of
      packets that have NOT been properly processed (i.e. errors) by an SRv6
      Behavior instance.
      
      Counters are not only interesting for network monitoring purposes (i.e.
      counting the number of packets processed by a given behavior) but they also
      provide a simple tool for checking whether a behavior instance is working
      as we expect or not.
      Counters can be useful for troubleshooting misconfigured SRv6 networks.
      Indeed, an SRv6 Behavior can silently drop packets for very different
      reasons (e.g. wrong SID configuration, interfaces set with SID addresses,
      etc.) without any notification/message to the user.
      
      Due to the nature of SRv6 networks, diagnostic tools such as ping and
      traceroute may be ineffective: paths used for reaching a given router can
      be totally different from the ones followed by probe packets. In addition,
      paths are often asymmetrical and this makes it even more difficult to keep
      up with the journey of the packets and to understand which behaviors are
      actually processing our traffic.
      
      When counters are enabled on an SRv6 Behavior instance, it is possible to
      verify if packets are actually processed by such behavior and what the
      outcome of the processing is. Therefore, the counters for SRv6 Behaviors
      offer a non-invasive observability point which can be leveraged for both
      traffic monitoring and troubleshooting purposes.
      
      [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters
      
      Troubleshooting using SRv6 Behavior counters
      --------------------------------------------
      
      Let's make a brief example to see how helpful counters can be for SRv6
      networks. Let's consider a node where an SRv6 End Behavior receives an SRv6
      packet whose Segment Left (SL) is equal to 0. In this case, the End
      Behavior (which accepts only packets with SL >= 1) discards the packet and
      increases the error counter.
      This information can be leveraged by the network operator for
      troubleshooting. Indeed, the error counter is telling the user that the
      packet:
      
        (i) arrived at the node;
       (ii) the packet has been taken into account by the SRv6 End behavior;
      (iii) but an error has occurred during the processing.
      
      The error (iii) could be caused by different reasons, such as wrong route
      settings on the node or an invalid SID List carried by the SRv6
      packet. In any case, the error counter rules out that the packet did
      not arrive at the node or that it was not processed by the behavior at
      all.
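      The counting rule in this example can be sketched as below: a packet
      the behavior accepts bumps the packet and byte counters, a rejected
      one bumps only the error counter. This is a user-space illustration
      with hypothetical sketch_* names and an assumed error code. It does
      not show how the kernel stores or updates these counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-behavior counters: processed packets, bytes, errors. */
struct sketch_counters {
	uint64_t packets;
	uint64_t bytes;
	uint64_t errors;
};

/* Toy "End" behavior: only accepts packets with Segments Left >= 1. */
static int sketch_end_behavior(unsigned int segments_left)
{
	return segments_left == 0 ? -EINVAL : 0;	/* assumed error code */
}

static int sketch_process(struct sketch_counters *c,
			  unsigned int segments_left, uint64_t len)
{
	int rc = sketch_end_behavior(segments_left);

	if (rc) {
		c->errors++;	/* packet arrived and was handled, but failed */
	} else {
		c->packets++;
		c->bytes += len;
	}
	return rc;
}

/* Usage: one good packet and one packet with SL == 0. */
static void sketch_counters_demo(void)
{
	struct sketch_counters c = { 0, 0, 0 };

	assert(sketch_process(&c, 1, 100) == 0);
	assert(sketch_process(&c, 0, 100) < 0);
	assert(c.packets == 1 && c.bytes == 100 && c.errors == 1);
}
```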
      
      Turning on/off counters for SRv6 Behaviors
      ------------------------------------------
      
      Each SRv6 Behavior instance can be configured, at the time of its creation,
      to make use of counters.
      This is done through iproute2 which allows the user to create an SRv6
      Behavior instance specifying the optional "count" attribute as shown in the
      following example:
      
       $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0
      
      per-behavior counters can be shown by adding "-s" to the iproute2 command
      line, i.e.:
      
       $ ip -s -6 route show 2001:db8::1
       2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Impact of counters for SRv6 Behaviors on performance
      ====================================================
      
      To determine the performance impact due to the introduction of counters in
      the SRv6 Behavior subsystem, we have carried out extensive tests.
      
      We chose to test the throughput achieved by the SRv6 End.DX2 Behavior
      because, among all the other behaviors implemented so far, it reaches the
      highest throughput which is around 1.5 Mpps (per core at 2.4 GHz on a
      Xeon(R) CPU E5-2630 v3) on kernel 5.12-rc2 using packets of size ~ 100
      bytes.
      
      Three different tests were conducted in order to evaluate the overall
      throughput of the SRv6 End.DX2 Behavior in the following scenarios:
      
       1) vanilla kernel (without the SRv6 Behavior counters patch) and a single
          instance of an SRv6 End.DX2 Behavior;
       2) patched kernel with SRv6 Behavior counters and a single instance of
          an SRv6 End.DX2 Behavior with counters turned off;
       3) patched kernel with SRv6 Behavior counters and a single instance of
          SRv6 End.DX2 Behavior with counters turned on.
      
      All tests were performed on a testbed deployed on the CloudLab facilities
      [2], a flexible infrastructure dedicated to scientific research on the
      future of Cloud Computing.
      
      Results of tests are shown in the following table:
      
      Scenario (1): average 1504764,81 pps (~1504,76 kpps); std. dev 3956,82 pps
      Scenario (2): average 1501469,78 pps (~1501,47 kpps); std. dev 2979,85 pps
      Scenario (3): average 1501315,13 pps (~1501,32 kpps); std. dev 2956,00 pps
      
      As can be observed, throughputs achieved in scenarios (2),(3) did not
      suffer any observable degradation compared to scenario (1).
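      As a rough check of the figures above, the relative difference between
      the averages can be computed directly. The helper below is a
      hypothetical sketch, not part of the test setup. Using the reported
      averages, scenarios (2) and (3) are each about 0.2% below scenario
      (1), and the absolute differences are smaller than the standard
      deviation reported for scenario (1).

```c
#include <assert.h>

/* Relative drop of 'measured' versus 'baseline', in percent. */
static double sketch_relative_drop_percent(double baseline, double measured)
{
	return (baseline - measured) / baseline * 100.0;
}

/* Usage with the averages from the table (decimal commas written as dots). */
static void sketch_throughput_demo(void)
{
	const double s1 = 1504764.81, s2 = 1501469.78, s3 = 1501315.13;
	const double s1_stddev = 3956.82;
	double d2 = sketch_relative_drop_percent(s1, s2);
	double d3 = sketch_relative_drop_percent(s1, s3);

	assert(d2 > 0.21 && d2 < 0.23);
	assert(d3 > 0.22 && d3 < 0.24);
	assert(s1 - s2 < s1_stddev);
	assert(s1 - s3 < s1_stddev);
}
```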
      
      Thanks to Jakub Kicinski and David Ahern for their valuable suggestions
      and comments provided during the discussion of the proposed RFCs.
      
      [2] https://www.cloudlab.us

      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next · 9d31d233
      Linus Torvalds authored
      Pull networking updates from Jakub Kicinski:
       "Core:
      
         - bpf:
              - allow bpf programs calling kernel functions (initially to
                reuse TCP congestion control implementations)
              - enable task local storage for tracing programs - remove the
                need to store per-task state in hash maps, and allow tracing
                programs access to task local storage previously added for
                BPF_LSM
              - add bpf_for_each_map_elem() helper, allowing programs to walk
                all map elements in a more robust and easier to verify fashion
              - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
                redirection
              - lpm: add support for batched ops in LPM trie
              - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
                s390 which has floats in its header files
              - improve BPF syscall documentation and extend the use of kdoc
                parsing scripts we already employ for bpf-helpers
              - libbpf, bpftool: support static linking of BPF ELF files
              - improve support for encapsulation of L2 packets
      
         - xdp: restructure redirect actions to avoid a runtime lookup,
           improving performance by 4-8% in microbenchmarks
      
         - xsk: build skb by page (aka generic zerocopy xmit) - improve
           performance of software AF_XDP path by 33% for devices which don't
           need headers in the linear skb part (e.g. virtio)
      
         - nexthop: resilient next-hop groups - improve path stability on
           next-hops group changes (incl. offload for mlxsw)
      
         - ipv6: segment routing: add support for IPv4 decapsulation
      
         - icmp: add support for RFC 8335 extended PROBE messages
      
         - inet: use bigger hash table for IP ID generation
      
         - tcp: deal better with delayed TX completions - make sure we don't
           give up on fast TCP retransmissions only because driver is slow in
           reporting that it completed transmitting the original
      
         - tcp: reorder tcp_congestion_ops for better cache locality
      
         - mptcp:
              - add sockopt support for common TCP options
              - add support for common TCP msg flags
              - include multiple address ids in RM_ADDR
              - add reset option support for resetting one subflow
      
         - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
           co-existence with UDP tunnel GRO, allowing the first to take place
           correctly even for encapsulated UDP traffic
      
         - micro-optimize dev_gro_receive() and flow dissection, avoid
           retpoline overhead on VLAN and TEB GRO
      
         - use less memory for sysctls, add a new sysctl type, to allow using
           u8 instead of "int" and "long" and shrink networking sysctls
      
         - veth: allow GRO without XDP - this allows aggregating UDP packets
           before handing them off to routing, bridge, OvS, etc.
      
         - allow specifying ifindex when device is moved to another namespace
      
         - netfilter:
              - nft_socket: add support for cgroupsv2
              - nftables: add catch-all set element - special element used to
                define a default action in case normal lookup missed
              - use net_generic infra in many modules to avoid allocating
                per-ns memory unnecessarily
      
         - xps: improve the xps handling to avoid potential out-of-bound
           accesses and use-after-free when XPS changes race with other
           re-configuration under traffic
      
         - add a config knob to turn off per-cpu netdev refcnt to catch
           underflows in testing
      
        Device APIs:
      
         - add WWAN subsystem to organize the WWAN interfaces better and
           hopefully start driving towards more unified and vendor-
           independent APIs
      
         - ethtool:
              - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
                support)
              - allow network drivers to dump arbitrary SFP EEPROM data,
                current offset+length API was a poor fit for modern SFPs, which
                define EEPROM in terms of pages (incl. mlx5 support)
      
         - act_police, flow_offload: add support for packet-per-second
           policing (incl. offload for nfp)
      
         - psample: add additional metadata attributes like transit delay for
           packets sampled from switch HW (and corresponding egress and
           policy-based sampling in the mlxsw driver)
      
         - dsa: improve support for sandwiched LAGs with bridge and DSA
      
         - netfilter:
              - flowtable: use direct xmit in topologies with IP forwarding,
                bridging, vlans etc.
              - nftables: counter hardware offload support
      
         - Bluetooth:
              - improvements for firmware download w/ Intel devices
              - add support for reading AOSP vendor capabilities
              - add support for virtio transport driver
      
         - mac80211:
              - allow concurrent monitor iface and ethernet rx decap
              - set priority and queue mapping for injected frames
      
         - phy: add support for Clause-45 PHY Loopback
      
         - pci/iov: add sysfs MSI-X vector assignment interface to distribute
           MSI-X resources to VFs (incl. mlx5 support)
      
        New hardware/drivers:
      
         - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
           Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
           interfaces.
      
         - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
           BCM63xx switches
      
         - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
      
         - ath11k: support for QCN9074, an 802.11ax device
      
         - Bluetooth: Broadcom BCM4330 and BCM4334
      
         - phy: Marvell 88X2222 transceiver support
      
         - mdio: add BCM6368 MDIO mux bus controller
      
         - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
      
         - mana: driver for Microsoft Azure Network Adapter (MANA)
      
         - Actions Semi Owl Ethernet MAC
      
         - can: driver for ETAS ES58X CAN/USB interfaces
      
        Pure driver changes:
      
         - add XDP support to: enetc, igc, stmmac
      
         - add AF_XDP support to: stmmac
      
         - virtio:
              - page_to_skb() use build_skb when there's sufficient tailroom
                (21% improvement for 1000B UDP frames)
              - support XDP even without dedicated Tx queues - share the Tx
                queues with the stack when necessary
      
         - mlx5:
              - flow rules: add support for mirroring with conntrack, matching
                on ICMP, GTP, flex filters and more
              - support packet sampling with flow offloads
              - persist uplink representor netdev across eswitch mode changes
              - allow coexistence of CQE compression and HW time-stamping
              - add ethtool extended link error state reporting
      
         - ice, iavf: support flow filters, UDP Segmentation Offload
      
         - dpaa2-switch:
              - move the driver out of staging
              - add spanning tree (STP) support
              - add rx copybreak support
              - add tc flower hardware offload on ingress traffic
      
         - ionic:
              - implement Rx page reuse
              - support HW PTP time-stamping
      
         - octeon: support TC hardware offloads - flower matching on ingress
           and egress rate limiting.
      
         - stmmac:
              - add RX frame steering based on VLAN priority in tc flower
              - support frame preemption (FPE)
              - intel: add cross time-stamping freq difference adjustment
      
         - ocelot:
              - support forwarding of MRP frames in HW
              - support multiple bridges
              - support PTP Sync one-step timestamping
      
         - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
           learning, flooding etc.
      
         - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
           SC7280 SoCs)
      
         - mt7601u: enable TDLS support
      
         - mt76:
              - add support for 802.3 rx frames (mt7915/mt7615)
              - mt7915 flash pre-calibration support
              - mt7921/mt7663 runtime power management fixes"
      
      * tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
        net: selftest: fix build issue if INET is disabled
        net: netrom: nr_in: Remove redundant assignment to ns
        net: tun: Remove redundant assignment to ret
        net: phy: marvell: add downshift support for M88E1240
        net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
        net/sched: act_ct: Remove redundant ct get and check
        icmp: standardize naming of RFC 8335 PROBE constants
        bpf, selftests: Update array map tests for per-cpu batched ops
        bpf: Add batched ops support for percpu array
        bpf: Implement formatted output helpers with bstr_printf
        seq_file: Add a seq_bprintf function
        sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
        net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
        net: fix a concurrency bug in l2tp_tunnel_register()
        net/smc: Remove redundant assignment to rc
        mpls: Remove redundant assignment to err
        llc2: Remove redundant assignment to rc
        net/tls: Remove redundant initialization of record
        rds: Remove redundant assignment to nr_sig
        dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
        ...
      9d31d233
    • Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 635de956
      Linus Torvalds authored
      Pull x86 tlb updates from Ingo Molnar:
       "The x86 MM changes in this cycle were:
      
         - Implement concurrent TLB flushes, which overlaps the local TLB
           flush with the remote TLB flush.
      
           In testing this improved sysbench performance measurably by a
           couple of percentage points, especially if TLB-heavy security
           mitigations are active.
      
         - Further micro-optimizations to improve the performance of TLB
           flushes"
      
      * tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        smp: Micro-optimize smp_call_function_many_cond()
        smp: Inline on_each_cpu_cond() and on_each_cpu()
        x86/mm/tlb: Remove unnecessary uses of the inline keyword
        cpumask: Mark functions as pure
        x86/mm/tlb: Do not make is_lazy dirty for no reason
        x86/mm/tlb: Privatize cpu_tlbstate
        x86/mm/tlb: Flush remote and local TLBs concurrently
        x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
        x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
        smp: Run functions concurrently in smp_call_function_many_cond()
      635de956
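
The benefit of overlapping the local and remote TLB flushes described in this pull can be illustrated with a toy latency model. This is only a sketch under simplifying assumptions: the function names are hypothetical, durations are in arbitrary units, and the cost of sending the IPIs is ignored. It is not how the kernel measures or implements the flush.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cost model; durations are in arbitrary time units. */

/* Flushes run one after the other: total cost is the sum. */
static uint64_t flush_serial(uint64_t local, uint64_t remote)
{
    return local + remote;
}

/* Local flush runs while remote CPUs flush: total cost is the longer one. */
static uint64_t flush_overlapped(uint64_t local, uint64_t remote)
{
    return local > remote ? local : remote;
}

/* Time saved by overlapping; equals the shorter of the two durations. */
static uint64_t flush_saving(uint64_t local, uint64_t remote)
{
    uint64_t serial = flush_serial(local, remote);
    uint64_t overlapped = flush_overlapped(local, remote);

    assert(serial >= overlapped);
    return serial - overlapped;
}
```

In this model the saving is bounded by the shorter flush, which fits the pull message's observation that the gain is larger when TLB-heavy mitigations make flushes more expensive.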
    • Merge tag 'microblaze-v5.13' of git://git.monstr.eu/linux-2.6-microblaze · d0cc7eca
      Linus Torvalds authored
      Pull Microblaze updates from Michal Simek:
       "No new features, just cleaning up some code and moving to the
        generic syscall solution used by other architectures:
      
         - Switch to generic syscall scripts
      
         - Some small fixes"
      
      * tag 'microblaze-v5.13' of git://git.monstr.eu/linux-2.6-microblaze:
        microblaze: add 'fallthrough' to memcpy/memset/memmove
        microblaze: Fix a typo
        microblaze: tag highmem_setup() with __meminit
        microblaze: syscalls: switch to generic syscallhdr.sh
        microblaze: syscalls: switch to generic syscalltbl.sh
      d0cc7eca
    • Merge tag 'mips_5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 77d51337
      Linus Torvalds authored
      Pull MIPS updates from Thomas Bogendoerfer:
      
       - removed get_fs/set_fs
      
       - removed broken/unmaintained MIPS KVM trap and emulate support
      
       - added support for Loongson-2K1000
      
       - fixes and cleanups
      
      * tag 'mips_5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (107 commits)
        MIPS: BCM63XX: Use BUG_ON instead of condition followed by BUG.
        MIPS: select ARCH_KEEP_MEMBLOCK unconditionally
        mips: Do not include hi and lo in clobber list for R6
        MIPS:DTS:Correct the license for Loongson-2K
        MIPS:DTS:Fix label name and interrupt number of ohci for Loongson-2K
        MIPS: Avoid handcoded DIVU in `__div64_32' altogether
        lib/math/test_div64: Correct the spelling of "dividend"
        lib/math/test_div64: Fix error message formatting
        mips/bootinfo:correct some comments of fw_arg
        MIPS: Avoid DIVU in `__div64_32' is result would be zero
        MIPS: Reinstate platform `__div64_32' handler
        div64: Correct inline documentation for `do_div'
        lib/math: Add a `do_div' test module
        MIPS: Makefile: Replace -pg with CC_FLAGS_FTRACE
        MIPS: pci-legacy: revert "use generic pci_enable_resources"
        MIPS: Loongson64: Add kexec/kdump support
        MIPS: pci-legacy: use generic pci_enable_resources
        MIPS: pci-legacy: remove busn_resource field
        MIPS: pci-legacy: remove redundant info messages
        MIPS: pci-legacy: stop using of_pci_range_to_resource
        ...
      77d51337
    • Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 3644286f
      Linus Torvalds authored
      Pull fsnotify updates from Jan Kara:
      
       - support for limited fanotify functionality for unprivileged users
      
       - faster merging of fanotify events
      
       - a few smaller fsnotify improvements
      
      * tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        shmem: allow reporting fanotify events with file handles on tmpfs
        fs: introduce a wrapper uuid_to_fsid()
        fanotify_user: use upper_32_bits() to verify mask
        fanotify: support limited functionality for unprivileged users
        fanotify: configurable limits via sysfs
        fanotify: limit number of event merge attempts
        fsnotify: use hash table for faster events merge
        fanotify: mix event info and pid into merge key hash
        fanotify: reduce event objectid to 29-bit hash
        fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
      3644286f
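
The "faster merging" items in this pull (a hash table for merge lookups and a cap on merge attempts) can be sketched roughly as follows. This is a minimal userspace illustration, not the kernel's code: the struct layout, the bucket count, the attempt cap, and all function names are assumptions made for the example, and a plain 32-bit key stands in for the merge key hash mentioned in the shortlog.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Every name and constant here is hypothetical, chosen for illustration. */
#define MERGE_HASH_BITS    7
#define MERGE_HASH_SIZE    (1u << MERGE_HASH_BITS)  /* 128 buckets */
#define MAX_MERGE_ATTEMPTS 128                      /* cap on chain walk */
#define MAX_EVENTS         1024

struct event {
    uint32_t key;   /* stands in for the merge key hash */
    uint32_t mask;  /* OR of all merged event bits */
    int next;       /* next event index in the same bucket, -1 = end */
};

struct queue {
    struct event events[MAX_EVENTS];
    size_t count;
    int buckets[MERGE_HASH_SIZE];  /* newest event per bucket, -1 = empty */
};

static void queue_init(struct queue *q)
{
    assert(q != NULL);
    q->count = 0;
    for (size_t i = 0; i < MERGE_HASH_SIZE; i++)
        q->buckets[i] = -1;
}

/* Multiplicative hash: keep the top MERGE_HASH_BITS bits of the product. */
static unsigned int bucket_of(uint32_t key)
{
    return (uint32_t)(key * 0x61C88647u) >> (32 - MERGE_HASH_BITS);
}

/* Returns 1 if merged into an existing event, 0 if queued as new, -1 if full. */
static int queue_add(struct queue *q, uint32_t key, uint32_t mask)
{
    unsigned int b = bucket_of(key);
    int attempts = 0;

    /* Walk only this bucket's chain, newest first, and give up after the cap. */
    for (int i = q->buckets[b]; i >= 0; i = q->events[i].next) {
        if (++attempts > MAX_MERGE_ATTEMPTS)
            break;
        if (q->events[i].key == key) {
            q->events[i].mask |= mask;
            return 1;
        }
    }

    if (q->count == MAX_EVENTS)
        return -1;

    /* No merge candidate found: queue a new event at the head of the bucket. */
    struct event *e = &q->events[q->count];
    e->key = key;
    e->mask = mask;
    e->next = q->buckets[b];
    q->buckets[b] = (int)q->count++;
    return 0;
}
```

Calling `queue_add()` twice with the same key leaves one queued event whose mask is the OR of both calls. The point of the design is that a lookup touches one bucket's chain instead of the whole queue, and the attempt cap bounds the worst case when many events land in the same bucket.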
    • Merge tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 767fcbc8
      Linus Torvalds authored
      Pull quota, ext2, reiserfs updates from Jan Kara:
      
       - support for path (instead of device) based quotactl syscall
         (quotactl_path(2))
      
       - ext2 conversion to kmap_local()
      
       - other minor cleanups & fixes
      
      * tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        fs/reiserfs/journal.c: delete useless variables
        fs/ext2: Replace kmap() with kmap_local_page()
        ext2: Match up ext2_put_page() with ext2_dotdot() and ext2_find_entry()
        fs/ext2/: fix misspellings using codespell tool
        quota: report warning limits for realtime space quotas
        quota: wire up quotactl_path
        quota: Add mountpath based quota support
      767fcbc8
    • Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · d2b6f8a1
      Linus Torvalds authored
      Pull xfs updates from Darrick Wong:
       "The notable user-visible addition this cycle is the ability to remove
        space from the last AG in a filesystem. This is the first of many
        changes needed for full-fledged support for shrinking a filesystem.
        Still needed are (a) the ability to reorganize files and metadata away
        from the end of the fs; (b) the ability to remove entire allocation
        groups; (c) shrink support for realtime volumes; and (d) thorough
        testing of (a-c).
      
        There are a number of performance improvements in this code drop: Dave
        streamlined various parts of the buffer logging code and reduced the
        cost of various debugging checks, and added the ability to pre-create
        the xattr structures while creating files. Brian eliminated
        transaction reservations that were being held across writeback (thus
        reducing livelock potential).
      
        Other random pieces: Pavel fixed the repetitive warnings about
        deprecated mount options; I fixed online fsck to behave itself when a
        readonly remount comes in during scrub, and refactored various other
        parts of that code; Christoph contributed a lot of refactoring this
        cycle. The xfs_icdinode structure has been absorbed into the (incore)
        xfs_inode structure, and the format and flags handling around
        xfs_inode_fork structures has been simplified. Chandan provided a
        number of fixes for extent count overflow related problems that have
        been shaken out by debugging knobs added during 5.12.
      
        Summary:
      
         - Various minor fixes in online scrub.
      
         - Prevent metadata files from being automatically inactivated.
      
         - Validate btree heights by the computed per-btree limits.
      
         - Don't warn about remounting with deprecated mount options.
      
         - Initialize attr forks at create time if we suspect we're going to
           need to store them.
      
         - Reduce memory reallocation workouts in the logging code.
      
         - Fix some theoretical math calculation errors in logged buffers that
           span multiple discontig memory ranges but contiguous ondisk
           regions.
      
         - Speedups in dirty buffer bitmap handling.
      
         - Make type verifier functions more inline-happy to reduce overhead.
      
         - Reduce debug overhead in directory checking code.
      
         - Many many typo fixes.
      
         - Begin to handle the permanent loss of the very end of a filesystem.
      
         - Fold struct xfs_icdinode into xfs_inode.
      
         - Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx
           ioctl.
      
         - Remove a broken directory block format check from online scrub.
      
         - Fix a bug where we could produce an unnecessarily tall data fork
           btree when creating an attr fork.
      
         - Fix scrub and readonly remounts racing.
      
         - Fix a writeback ioend log deadlock problem by dropping the behavior
           where we could preallocate a setfilesize transaction.
      
         - Fix some bugs in the new extent count checking code.
      
         - Fix some bugs in the attr fork preallocation code.
      
         - Refactor if_flags out of the incore inode fork data structure"
      
      * tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (77 commits)
        xfs: remove xfs_quiesce_attr declaration
        xfs: remove XFS_IFEXTENTS
        xfs: remove XFS_IFINLINE
        xfs: remove XFS_IFBROOT
        xfs: only look at the fork format in xfs_idestroy_fork
        xfs: simplify xfs_attr_remove_args
        xfs: rename and simplify xfs_bmap_one_block
        xfs: move the XFS_IFEXTENTS check into xfs_iread_extents
        xfs: drop unnecessary setfilesize helper
        xfs: drop unused ioend private merge and setfilesize code
        xfs: open code ioend needs workqueue helper
        xfs: drop submit side trans alloc for append ioends
        xfs: fix return of uninitialized value in variable error
        xfs: get rid of the ip parameter to xchk_setup_*
        xfs: fix scrub and remount-ro protection when running scrub
        xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks
        xfs: move the xfs_can_free_eofblocks call under the IOLOCK
        xfs: precalculate default inode attribute offset
        xfs: default attr fork size does not handle device inodes
        xfs: inode fork allocation depends on XFS_IFEXTENT flag
        ...
      d2b6f8a1
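
The shrink limitation described in this pull (space can be removed from the last AG only, and whole allocation groups cannot yet be removed) amounts to a simple bounds check on the requested size. The sketch below assumes a simplified geometry struct and a made-up minimum size for the last AG; none of these names or values are taken from the XFS sources, and a real shrink additionally needs the removed blocks to be free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified geometry; not the real XFS superblock layout. */
struct fs_geom {
    uint64_t dblocks;   /* total data blocks in the filesystem */
    uint32_t agblocks;  /* blocks per full allocation group (AG) */
    uint32_t agcount;   /* number of allocation groups */
};

/* Assumed floor on the size of the last AG, for illustration only. */
#define MIN_LAST_AG_BLOCKS 64u

/* First block number that belongs to the last AG. */
static uint64_t last_ag_start(const struct fs_geom *g)
{
    return (uint64_t)(g->agcount - 1) * g->agblocks;
}

/* True if new_dblocks is a shrink that only trims the tail of the last AG. */
static bool shrink_within_last_ag(const struct fs_geom *g, uint64_t new_dblocks)
{
    assert(g->agcount > 0 && g->agblocks > 0);

    if (new_dblocks >= g->dblocks)
        return false;  /* same size or a grow, not a shrink */

    /* The last AG must survive with at least the assumed minimum size. */
    return new_dblocks >= last_ag_start(g) + MIN_LAST_AG_BLOCKS;
}
```

With four AGs of 1000 blocks each, a request for 3500 blocks stays inside the last AG and passes, while a request for 2500 blocks would remove the whole last AG and is rejected, matching item (b) in the list of work still needed.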
    • Merge tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · f2c80837
      Linus Torvalds authored
      Pull gfs2 updates from Andreas Gruenbacher:
      
       - Fix some compiler and kernel-doc warnings
      
       - Various minor cleanups and optimizations
      
       - Add a new sysfs gfs2 status file with some filesystem wide
         information
      
      * tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Fix fall-through warnings for Clang
        gfs2: Fix a number of kernel-doc warnings
        gfs2: Make gfs2_setattr_simple static
        gfs2: Add new sysfs file for gfs2 status
        gfs2: Silence possible null pointer dereference warning
        gfs2: Turn gfs2_meta_indirect_buffer into gfs2_meta_buffer
        gfs2: Replace gfs2_lblk_to_dblk with gfs2_get_extent
        gfs2: Turn gfs2_extent_map into gfs2_{get,alloc}_extent
        gfs2: Add new gfs2_iomap_get helper
        gfs2: Remove unused variable sb_format
        gfs2: Fix dir.c function parameter descriptions
        gfs2: Eliminate gh parameter from go_xmote_bh func
        gfs2: don't create empty buffers for NO_CREATE
      f2c80837