1. 06 May, 2022 15 commits
    • Eric Dumazet's avatar
      netlink: do not reset transport header in netlink_recvmsg() · d5076fe4
      Eric Dumazet authored
      netlink_recvmsg() does not need to change transport header.
      
      If transport header was needed, it should have been reset
      by the producer (netlink_dump()), not the consumer(s).
      
      The following trace probably happened when multiple threads
      were using MSG_PEEK.
      
      BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg
      
      write to 0xffff88811e9f15b2 of 2 bytes by task 32012 on cpu 1:
       skb_reset_transport_header include/linux/skbuff.h:2760 [inline]
       netlink_recvmsg+0x1de/0x790 net/netlink/af_netlink.c:1978
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       __sys_recvfrom+0x204/0x2c0 net/socket.c:2097
       __do_sys_recvfrom net/socket.c:2115 [inline]
       __se_sys_recvfrom net/socket.c:2111 [inline]
       __x64_sys_recvfrom+0x74/0x90 net/socket.c:2111
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff88811e9f15b2 of 2 bytes by task 32005 on cpu 0:
       skb_reset_transport_header include/linux/skbuff.h:2760 [inline]
       netlink_recvmsg+0x1de/0x790 net/netlink/af_netlink.c:1978
       ____sys_recvmsg+0x162/0x2f0
       ___sys_recvmsg net/socket.c:2674 [inline]
       __sys_recvmsg+0x209/0x3f0 net/socket.c:2704
       __do_sys_recvmsg net/socket.c:2714 [inline]
       __se_sys_recvmsg net/socket.c:2711 [inline]
       __x64_sys_recvmsg+0x42/0x50 net/socket.c:2711
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xffff -> 0x0000
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 32005 Comm: syz-executor.4 Not tainted 5.18.0-rc1-syzkaller-00328-ge1f700eb-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220505161946.2867638-1-eric.dumazet@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d5076fe4
    • Lokesh Dhoundiyal's avatar
      ipv4: drop dst in multicast routing path · 9e6c6d17
      Lokesh Dhoundiyal authored
      kmemleak reports the following when routing multicast traffic over an
      ipsec tunnel.
      
      Kmemleak output:
      unreferenced object 0x8000000044bebb00 (size 256):
        comm "softirq", pid 0, jiffies 4294985356 (age 126.810s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 80 00 00 00 05 13 74 80  ..............t.
          80 00 00 00 04 9b bf f9 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000f83947e0>] __kmalloc+0x1e8/0x300
          [<00000000b7ed8dca>] metadata_dst_alloc+0x24/0x58
          [<0000000081d32c20>] __ipgre_rcv+0x100/0x2b8
          [<00000000824f6cf1>] gre_rcv+0x178/0x540
          [<00000000ccd4e162>] gre_rcv+0x7c/0xd8
          [<00000000c024b148>] ip_protocol_deliver_rcu+0x124/0x350
          [<000000006a483377>] ip_local_deliver_finish+0x54/0x68
          [<00000000d9271b3a>] ip_local_deliver+0x128/0x168
          [<00000000bd4968ae>] xfrm_trans_reinject+0xb8/0xf8
          [<0000000071672a19>] tasklet_action_common.isra.16+0xc4/0x1b0
          [<0000000062e9c336>] __do_softirq+0x1fc/0x3e0
          [<00000000013d7914>] irq_exit+0xc4/0xe0
          [<00000000a4d73e90>] plat_irq_dispatch+0x7c/0x108
          [<000000000751eb8e>] handle_int+0x16c/0x178
          [<000000001668023b>] _raw_spin_unlock_irqrestore+0x1c/0x28
      
      The metadata dst is leaked when ip_route_input_mc() updates the dst for
      the skb. Commit f38a9eb1 ("dst: Metadata destinations") correctly
      handled dropping the dst in ip_route_input_slow() but missed the
      multicast case which is handled by ip_route_input_mc(). Drop the dst in
      ip_route_input_mc() avoiding the leak.
      
      Fixes: f38a9eb1 ("dst: Metadata destinations")
      Signed-off-by: default avatarLokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz>
      Signed-off-by: default avatarChris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220505020017.3111846-1-chris.packham@alliedtelesis.co.nzSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      9e6c6d17
    • Jakub Kicinski's avatar
      Merge branch 'ocelot-vcap-fixes' · c88d3908
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Ocelot VCAP fixes
      
      Changes in v2:
      fix the NPDs and UAFs caused by filter->trap_list in a more robust way
      that actually does not introduce bugs of its own (1/5)
      
      This series fixes issues found while running
      tools/testing/selftests/net/forwarding/tc_actions.sh on the ocelot
      switch:
      
      - NULL pointer dereference when failing to offload a filter
      - NULL pointer dereference after deleting a trap
      - filters still having effect after being deleted
      - dropped packets still being seen by software
      - statistics counters showing double the amount of hits
      - statistics counters showing inexistent hits
      - invalid configurations not rejected
      ====================
      
      Link: https://lore.kernel.org/r/20220504235503.4161890-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c88d3908
    • Vladimir Oltean's avatar
      net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters · 93a84170
      Vladimir Oltean authored
      Given the following order of operations:
      
      (1) we add filter A using tc-flower
      (2) we send a packet that matches it
      (3) we read the filter's statistics to find a hit count of 1
      (4) we add a second filter B with a higher preference than A, and A
          moves one position to the right to make room in the TCAM for it
      (5) we send another packet, and this matches the second filter B
      (6) we read the filter statistics again.
      
      When this happens, the hit count of filter A is 2 and of filter B is 1,
      despite a single packet having matched each filter.
      
      Furthermore, in an alternate history, reading the filter stats a second
      time between steps (3) and (4) makes the hit count of filter A remain at
      1 after step (6), as expected.
      
      The reason why this happens has to do with the filter->stats.pkts field,
      which is written to hardware through the call path below:
      
                     vcap_entry_set
                     /      |      \
                    /       |       \
                   /        |        \
                  /         |         \
      es0_entry_set   is1_entry_set   is2_entry_set
                  \         |         /
                   \        |        /
                    \       |       /
              vcap_data_set(data.counter, ...)
      
      The primary role of filter->stats.pkts is to transport the filter hit
      counters from the last readout all the way from vcap_entry_get() ->
      ocelot_vcap_filter_stats_update() -> ocelot_cls_flower_stats().
      The reason why vcap_entry_set() writes it to hardware is so that the
      counters (saturating and having a limited bit width) are cleared
      after each user space readout.
      
      The writing of filter->stats.pkts to hardware during the TCAM entry
      movement procedure is an unintentional consequence of the code design,
      because the hit count isn't up to date at this point.
      
      So at step (4), when filter A is moved by ocelot_vcap_filter_add() to
      make room for filter B, the hardware hit count is 0 (no packet matched
      on it in the meantime), but filter->stats.pkts is 1, because the last
      readout saw the earlier packet. The movement procedure programs the old
      hit count back to hardware, so this creates the impression to user space
      that more packets have been matched than they really were.
      
      The bug can be seen when running the gact_drop_and_ok_test() from the
      tc_actions.sh selftest.
      
      Fix the issue by reading back the hit count to tmp->stats.pkts before
      migrating the VCAP filter. Sure, this is a best-effort technique, since
      the packets that hit the rule between vcap_entry_get() and
      vcap_entry_set() won't be counted, but at least it allows the counters
      to be reliably used for selftests where the traffic is under control.
      
      The vcap_entry_get() name is a bit unintuitive, but it only reads back
      the counter portion of the TCAM entry, not the entire entry.
      
      The index from which we retrieve the counter is also a bit unintuitive
      (i - 1 during add, i + 1 during del), but this is the way in which TCAM
      entry movement works. The "entry index" isn't a stored integer for a
      TCAM filter, instead it is dynamically computed by
      ocelot_vcap_block_get_filter_index() based on the entry's position in
      the &block->rules list. That position (as well as block->count) is
      automatically updated by ocelot_vcap_filter_add_to_block() on add, and
      by ocelot_vcap_block_remove_filter() on del. So "i" is the new filter
      index, and "i - 1" or "i + 1" respectively are the old addresses of that
      TCAM entry (we only support installing/deleting one filter at a time).
      
      Fixes: b5962294 ("net: mscc: ocelot: Add support for tcam")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      93a84170
    • Vladimir Oltean's avatar
      net: mscc: ocelot: restrict tc-trap actions to VCAP IS2 lookup 0 · 477d2b91
      Vladimir Oltean authored
      Once the CPU port was added to the destination port mask of a packet, it
      can never be cleared, so even packets marked as dropped by the MASK_MODE
      of a VCAP IS2 filter will still reach it. This is why we need the
      OCELOT_POLICER_DISCARD to "kill dropped packets dead" and make software
      stop seeing them.
      
      We disallow policer rules from being put on any other chain than the one
      for the first lookup, but we don't do this for "drop" rules, although we
      should. This change is merely ascertaining that the rules dont't
      (completely) work and letting the user know.
      
      The blamed commit is the one that introduced the multi-chain architecture
      in ocelot. Prior to that, we should have always offloaded the filters to
      VCAP IS2 lookup 0, where they did work.
      
      Fixes: 1397a2eb ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      477d2b91
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix VCAP IS2 filters matching on both lookups · 6741e118
      Vladimir Oltean authored
      The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
      configured to only match during the first, second lookup, or both, or
      none.
      
      The blamed commit wrote the code for making VCAP IS2 filters match only
      on the given lookup. But right below that code, there was another line
      that explicitly made the lookup a "don't care", and this is overwriting
      the lookup we've selected. So the code had no effect.
      
      Some of the more noticeable effects of having filters match on both
      lookups:
      
      - in "tc -s filter show dev swp0 ingress", we see each packet matching a
        VCAP IS2 filter counted twice. This throws off scripts such as
        tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
        fail.
      
      - a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
        because once the CPU port becomes a member of the destination port
        mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
        with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
        the action vector can only appear in the first lookup. What happens
        when a filter matches both lookups is that the action vector is
        combined, and this makes the POLICE_ENA bit ineffective, since the
        last lookup in which it has appeared is the second one. In other
        words, "tc-drop" actions do not drop packets for the CPU port, dropped
        packets are still seen by software unless there was an FDB entry that
        directed those packets to some other place different from the CPU.
      
      The last bit used to work, because in the initial commit b5962294
      ("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
      field of the VCAP IS2 half key with a 1, not with a "don't care".
      The change to "don't care" was made inadvertently by me in commit
      c1c3993e ("net: mscc: ocelot: generalize existing code for VCAP"),
      which I just realized, and which needs a separate fix from this one,
      for "stable" kernels that lack the commit blamed below.
      
      Fixes: 226e9cd8 ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6741e118
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted · 16bbebd3
      Vladimir Oltean authored
      ocelot_vcap_filter_del() works by moving the next filters over the
      current one, and then deleting the last filter by calling vcap_entry_set()
      with a del_filter which was specially created by memsetting its memory
      to zeroes. vcap_entry_set() then programs this to the TCAM and action
      RAM via the cache registers.
      
      The problem is that vcap_entry_set() is a dispatch function which looks
      at del_filter->block_id. But since del_filter is zeroized memory, the
      block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
      is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
      or IS2.
      
      The code was not always like this. vcap_entry_set() used to simply be
      is2_entry_set(), and then, the logic used to work.
      
      Restore the functionality by populating the block_id of the del_filter
      based on the VCAP block of the filter that we're deleting. This makes
      vcap_entry_set() know what to do.
      
      Fixes: 1397a2eb ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      16bbebd3
    • Vladimir Oltean's avatar
      net: mscc: ocelot: mark traps with a bool instead of keeping them in a list · e1846cff
      Vladimir Oltean authored
      Since the blamed commit, VCAP filters can appear on more than one list.
      If their action is "trap", they are chained on ocelot->traps via
      filter->trap_list. This is in addition to their normal placement on the
      VCAP block->rules list head.
      
      Therefore, when we free a VCAP filter, we must remove it from all lists
      it is a member of, including ocelot->traps.
      
      There are at least 2 bugs which are direct consequences of this design
      decision.
      
      First is the incorrect usage of list_empty(), meant to denote whether
      "filter" is chained into ocelot->traps via filter->trap_list.
      This does not do the correct thing, because list_empty() checks whether
      "head->next == head", but in our case, head->next == head->prev == NULL.
      So we dereference NULL pointers and die when we call list_del().
      
      Second is the fact that not all places that should remove the filter
      from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
      which is where we have the main kfree(filter). By keeping freed filters
      in ocelot->traps we end up in a use-after-free in
      felix_update_trapping_destinations().
      
      Attempting to fix all the buggy patterns is a whack-a-mole game which
      makes the driver unmaintainable. Actually this is what the previous
      patch version attempted to do:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
      
      but it introduced another set of bugs, because there are other places in
      which create VCAP filters, not just ocelot_vcap_filter_create():
      
      - ocelot_trap_add()
      - felix_tag_8021q_vlan_add_rx()
      - felix_tag_8021q_vlan_add_tx()
      
      Relying on the convention that all those code paths must call
      INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
      
      So let's do what should have been done in the first place and keep a
      bool in struct ocelot_vcap_filter which denotes whether we are looking
      at a trapping rule or not. Iterating now happens over the main VCAP IS2
      block->rules. The advantage is that we no longer risk having stale
      references to a freed filter, since it is only present in that list.
      
      Fixes: e42bd4ed ("net: mscc: ocelot: keep traps in a list")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e1846cff
    • Jonathan Toppins's avatar
      MAINTAINERS: add missing files for bonding definition · 4e707344
      Jonathan Toppins authored
      The bonding entry did not include additional include files that have
      been added nor did it reference the documentation. Add these references
      for completeness.
      Signed-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Link: https://lore.kernel.org/r/903ed2906b93628b38a2015664a20d2802042863.1651690748.git.jtoppins@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4e707344
    • Tariq Toukan's avatar
      net: Fix features skip in for_each_netdev_feature() · 85db6352
      Tariq Toukan authored
      The find_next_netdev_feature() macro gets the "remaining length",
      not bit index.
      Passing "bit - 1" for the following iteration is wrong as it skips
      the adjacent bit. Pass "bit" instead.
      
      Fixes: 3b89ea9c ("net: Fix for_each_netdev_feature on Big endian")
      Signed-off-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Reviewed-by: default avatarGal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      85db6352
    • Jakub Kicinski's avatar
      Merge branch 'vrf-fix-address-binding-with-icmp-socket' · 690447a2
      Jakub Kicinski authored
      Nicolas Dichtel says:
      
      ====================
      vrf: fix address binding with icmp socket
      
      The first patch fixes the issue.
      The second patch adds related tests in selftests.
      ====================
      
      Link: https://lore.kernel.org/r/20220504090739.21821-1-nicolas.dichtel@6wind.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      690447a2
    • Nicolas Dichtel's avatar
      selftests: add ping test with ping_group_range tuned · e71b7f1f
      Nicolas Dichtel authored
      The 'ping' utility is able to manage two kind of sockets (raw or icmp),
      depending on the sysctl ping_group_range. By default, ping_group_range is
      set to '1 0', which forces ping to use an ip raw socket.
      
      Let's replay the ping tests by allowing 'ping' to use the ip icmp socket.
      After the previous patch, ipv4 tests results are the same with both kinds
      of socket. For ipv6, there are a lot a new failures (the previous patch
      fixes only two cases).
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e71b7f1f
    • Nicolas Dichtel's avatar
      ping: fix address binding wrt vrf · e1a7ac6f
      Nicolas Dichtel authored
      When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
      instead of an IP raw socket. In this case, 'ping' is unable to bind its
      socket to a local address owned by a vrflite.
      
      Before the patch:
      $ sysctl -w net.ipv4.ping_group_range='0  2147483647'
      $ ip link add blue type vrf table 10
      $ ip link add foo type dummy
      $ ip link set foo master blue
      $ ip link set foo up
      $ ip addr add 192.168.1.1/24 dev foo
      $ ip addr add 2001::1/64 dev foo
      $ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
      ping: bind: Cannot assign requested address
      $ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
      ping6: bind icmp socket: Cannot assign requested address
      
      CC: stable@vger.kernel.org
      Fixes: 1b69c6d0 ("net: Introduce L3 Master device abstraction")
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e1a7ac6f
    • Fabio Estevam's avatar
      net: phy: micrel: Pass .probe for KS8737 · 15f03ffe
      Fabio Estevam authored
      Since commit f1131b9c ("net: phy: micrel: use
      kszphy_suspend()/kszphy_resume for irq aware devices") the kszphy_suspend/
      resume hooks are used.
      
      These functions require the probe function to be called so that
      priv can be allocated.
      
      Otherwise, a NULL pointer dereference happens inside
      kszphy_config_reset().
      
      Cc: stable@vger.kernel.org
      Fixes: f1131b9c ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
      Reported-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarFabio Estevam <festevam@denx.de>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220504143104.1286960-2-festevam@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      15f03ffe
    • Fabio Estevam's avatar
      net: phy: micrel: Do not use kszphy_suspend/resume for KSZ8061 · e333eed6
      Fabio Estevam authored
      Since commit f1131b9c ("net: phy: micrel: use
      kszphy_suspend()/kszphy_resume for irq aware devices") the following
      NULL pointer dereference is observed on a board with KSZ8061:
      
       # udhcpc -i eth0
      udhcpc: started, v1.35.0
      8<--- cut here ---
      Unable to handle kernel NULL pointer dereference at virtual address 00000008
      pgd = f73cef4e
      [00000008] *pgd=00000000
      Internal error: Oops: 5 [#1] SMP ARM
      Modules linked in:
      CPU: 0 PID: 196 Comm: ifconfig Not tainted 5.15.37-dirty #94
      Hardware name: Freescale i.MX6 SoloX (Device Tree)
      PC is at kszphy_config_reset+0x10/0x114
      LR is at kszphy_resume+0x24/0x64
      ...
      
      The KSZ8061 phy_driver structure does not have the .probe/..driver_data
      fields, which means that priv is not allocated.
      
      This causes the NULL pointer dereference inside kszphy_config_reset().
      
      Fix the problem by using the generic suspend/resume functions as before.
      
      Another alternative would be to provide the .probe and .driver_data
      information into the structure, but to be on the safe side, let's
      just restore Ethernet functionality by using the generic suspend/resume.
      
      Cc: stable@vger.kernel.org
      Fixes: f1131b9c ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
      Signed-off-by: default avatarFabio Estevam <festevam@denx.de>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220504143104.1286960-1-festevam@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e333eed6
  2. 05 May, 2022 19 commits
    • Tetsuo Handa's avatar
      net: rds: use maybe_get_net() when acquiring refcount on TCP sockets · 6997fbd7
      Tetsuo Handa authored
      Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
      delayed works queued in rds_wq might be invoked after a net namespace's
      refcount already reached 0.
      
      Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
      it is guaranteed that we can instead use maybe_get_net() from delayed work
      functions until rds_tcp_exit_net() returns.
      
      Note that I'm not convinced that all works which might access a net
      namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
      calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
      will fail to wait for work functions, and kmem_cache_free() could be
      called from net_free() before maybe_get_net() is called from
      rds_tcp_tune().
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 3a58f13a ("net: rds: acquire refcount on TCP sockets")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jpSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6997fbd7
    • Linus Torvalds's avatar
      Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 68533eb1
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from can, rxrpc and wireguard.
      
        Previous releases - regressions:
      
         - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
      
         - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
      
         - rds: acquire netns refcount on TCP sockets
      
         - rxrpc: enable IPv6 checksums on transport socket
      
         - nic: hinic: fix bug of wq out of bound access
      
         - nic: thunder: don't use pci_irq_vector() in atomic context
      
         - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
           flag
      
         - nic: mlx5e:
            - lag, fix use-after-free in fib event handler
            - fix deadlock in sync reset flow
      
        Previous releases - always broken:
      
         - tcp: fix insufficient TCP source port randomness
      
         - can: grcan: grcan_close(): fix deadlock
      
         - nfc: reorder destructive operations in to avoid bugs
      
        Misc:
      
         - wireguard: improve selftests reliability"
      
      * tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
        NFC: netlink: fix sleep in atomic bug when firmware download timeout
        selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
        tcp: drop the hash_32() part from the index calculation
        tcp: increase source port perturb table to 2^16
        tcp: dynamically allocate the perturb table used by source ports
        tcp: add small random increments to the source port
        tcp: resalt the secret every 10 seconds
        tcp: use different parts of the port_offset for index and offset
        secure_seq: use the 64 bits of the siphash for port offset calculation
        wireguard: selftests: set panic_on_warn=1 from cmdline
        wireguard: selftests: bump package deps
        wireguard: selftests: restore support for ccache
        wireguard: selftests: use newer toolchains to fill out architectures
        wireguard: selftests: limit parallelism to $(nproc) tests at once
        wireguard: selftests: make routing loop test non-fatal
        net/mlx5: Fix matching on inner TTC
        net/mlx5: Avoid double clear or set of sync reset requested
        net/mlx5: Fix deadlock in sync reset flow
        net/mlx5e: Fix trust state reset in reload
        net/mlx5e: Avoid checking offload capability in post_parse action
        ...
      68533eb1
    • Duoming Zhou's avatar
      NFC: netlink: fix sleep in atomic bug when firmware download timeout · 4071bf12
      Duoming Zhou authored
      There are sleep in atomic bug that could cause kernel panic during
      firmware download process. The root cause is that nlmsg_new with
      GFP_KERNEL parameter is called in fw_dnld_timeout which is a timer
      handler. The call trace is shown below:
      
      BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
      Call Trace:
      kmem_cache_alloc_node
      __alloc_skb
      nfc_genl_fw_download_done
      call_timer_fn
      __run_timers.part.0
      run_timer_softirq
      __do_softirq
      ...
      
      The nlmsg_new with GFP_KERNEL parameter may sleep during memory
      allocation process, and the timer handler is run as the result of
      a "software interrupt" that should not call any other function
      that could sleep.
      
      This patch changes allocation mode of netlink message from GFP_KERNEL
      to GFP_ATOMIC in order to prevent sleep in atomic bug. The GFP_ATOMIC
      flag makes memory allocation operation could be used in atomic context.
      
      Fixes: 9674da87 ("NFC: Add firmware upload netlink command")
      Fixes: 9ea7187c ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cnSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      4071bf12
    • Vladimir Oltean's avatar
      selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer · 5a7c5f70
      Vladimir Oltean authored
      As discussed here with Ido Schimmel:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/
      
      the default conform-exceed action is "reclassify", for a reason we don't
      really understand.
      
      The point is that hardware can't offload that police action, so not
      specifying "conform-exceed" was always wrong, even though the command
      used to work in hardware (but not in software) until the kernel started
      adding validation for it.
      
      Fix the command used by the selftest by making the policer drop on
      exceed, and pass the packet to the next action (goto) on conform.
      
      Fixes: 8cd6b020 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5a7c5f70
    • Jakub Kicinski's avatar
      Merge branch 'insufficient-tcp-source-port-randomness' · ef562489
      Jakub Kicinski authored
      Willy Tarreau says:
      
      ====================
      insufficient TCP source port randomness
      
      In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad
      report being able to accurately identify a client by forcing it to emit
      only 40 times more connections than the number of entries in the
      table_perturb[] table, which is indexed by hashing the connection tuple.
      The current 2^8 setting allows them to perform that attack with only 10k
      connections, which is not hard to achieve in a few seconds.
      
      Eric, Amit and I have been working on this for a few weeks now imagining,
      testing and eliminating a number of approaches that Amit and his team were
      still able to break or that were found to be too risky or too expensive,
      and ended up with the simple improvements in this series that resists to
      the attack, doesn't degrade the performance, and preserves a reliable port
      selection algorithm to avoid connection failures, including the odd/even
      port selection preference that allows bind() to always find a port quickly
      even under strong connect() stress.
      
      The approach relies on several factors:
        - resalting the hash secret that's used to choose the table_perturb[]
          entry every 10 seconds to eliminate slow attacks and force the
          attacker to forget everything that was learned after this delay.
          This already eliminates most of the problem because if a client
          stays silent for more than 10 seconds there's no link between the
          previous and the next patterns, and 10s isn't yet frequent enough
          to cause too frequent repetition of a same port that may induce a
          connection failure ;
      
        - adding small random increments to the source port. Previously, a
          random 0 or 1 was added every 16 ports. Now a random 0 to 7 is
          added after each port. This means that with the default 32768-60999
          range, a worst case rollover happens after 1764 connections, and
          an average of 3137. This doesn't stop statistical attacks but
          requires significantly more iterations of the same attack to
          confirm a guess.
      
        - increasing the table_perturb[] size from 2^8 to 2^16, which Amit
          says will require 2.6 million connections to be attacked with the
          changes above, making it pointless to get a fingerprint that will
          only last 10 seconds. Due to the size, the table was made dynamic.
      
        - a few minor improvements on the bits used from the hash, to eliminate
          some unfortunate correlations that may possibly have been exploited
          to design future attack models.
      
      These changes were tested under the most extreme conditions, up to
      1.1 million connections per second to one and a few targets, showing no
      performance regression, and only 2 connection failures within 13 billion,
      which is less than 2^-32 and perfectly within usual values.
      
      The series is split into small reviewable changes and was already reviewed
      by Amit and Eric.
      ====================
      
      Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.euSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ef562489
    • Willy Tarreau's avatar
      tcp: drop the hash_32() part from the index calculation · e8161345
      Willy Tarreau authored
      In commit 190cc824 ("tcp: change source port randomizarion at
      connect() time"), the table_perturb[] array was introduced and an
      index was taken from the port_offset via hash_32(). But it turns
      out that hash_32() performs a multiplication while the input here
      comes from the output of SipHash in secure_seq, that is well
      distributed enough to avoid the need for yet another hash.
      Suggested-by: default avatarAmit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e8161345
    • Willy Tarreau's avatar
      tcp: increase source port perturb table to 2^16 · 4c2c8f03
      Willy Tarreau authored
      Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
      identify a client by forcing it to emit only 40 times more connections
      than there are entries in the table_perturb[] table. The previous two
      improvements consisting in resalting the secret every 10s and adding
      randomness to each port selection only slightly improved the situation,
      and the current value of 2^8 was too small as it's not very difficult
      to make a client emit 10k connections in less than 10 seconds.
      
      Thus we're increasing the perturb table from 2^8 to 2^16 so that the
      same precision now requires 2.6M connections, which is more difficult in
      this time frame and harder to hide as a background activity. The impact
      is that the table now uses 256 kB instead of 1 kB, which could mostly
      affect devices making frequent outgoing connections. However such
      components usually target a small set of destinations (load balancers,
      database clients, perf assessment tools), and in practice only a few
      entries will be visited, like before.
      
      A live test at 1 million connections per second showed no performance
      difference from the previous value.
      Reported-by: default avatarMoshe Kol <moshe.kol@mail.huji.ac.il>
      Reported-by: default avatarYossi Gilad <yossi.gilad@mail.huji.ac.il>
      Reported-by: default avatarAmit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4c2c8f03
    • Willy Tarreau's avatar
      tcp: dynamically allocate the perturb table used by source ports · e9261476
      Willy Tarreau authored
      We'll need to further increase the size of this table and it's likely
      that at some point its size will not be suitable anymore for a static
      table. Let's allocate it on boot from inet_hashinfo2_init(), which is
      called from tcp_init().
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e9261476
    • Willy Tarreau's avatar
      tcp: add small random increments to the source port · ca7af040
      Willy Tarreau authored
      Here we're randomly adding between 0 and 7 random increments to the
      selected source port in order to add some noise in the source port
      selection that will make the next port less predictable.
      
      With the default port range of 32768-60999 this means a worst case
      reuse scenario of 14116/8=1764 connections between two consecutive
      uses of the same port, with an average of 14116/4.5=3137. This code
      was stressed at more than 800000 connections per second to a fixed
      target with all connections closed by the client using RSTs (worst
      condition) and only 2 connections failed among 13 billion, despite
      the hash being reseeded every 10 seconds, indicating a perfectly
      safe situation.
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ca7af040
    • Eric Dumazet's avatar
      tcp: resalt the secret every 10 seconds · 4dfa9b43
      Eric Dumazet authored
      In order to limit the ability for an observer to recognize the source
      ports sequence used to contact a set of destinations, we should
      periodically shuffle the secret. 10 seconds looks effective enough
      without causing particular issues.
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Tested-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4dfa9b43
    • Willy Tarreau's avatar
      tcp: use different parts of the port_offset for index and offset · 9e9b70ae
      Willy Tarreau authored
      Amit Klein suggests that we use different parts of port_offset for the
      table's index and the port offset so that there is no direct relation
      between them.
      
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      9e9b70ae
    • Willy Tarreau's avatar
      secure_seq: use the 64 bits of the siphash for port offset calculation · b2d05756
      Willy Tarreau authored
      SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
      7cd23e53 ("secure_seq: use SipHash in place of MD5"), but the output
      remained truncated to 32-bit only. In order to exploit more bits from the
      hash, let's make the functions return the full 64-bit of siphash_3u32().
      We also make sure the port offset calculation in __inet_hash_connect()
      remains done on 32-bit to avoid the need for div_u64_rem() and an extra
      cost on 32-bit systems.
      
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b2d05756
    • Jakub Kicinski's avatar
      Merge branch 'wireguard-patches-for-5-18-rc6' · 205557ba
      Jakub Kicinski authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard patches for 5.18-rc6
      
      In working on some other problems, I wound up leaning on the WireGuard
      CI more than usual and uncovered a few small issues with reliability.
      These are fairly low key changes, since they don't impact kernel code
      itself.
      
      One change does stick out in particular, though, which is the "make
      routing loop test non-fatal" commit. I'm not thrilled about doing this,
      but currently [1] remains unsolved, and I'm still working on a real
      solution to that (hopefully for 5.19 or 5.20 if I can come up with a
      good idea...), so for now that test just prints a big red warning
      instead.
      
      [1] https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
      ====================
      
      Link: https://lore.kernel.org/r/20220504202920.72908-1-Jason@zx2c4.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      205557ba
    • Jason A. Donenfeld's avatar
      wireguard: selftests: set panic_on_warn=1 from cmdline · 3fc1b11e
      Jason A. Donenfeld authored
      Rather than setting this once init is running, set panic_on_warn from
      the kernel command line, so that it catches splats from WireGuard
      initialization code and the various crypto selftests.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3fc1b11e
    • Jason A. Donenfeld's avatar
      wireguard: selftests: bump package deps · a6b8ea91
      Jason A. Donenfeld authored
      Use newer, more reliable package dependencies. These should hopefully
      reduce flakes. However, we keep the old iputils package, as it
      accumulated bugs after resulting in flakes on slow machines.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a6b8ea91
    • Jason A. Donenfeld's avatar
      wireguard: selftests: restore support for ccache · d261ba6a
      Jason A. Donenfeld authored
      When moving to non-system toolchains, we inadvertantly killed the
      ability to use ccache. So instead, build ccache support into the test
      harness directly.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d261ba6a
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use newer toolchains to fill out architectures · d5d9b29b
      Jason A. Donenfeld authored
      Rather than relying on the system to have cross toolchains available,
      simply download musl.cc's ones and use that libc.so, and then we use it
      to fill in a few missing platforms, such as riscv64, riscv64, powerpc64,
      and s390x.
      
      Since riscv doesn't have a second serial port in its device description,
      we have to use virtio's vport. This is actually the same situation on
      ARM, but we were previously hacking QEMU up to work around this, which
      required a custom QEMU. Instead just do the vport trick on ARM too.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d5d9b29b
    • Jason A. Donenfeld's avatar
      wireguard: selftests: limit parallelism to $(nproc) tests at once · 39f02bf1
      Jason A. Donenfeld authored
      The parallel tests were added to catch queueing issues from multiple
      cores. But what happens in reality when testing tons of processes is
      that these separate threads wind up fighting with the scheduler, and we
      wind up with contention in places we don't care about that decrease the
      chances of hitting a bug. So just do a test with the number of CPU
      cores, rather than trying to scale up arbitrarily.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      39f02bf1
    • Jason A. Donenfeld's avatar
      wireguard: selftests: make routing loop test non-fatal · ae2de669
      Jason A. Donenfeld authored
      I hate to do this, but I still do not have a good solution to actually
      fix this bug across architectures. So just disable it for now, so that
      the CI can still deliver actionable results. This commit adds a large
      red warning, so that at least the failure isn't lost forever, and
      hopefully this can be revisited down the line.
      
      Link: https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/
      Link: https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
      Link: https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ae2de669
  3. 04 May, 2022 6 commits