1. 09 Feb, 2022 24 commits
  2. 08 Feb, 2022 7 commits
    • Merge branch 'inet-separate-dscp-from-ecn-bits-using-new-dscp_t-type' · c3e676b9
      Jakub Kicinski authored
      Guillaume Nault says:
      
      ====================
      inet: Separate DSCP from ECN bits using new dscp_t type
      
      The networking stack currently doesn't clearly distinguish between DSCP
      and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
      structure fields), and each part of the stack handles them in its own
      way, using different macros. This has created several bugs in the past
      and some uncommon code paths are still unfixed.
      
      Such bugs generally manifest by selecting invalid routes because of ECN
      bits interfering with FIB routes and rules lookups (more details in the
      LPC 2021 talk[1] and in the RFC of this series[2]).
      
      This patch series aims at preventing the introduction of such bugs (and
      detecting existing ones), by introducing a dscp_t type, representing
      "sanitised" DSCP values (that is, with no ECN information), as opposed
      to plain u8 values that contain both DSCP and ECN information. dscp_t
      makes it clear for the reader what we're working on, and Sparse can
      flag invalid interactions between dscp_t and plain u8.
      
      This series converts only a few variables and structures:
      
        * Patch 1 converts the tclass field of struct fib6_rule. It
          effectively forbids the use of ECN bits in the tos/dsfield option
          of ip -6 rule. Rules now match packets solely based on their DSCP
          bits, so ECN doesn't influence the result any more. This contrasts
          with the previous behaviour where all 8 bits of the Traffic Class
          field were used. It is believed that this change is acceptable as
          matching ECN bits wasn't usable for IPv4, so only IPv6-only
          deployments could be depending on it. Also the previous behaviour
          made DSCP-based ip6-rules fail for packets with both a DSCP and an
          ECN mark, which is another reason why any such deployment is unlikely.
      
        * Patch 2 converts the tos field of struct fib4_rule. This one too
          effectively forbids defining ECN bits, this time in ip -4 rule.
          Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
          rejected. But even when accepted, the rule would never match, as
          the packets would have their ECN bits cleared before doing the
          rule lookup.
      
        * Patch 3 converts the fc_tos field of struct fib_config. This is
          equivalent to patch 2, but for IPv4 routes. Routes using a
          tos/dsfield option with any ECN bit set are now rejected. Before
          this patch, they were accepted but, as with ip4 rules, these routes
          couldn't match any packet, since their ECN bits are cleared before
          the lookup.
      
        * Patch 4 converts the fa_tos field of struct fib_alias. This one is
          a purely internal u8 to dscp_t conversion. While patches 1-3 had
          user-facing consequences, this patch shouldn't have any side effect
          and is there to give an overview of what future conversion patches
          will look like. Conversions are quite mechanical, but imply some
          code churn, which is the price for the extra clarity and the
          possibility of type checking.
      
      To summarise, all the behaviour changes required for the dscp_t type
      approach to work should be contained in patches 1-3. These changes are
      edge cases of ip-route and ip-rule that don't currently work properly.
      So they should be safe. Also, a kernel selftest is added for each of
      them.
      
      Finally, this work also paves the way for allowing the usage of the 3
      high order DSCP bits in IPv4 (a few call paths already handle them, but
      in general the stack clears them before IPv4 rule and route lookups).
      
      References:
        [1] LPC 2021 talk:
              - https://linuxplumbersconf.org/event/11/contributions/943/
              - Direct link to slide deck:
                  https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
        [2] RFC version of this series:
            - https://lore.kernel.org/netdev/cover.1638814614.git.gnault@redhat.com/
      ====================
      
      Link: https://lore.kernel.org/r/cover.1643981839.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: Use dscp_t in struct fib_alias · 32ccf110
      Guillaume Nault authored
      Use the new dscp_t type to replace the fa_tos field of fib_alias. This
      ensures ECN bits are ignored and makes the field compatible with the
      fc_dscp field of struct fib_config.
      
      Converting old *tos variables and fields to dscp_t allows sparse to
      flag incorrect uses of DSCP and ECN bits. This patch is entirely about
      type annotation and shouldn't change any existing behaviour.
      
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: Reject routes specifying ECN bits in rtm_tos · f55fbb6a
      Guillaume Nault authored
      Use the new dscp_t type to replace the fc_tos field of fib_config, to
      ensure IPv4 routes aren't influenced by ECN bits when configured with
      non-zero rtm_tos.
      
      Before this patch, IPv4 routes specifying an rtm_tos with some of the
      ECN bits set were accepted. However they wouldn't work (never match) as
      IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
      lookup (although a few buggy code paths don't).
      
      After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
      set are rejected.
      
      Note: IPv6 routes ignore rtm_tos altogether: any rtm_tos is accepted,
      but treated as if it were 0.
      
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
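The rejection described in this commit can be illustrated with a small user-space sketch. It is not the kernel's code: `DSCP_MASK` and `validate_route_tos()` are hypothetical names chosen for illustration, and the only assumption carried over from the text is that a route's tos value is refused as soon as either ECN bit (the two low-order bits) is set.

```c
#include <errno.h>
#include <stdint.h>

/* Upper six bits of the DS field carry DSCP; the lower two carry ECN. */
#define DSCP_MASK 0xfcu

/* Hypothetical validator: returns 0 when tos carries no ECN bits,
 * -EINVAL when at least one ECN bit is set. */
static int validate_route_tos(uint8_t tos)
{
	if (tos & ~DSCP_MASK)
		return -EINVAL;
	return 0;
}
```

With this check, a value such as 0x28 (DSCP only) is accepted, while 0x29 or 0x02 is refused at configuration time instead of producing a route that can never match.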
    • ipv4: Stop taking ECN bits into account in fib4-rules · 563f8e97
      Guillaume Nault authored
      Use the new dscp_t type to replace the tos field of struct fib4_rule,
      so that fib4-rules consistently ignore ECN bits.
      
      Before this patch, fib4-rules did accept rules with the high order ECN
      bit set (but not the low order one). Also, it relied on its callers
      masking the ECN bits of ->flowi4_tos to prevent those from influencing
      the result. This was brittle and a few call paths still do the lookup
      without masking the ECN bits first.
      
      After this patch fib4-rules only compare the DSCP bits. ECN can't
      influence the result anymore, even if the caller didn't mask these
      bits. Also, fib4-rules now must have both ECN bits cleared or they will
      be rejected.
      
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules · a410a0cf
      Guillaume Nault authored
      Define a dscp_t type and its appropriate helpers that ensure ECN bits
      are not taken into account when handling DSCP.
      
      Use this new type to replace the tclass field of struct fib6_rule, so
      that fib6-rules don't get influenced by ECN bits anymore.
      
      Before this patch, fib6-rules didn't make any distinction between the
      DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield
      options in iproute2) stopped working as soon as a packet had at least
      one of its ECN bits set (as a workaround, one could create four rules
      for each DSCP value to match, one for each possible ECN value).
      
      After this patch fib6-rules only compare the DSCP bits. ECN doesn't
      influence the result anymore. Also, fib6-rules now must have the ECN
      bits cleared or they will be rejected.
      
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
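The idea of a dedicated type for "sanitised" DSCP values can be sketched in user space as follows. This is only an illustration under stated assumptions: the kernel relies on a Sparse annotation for type checking, whereas this sketch wraps the value in a struct so that an ordinary C compiler rejects accidental mixing with plain 8-bit values. The names `DSCP_MASK`, `dsfield_to_dscp()`, `dscp_to_dsfield()` and `rule_matches()` are hypothetical, as is the convention that a zero rule value matches any packet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper six bits of the DS field carry DSCP; the lower two carry ECN. */
#define DSCP_MASK 0xfcu

/* Hypothetical stand-in for dscp_t: the struct wrapper prevents implicit
 * conversion to or from a raw uint8_t holding DSCP and ECN together. */
typedef struct { uint8_t v; } dscp_t;

/* Sanitise a raw DS field (DSCP + ECN) by dropping the ECN bits. */
static inline dscp_t dsfield_to_dscp(uint8_t dsfield)
{
	dscp_t d = { (uint8_t)(dsfield & DSCP_MASK) };
	return d;
}

/* Convert back to a raw DS field value; the ECN bits are always zero. */
static inline uint8_t dscp_to_dsfield(dscp_t dscp)
{
	return dscp.v;
}

/* A rule compares DSCP bits only, so the packet's ECN bits cannot
 * influence the result. A zero rule value is assumed to match anything. */
static inline bool rule_matches(dscp_t rule, uint8_t pkt_dsfield)
{
	return rule.v == 0 || rule.v == dsfield_to_dscp(pkt_dsfield).v;
}
```

Under this scheme a single rule for DSCP value 0x28 matches packets whose DS field is 0x28, 0x29, 0x2a or 0x2b, which removes the need for the four-rules-per-DSCP workaround mentioned in the commit message.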
    • net: stmmac: optimize locking around PTP clock reads · 642436a1
      Yannick Vignon authored
      Reading the PTP clock is a simple operation requiring only 3 register
      reads. Under a PREEMPT_RT kernel, protecting those reads by a spin_lock is
      counter-productive: if the 2nd task preempting the 1st has a higher prio
      but needs to read time as well, it will require 2 context switches, which
      will pretty much always be more costly than just disabling preemption for
      the duration of the reads. Moreover, with the code logic recently added
      to get_systime(), disabling preemption is not even required anymore:
      reads and writes just need to be protected from each other, to prevent a
      clock read while the clock is being updated.
      
      Improve the above situation by replacing the PTP spinlock with an rwlock, and
      using read_lock for PTP clock reads so simultaneous reads do not block
      each other.
      
      Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
      Link: https://lore.kernel.org/r/20220204135545.2770625-1-yannick.vignon@oss.nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: typhoon: include <net/vxlan.h> · d1d5bd64
      Eric Dumazet authored
      We need this to get vxlan_features_check() definition.
      
      Fixes: d2692eee ("net: typhoon: implement ndo_features_check method")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220208003502.1799728-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 07 Feb, 2022 9 commits
    • net: dsa: mv88e6xxx: Unlock on error in mv88e6xxx_port_bridge_join() · ff624338
      Dan Carpenter authored
      Call mv88e6xxx_reg_unlock(chip) before returning on this error path.
      
      Fixes: 7af4a361 ("net: dsa: mv88e6xxx: Improve isolation of standalone ports")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
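The class of bug fixed here, returning early while a lock is still held, is commonly avoided by routing every exit through a single unlock label. The sketch below shows that pattern in isolation; it is not the driver's code, and `struct chip_sketch`, `reg_lock()`, `reg_unlock()` and `bridge_join_sketch()` are hypothetical names, with a depth counter standing in for the real mutex.

```c
#include <errno.h>

/* Hypothetical chip with a lock-depth counter standing in for a mutex. */
struct chip_sketch {
	int lock_depth;
};

static void reg_lock(struct chip_sketch *chip)
{
	chip->lock_depth++;
}

static void reg_unlock(struct chip_sketch *chip)
{
	chip->lock_depth--;
}

/* Every exit taken after reg_lock() goes through the unlock label, so
 * an error in either step cannot leave the lock held. */
static int bridge_join_sketch(struct chip_sketch *chip,
			      int step1_err, int step2_err)
{
	int err;

	reg_lock(chip);

	err = step1_err;	/* stands in for the first fallible operation */
	if (err)
		goto unlock;

	err = step2_err;	/* stands in for the second fallible operation */

unlock:
	reg_unlock(chip);
	return err;
}
```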
    • net: dsa: mv88e6xxx: Fix off by one in mv88e6185_phylink_get_caps() · dde41a69
      Dan Carpenter authored
      The <= ARRAY_SIZE() needs to be < ARRAY_SIZE() to prevent an out of
      bounds error.
      
      Fixes: d4ebf12b ("net: dsa: mv88e6xxx: populate supported_interfaces and mac_capabilities")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
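The bound check at issue can be shown with a minimal, self-contained sketch. The table contents and the names `mode_table` and `lookup_mode()` are hypothetical; only the `<` versus `<=` comparison against `ARRAY_SIZE()` reflects the commit message.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical lookup table; the contents are illustrative only. */
static const int mode_table[] = { 10, 20, 30, 40 };

/* Returns the table entry for idx, or -1 when idx is out of range. */
static int lookup_mode(size_t idx)
{
	/* "<" is the correct bound: idx == ARRAY_SIZE() would already be
	 * one element past the end of the array. */
	if (idx < ARRAY_SIZE(mode_table))
		return mode_table[idx];
	return -1;
}
```

Valid indices run from 0 to `ARRAY_SIZE() - 1`, so a `<=` comparison would let the last iteration or lookup read one element beyond the array.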
    • net: hns3: add support for TX push mode · 87a9b2fd
      Yufeng Mo authored
      For devices that support the TX push capability, the BD can be
      copied directly to the device memory. However, due to hardware
      restrictions, push mode can be used only when there are no more
      than two BDs; otherwise, the doorbell mode based on device memory
      is used.
      
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: asix: add proper error handling of usb read errors · 920a9fa2
      Pavel Skripkin authored
      Syzbot once again hit an uninit value in the asix driver. The problem
      is still the same: asix_read_cmd() reads fewer bytes than the caller
      requested.
      
      Since all read requests are performed via asix_read_cmd(), let's catch
      USB-related errors there and add the __must_check annotation to be sure
      all callers actually check the return value.
      
      So, this patch adds a sanity check inside asix_read_cmd() that simply
      checks that the number of bytes read is not less than requested, and
      adds the missing error handling of asix_read_cmd() all across the
      driver code.
      
      Fixes: d9fe64e5 ("net: asix: Add in_pm parameter")
      Reported-and-tested-by: syzbot+6ca9f7867b77c2d316ac@syzkaller.appspotmail.com
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
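The short-read check described above can be sketched outside the driver as follows. Everything here is hypothetical scaffolding: `read_fn`, `read_cmd_sketch()`, `full_read()` and `short_read()` are illustrative names, `MUST_CHECK` is a user-space stand-in for the kernel's `__must_check` (it assumes a GCC- or Clang-compatible compiler), and the choice of `-ENODATA` for a short read is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-in for __must_check: warn if the result is ignored. */
#define MUST_CHECK __attribute__((warn_unused_result))

/* Hypothetical transport callback: returns bytes read or a negative errno. */
typedef int (*read_fn)(void *buf, size_t len);

/* Treat a short read as an error so that callers never consume a
 * partially filled (and therefore partly uninitialised) buffer. */
static MUST_CHECK int read_cmd_sketch(read_fn do_read, void *buf, size_t len)
{
	int ret = do_read(buf, len);

	if (ret < 0)
		return ret;		/* transport error */
	if ((size_t)ret < len)
		return -ENODATA;	/* fewer bytes than requested */
	return ret;
}

/* Demo transport that fills the whole buffer. */
static int full_read(void *buf, size_t len)
{
	memset(buf, 0xab, len);
	return (int)len;
}

/* Demo transport that reports one byte fewer than requested. */
static int short_read(void *buf, size_t len)
{
	(void)buf;
	return len ? (int)len - 1 : 0;
}
```

Centralising the check in the one function that performs all reads means each caller only has to test for a negative return value.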
    • r8169: factor out redundant RTL8168d PHY config functionality to rtl8168d_1_common() · b845bac8
      Heiner Kallweit authored
      rtl8168d_2_hw_phy_config() shares quite some functionality with
      rtl8168d_1_hw_phy_config(), so let's factor out the common part to a
      new function rtl8168d_1_common(). In addition improve the code a little.
      
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6mr: fix use-after-free in ip6mr_sk_done() · 7d9b1b57
      Eric Dumazet authored
      Apparently addrconf_exit_net() is called before igmp6_net_exit()
      and ndisc_net_exit() at netns dismantle time:
      
       net_namespace: call ip6table_mangle_net_exit()
       net_namespace: call ip6_tables_net_exit()
       net_namespace: call ipv6_sysctl_net_exit()
       net_namespace: call ioam6_net_exit()
       net_namespace: call seg6_net_exit()
       net_namespace: call ping_v6_proc_exit_net()
       net_namespace: call tcpv6_net_exit()
       ip6mr_sk_done sk=ffffa354c78a74c0
       net_namespace: call ipv6_frags_exit_net()
       net_namespace: call addrconf_exit_net()
       net_namespace: call ip6addrlbl_net_exit()
       net_namespace: call ip6_flowlabel_net_exit()
       net_namespace: call ip6_route_net_exit_late()
       net_namespace: call fib6_rules_net_exit()
       net_namespace: call xfrm6_net_exit()
       net_namespace: call fib6_net_exit()
       net_namespace: call ip6_route_net_exit()
       net_namespace: call ipv6_inetpeer_exit()
       net_namespace: call if6_proc_net_exit()
       net_namespace: call ipv6_proc_exit_net()
       net_namespace: call udplite6_proc_exit_net()
       net_namespace: call raw6_exit_net()
       net_namespace: call igmp6_net_exit()
       ip6mr_sk_done sk=ffffa35472b2a180
       ip6mr_sk_done sk=ffffa354c78a7980
       net_namespace: call ndisc_net_exit()
       ip6mr_sk_done sk=ffffa35472b2ab00
       net_namespace: call ip6mr_net_exit()
       net_namespace: call inet6_net_exit()
      
      This was fine because ip6mr_sk_done() would not reach the point decreasing
      net->ipv6.devconf_all->mc_forwarding until my patch in ip6mr_sk_done().
      
      To fix this without changing struct pernet_operations ordering,
      we can clear net->ipv6.devconf_dflt and net->ipv6.devconf_all
      when they are freed from addrconf_exit_net().
      
      BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline]
      BUG: KASAN: use-after-free in atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
      BUG: KASAN: use-after-free in ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
      Read of size 4 at addr ffff88801ff08688 by task kworker/u4:4/963
      
      CPU: 0 PID: 963 Comm: kworker/u4:4 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: netns cleanup_net
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       check_region_inline mm/kasan/generic.c:183 [inline]
       kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
       instrument_atomic_read include/linux/instrumented.h:71 [inline]
       atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
       ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
       rawv6_close+0x58/0x80 net/ipv6/raw.c:1201
       inet_release+0x12e/0x280 net/ipv4/af_inet.c:428
       inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478
       __sock_release net/socket.c:650 [inline]
       sock_release+0x87/0x1b0 net/socket.c:678
       inet_ctl_sock_destroy include/net/inet_common.h:65 [inline]
       igmp6_net_exit+0x6b/0x170 net/ipv6/mcast.c:3173
       ops_exit_list+0xb0/0x170 net/core/net_namespace.c:168
       cleanup_net+0x4ea/0xb00 net/core/net_namespace.c:600
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
       </TASK>
      
      Fixes: f2f2325e ("ip6mr: ip6mr_sk_done() can exit early in common cases")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
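The approach described in this commit, clearing a pointer at the point where its target is freed so that later teardown steps can detect it, can be sketched in user space as below. This is an illustration only: `struct devconf_sketch`, `struct netns_sketch`, `addrconf_exit_sketch()` and `sk_done_sketch()` are hypothetical names, and the NULL test in the later step is an assumption about how a consumer would use the cleared pointer.

```c
#include <stdlib.h>

/* Hypothetical per-namespace state; field names are illustrative. */
struct devconf_sketch {
	int mc_forwarding;
};

struct netns_sketch {
	struct devconf_sketch *devconf_all;
};

/* Earlier teardown step: free the config and clear the pointer so that
 * nothing running later can dereference freed memory. */
static void addrconf_exit_sketch(struct netns_sketch *net)
{
	free(net->devconf_all);
	net->devconf_all = NULL;
}

/* Later teardown step: only touch the config if it still exists.
 * Returns 1 if the counter was decremented, 0 if it was skipped. */
static int sk_done_sketch(struct netns_sketch *net)
{
	if (!net->devconf_all)
		return 0;	/* already freed: nothing to decrement */
	net->devconf_all->mc_forwarding--;
	return 1;
}
```

This keeps the existing teardown order intact: the later step simply becomes a no-op once the earlier one has run.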
    • caif: cleanup double word in comment · 0812beb7
      Tom Rix authored
      Replace the second 'so' with 'free'.
      
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-dip-sip-mangling' · f485da3c
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Add SIP and DIP mangling support
      
      Danielle says:
      
      On Spectrum-2 onwards, it is possible to overwrite the SIP and DIP
      addresses of an IPv4 or IPv6 packet in the ACL engine. That corresponds to pedit
      munges of, respectively, ip src and ip dst fields, and likewise for ip6.
      Offload these munges on the systems where they are supported.
      
      Patchset overview:
      Patch #1 introduces SIP_DIP_ACTION and its fields.
      Patches #2-#3 add the new pedit fields and dispatch on them on
      Spectrum-2 and above.
      Patch #4 adds a selftest.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: Add a test for pedit munge SIP and DIP · 92ad3828
      Danielle Ratson authored
      Add a test that checks that pedit adjusts source and destination
      addresses of IPv4 and IPv6 packets.
      
      Output example:
      
      $ ./pedit_ip.sh
      TEST: ping                                                          [ OK ]
      TEST: ping6                                                         [ OK ]
      TEST: dev swp2 ingress pedit ip src set 198.51.100.1                [ OK ]
      TEST: dev swp3 egress pedit ip src set 198.51.100.1                 [ OK ]
      TEST: dev swp2 ingress pedit ip dst set 198.51.100.1                [ OK ]
      TEST: dev swp3 egress pedit ip dst set 198.51.100.1                 [ OK ]
      TEST: dev swp2 ingress pedit ip6 src set 2001:db8:2::1              [ OK ]
      TEST: dev swp3 egress pedit ip6 src set 2001:db8:2::1               [ OK ]
      TEST: dev swp2 ingress pedit ip6 dst set 2001:db8:2::1              [ OK ]
      TEST: dev swp3 egress pedit ip6 dst set 2001:db8:2::1               [ OK ]
      
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>