1. 08 Aug, 2021 3 commits
  2. 05 Aug, 2021 34 commits
    • netdevice: add the case if dev is NULL · b37a4668
      Yajun Deng authored
      Handle the case where dev is NULL in dev_{put,hold}, so the caller
      doesn't need to check whether dev is NULL or not.
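A minimal userspace sketch of the idea, assuming a plain int refcount in a hypothetical struct demo_netdev (the real struct net_device has its own refcounting). Because the NULL check lives inside the helpers, a caller's "if (dev) dev_put(dev);" collapses to "dev_put(dev);".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only the refcount matters. */
struct demo_netdev {
	int refcnt;
};

/* NULL-tolerant helpers: passing NULL is a no-op, so callers need no
 * "if (dev)" guard of their own. */
static void demo_dev_hold(struct demo_netdev *dev)
{
	if (dev)
		dev->refcnt++;
}

static void demo_dev_put(struct demo_netdev *dev)
{
	if (dev)
		dev->refcnt--;
}
```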
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Remove redundant if statements · 1160dfa1
      Yajun Deng authored
      The 'if (dev)' check has already been moved into dev_{put,hold}, so
      remove the redundant if statements.
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "wwan: mhi: Fix build." · a85b99ab
      David S. Miller authored
      This reverts commit ab996c42.
      
      Only applicable when net is merged into net-next.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'GRO-Toeplitz-selftests' · 6234219d
      David S. Miller authored
      Coco Li says:
      
      ====================
      GRO and Toeplitz hash selftests
      
      This patch contains two selftests in net, as well as respective
      scripts to run the tests on a single machine in loopback mode.
      GRO: tests the Linux kernel GRO behavior
      Toeplitz: tests the Toeplitz hash implementation
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests/net: toeplitz test · 5ebfb4cc
      Coco Li authored
      Verify that the receive hash (rx_hash) reported by the device implements
      the Toeplitz hash function.
      
      Additionally, provide a script toeplitz.sh to run the test in loopback mode
      on a networking device of choice (see setup_loopback.sh). Since the
      script modifies the NIC setup, it will not be run by selftests
      automatically.
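For reference, the Toeplitz hash itself can be sketched in C. This is an illustrative userspace version, not the selftest's code. It assumes the commonly used 40-byte default RSS key (a device may be programmed with any key) and assumes the input is the flow tuple in network byte order (source address, destination address, source port, destination port), i.e. 12 bytes for IPv4 with ports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Commonly used default RSS key (assumed here; devices may use any key). */
static const uint8_t demo_rss_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash: for every set bit i of the input (most significant bit of
 * byte 0 first), XOR in the 32-bit window of the key starting at bit i.
 * The key must be at least 4 bytes longer than the input. */
static uint32_t demo_toeplitz(const uint8_t *key, size_t key_len,
			      const uint8_t *in, size_t len)
{
	uint32_t hash = 0;
	uint32_t window;

	assert(key_len >= len + 4);
	window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
		 (uint32_t)key[2] << 8 | key[3];

	for (size_t i = 0; i < len; i++) {
		uint8_t next = key[i + 4];	/* key byte shifted in next */

		for (int bit = 7; bit >= 0; bit--) {
			if (in[i] & (1u << bit))
				hash ^= window;
			/* slide the window one bit further along the key */
			window = window << 1 | ((next >> bit) & 1u);
		}
	}
	return hash;
}
```

A 40-byte key also covers the 36-byte IPv6 tuple. Recomputing this value and comparing it with the rx_hash attached to each received packet is the kind of check the "Tested:" output reports per packet.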
      
      Tested:
      ./toeplitz.sh -i eth0 -irq_prefix <eth0_pattern> -t -6
      carrier ready
      rxq 0: cpu 14
      rxq 1: cpu 20
      rxq 2: cpu 17
      rxq 3: cpu 23
      cpu 14: rx_hash 0x69103ebc [saddr fda8::2 daddr fda8::1 sport 58938 dport 8000] OK rxq 0 (cpu 14)
      ...
      cpu 20: rx_hash 0x257118b9 [saddr fda8::2 daddr fda8::1 sport 59258 dport 8000] OK rxq 1 (cpu 20)
      count: pass=111 nohash=0 fail=0
      Test Succeeded!
      Signed-off-by: Coco Li <lixiaoyan@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests/net: GRO coalesce test · 7d157501
      Coco Li authored
      Implement a GRO testsuite that expects Linux kernel GRO behavior.
      All tests pass with the kernel software GRO stack. Run against a device
      with hardware GRO to verify that it matches the software stack.
      
      gro.c generates packets and sends them out through a packet socket. The
      receiver in gro.c (run separately) receives the packets on a packet
      socket, filters them by destination ports using BPF and checks the
      packet geometry to see whether GRO was applied.
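To make "packet geometry" concrete, here is a deliberately simplified model of TCP coalescing for one in-order flow. The rules in the comment are assumptions for illustration, not the full tcp_gro_receive() logic, and struct demo_seg is hypothetical. They do reproduce the cases named in the sample log: same-size segments merge, large segments followed by a smaller one merge, and a small segment followed by a larger one does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment of a single TCP flow. */
struct demo_seg {
	uint32_t seq;	/* sequence number of the first payload byte */
	uint16_t len;	/* payload length */
};

/* Count the packets seen after coalescing, under simplified rules:
 *  - the first segment of a run fixes the segment size (mss);
 *  - a later segment merges if it is contiguous and no larger than mss;
 *  - a segment shorter than mss may merge but closes the run;
 *  - the merged payload may not exceed max_payload bytes. */
static size_t demo_gro_count(const struct demo_seg *s, size_t n,
			     uint32_t max_payload)
{
	size_t out = 0;
	size_t i = 0;

	while (i < n) {
		uint16_t mss = s[i].len;
		uint32_t next_seq = s[i].seq + s[i].len;
		uint32_t total = s[i].len;

		i++;
		out++;
		while (i < n && s[i].seq == next_seq && s[i].len <= mss &&
		       total + s[i].len <= max_payload) {
			total += s[i].len;
			next_seq += s[i].len;
			if (s[i++].len < mss)
				break;	/* a short segment ends the run */
		}
	}
	return out;
}
```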
      
      gro.sh provides a wrapper to run gro.c in NIC loopback mode.
      It is not included in continuous testing because it modifies network
      configuration around a physical NIC: gro.sh sets the NIC in loopback
      mode, creates macvlan devices on the physical device in separate
      namespaces, and sends traffic generated by gro.c between the two
      namespaces to observe coalescing behavior.
      
      GRO coalescing is time sensitive, so some tests may be flaky on some
      hardware.
      
      Note that this test suite tests for software GRO unless hardware GRO is
      enabled (ethtool -K $DEV rx-gro-hw on).
      
      To test, run ./gro.sh.
      The wrapper outputs the names of passed and failed tests, and generates
      log.txt and stderr.
      
      Sample log.txt result:
      ...
      pure data packet of same size: Test succeeded
      
      large data packets followed by a smaller one: Test succeeded
      
      small data packets followed by a larger one: Test succeeded
      ...
      
      Sample stderr result:
      ...
      carrier ready
      running test ipv4 data
      Expected {200 }, Total 1 packets
      Received {200 }, Total 1 packets.
      ...
      Signed-off-by: Coco Li <lixiaoyan@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wwan: mhi: Fix build. · ab996c42
      David S. Miller authored
      Reported-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6/mcast: Use struct_size() helper · e11c0e25
      Gustavo A. R. Silva authored
      Replace IP6_SFLSIZE() with the struct_size() helper in order to avoid any
      potential type mistakes or integer overflows that, in the worst
      scenario, could lead to heap overflows.
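As a sketch of why struct_size() is preferred, here is an overflow-checked size computation in userspace C. It uses the GCC/Clang builtins __builtin_mul_overflow() and __builtin_add_overflow() and a hypothetical struct demo_sflist. Like the kernel helper, it saturates to SIZE_MAX on overflow, so a subsequent allocation fails instead of returning a buffer that is too small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure with a trailing flexible array. */
struct demo_sflist {
	uint32_t count;
	uint32_t max;
	uint32_t addrs[];
};

/* Open-coded size: silently wraps around for huge n. */
static size_t demo_naive_size(size_t n)
{
	return sizeof(struct demo_sflist) + n * sizeof(uint32_t);
}

/* Checked size: saturates to SIZE_MAX if either step overflows. */
static size_t demo_struct_size(size_t n)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, sizeof(uint32_t), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct demo_sflist), &bytes))
		return SIZE_MAX;
	return bytes;
}
```

For n = SIZE_MAX the open-coded version wraps to a 4-byte result, while the checked version returns SIZE_MAX.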
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4/igmp: Use struct_size() helper · e6a1f7e0
      Gustavo A. R. Silva authored
      Replace IP_SFLSIZE() with the struct_size() helper in order to avoid any
      potential type mistakes or integer overflows that, in the worst
      scenario, could lead to heap overflows.
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4/ipv6: Replace one-element arrays with flexible-array members · db243b79
      Gustavo A. R. Silva authored
      There is a regular need in the kernel to provide a way to declare having
      a dynamically sized set of trailing elements in a structure. Kernel code
      should always use “flexible array members”[1] for these cases. The older
      style of one-element or zero-length arrays should no longer be used[2].
      
      Use an anonymous union with a couple of anonymous structs in order to
      keep userspace unchanged and refactor the related code accordingly:
      
      $ pahole -C group_filter net/ipv4/ip_sockglue.o
      struct group_filter {
      	union {
      		struct {
      			__u32      gf_interface_aux;     /*     0     4 */
      
      			/* XXX 4 bytes hole, try to pack */
      
      			struct __kernel_sockaddr_storage gf_group_aux; /*     8   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      			__u32      gf_fmode_aux;         /*   136     4 */
      			__u32      gf_numsrc_aux;        /*   140     4 */
      			struct __kernel_sockaddr_storage gf_slist[1]; /*   144   128 */
      		};                                       /*     0   272 */
      		struct {
      			__u32      gf_interface;         /*     0     4 */
      
      			/* XXX 4 bytes hole, try to pack */
      
      			struct __kernel_sockaddr_storage gf_group; /*     8   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      			__u32      gf_fmode;             /*   136     4 */
      			__u32      gf_numsrc;            /*   140     4 */
      			struct __kernel_sockaddr_storage gf_slist_flex[0]; /*   144     0 */
      		};                                       /*     0   144 */
      	};                                               /*     0   272 */
      
      	/* size: 272, cachelines: 5, members: 1 */
      	/* last cacheline: 16 bytes */
      };
      
      $ pahole -C compat_group_filter net/ipv4/ip_sockglue.o
      struct compat_group_filter {
      	union {
      		struct {
      			__u32      gf_interface_aux;     /*     0     4 */
      			struct __kernel_sockaddr_storage gf_group_aux __attribute__((__aligned__(4))); /*     4   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
      			__u32      gf_fmode_aux;         /*   132     4 */
      			__u32      gf_numsrc_aux;        /*   136     4 */
      			struct __kernel_sockaddr_storage gf_slist[1] __attribute__((__aligned__(4))); /*   140   128 */
      		} __attribute__((__packed__)) __attribute__((__aligned__(4)));                     /*     0   268 */
      		struct {
      			__u32      gf_interface;         /*     0     4 */
      			struct __kernel_sockaddr_storage gf_group __attribute__((__aligned__(4))); /*     4   128 */
      			/* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
      			__u32      gf_fmode;             /*   132     4 */
      			__u32      gf_numsrc;            /*   136     4 */
      			struct __kernel_sockaddr_storage gf_slist_flex[0] __attribute__((__aligned__(4))); /*   140     0 */
      		} __attribute__((__packed__)) __attribute__((__aligned__(4)));                     /*     0   140 */
      	} __attribute__((__aligned__(1)));               /*     0   268 */
      
      	/* size: 268, cachelines: 5, members: 1 */
      	/* forced alignments: 1 */
      	/* last cacheline: 12 bytes */
      } __attribute__((__packed__));
      
      This helps with the ongoing efforts to globally enable -Warray-bounds
      and get us closer to being able to tighten the FORTIFY_SOURCE routines
      on memcpy().
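A reduced version of the union layout above, in plain C, with a hypothetical union demo_filter standing in for struct group_filter. Both anonymous structs overlay the same memory, so the flexible array starts exactly where the legacy one-element array does, and the overall size (and therefore the userspace ABI) is set by the legacy view. To stay within standard C11, this sketch makes the union the top-level type; the kernel wraps it in a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified two-view layout. */
union demo_filter {
	struct {
		uint32_t numsrc_aux;
		uint32_t slist[1];	/* legacy one-element array (ABI view) */
	};
	struct {
		uint32_t numsrc;
		uint32_t slist_flex[];	/* flexible array member (kernel view) */
	};
};
```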
      
      [1] https://en.wikipedia.org/wiki/Flexible_array_member
      [2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays
      
      Link: https://github.com/KSPP/linux/issues/79
      Link: https://github.com/KSPP/linux/issues/109
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-ioctl-fixes' · d15040a3
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: fix recent ioctl changes
      
      These are three fixes for the recent bridge removal of ndo_do_ioctl
      done by commit ad2f99ae ("net: bridge: move bridge ioctls out of
      .ndo_do_ioctl"). Patch 01 fixes a deadlock of the new bridge ioctl
      hook lock and rtnl by taking a netdev reference and always taking the
      bridge ioctl lock first then rtnl from within the bridge hook.
      Patch 02 fixes the device name argument of old_deviceless() bridge calls,
      and patch 03 checks in dev_ifsioc()'s SIOCBRADD/DELIF cases whether the
      netdevice is actually a bridge before interpreting its private ptr as
      net_bridge.
      
      Patch 01 was tested by running old bridge-utils commands with lockdep
      enabled. Patch 02 was tested again by using bridge-utils and using the
      respective ioctl calls on an "up" bridge device. Patch 03 was tested by
      using the addif ioctl on a non-bridge device (e.g. loopback).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: don't call SIOCBRADD/DELIF for non-bridge devices · 9384eacd
      Nikolay Aleksandrov authored
      Commit ad2f99ae ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
      changed SIOCBRADD/DELIF to use bridge's ioctl hook (br_ioctl_hook)
      without checking whether the target netdevice is actually a bridge. This
      can cause crashes and, more generally, leads to interpreting other
      devices' private pointers as net_bridge pointers.
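A minimal sketch of the guard, with hypothetical types and flag. In the kernel the equivalent test is the netif_is_bridge_master() helper, which checks the IFF_EBRIDGE private flag; the -EOPNOTSUPP return value here is an assumption chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag marking a device as a bridge. */
#define DEMO_IFF_BRIDGE 0x1u

struct demo_bridge {
	int nports;
};

struct demo_dev {
	unsigned int priv_flags;
	void *priv;		/* meaning depends on the device type */
};

/* Interpret priv as a bridge only after confirming the device is one. */
static int demo_add_if(struct demo_dev *dev)
{
	struct demo_bridge *br;

	if (!(dev->priv_flags & DEMO_IFF_BRIDGE))
		return -EOPNOTSUPP;
	br = dev->priv;
	br->nports++;
	return 0;
}
```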
      
      Crash example (lo - loopback):
      $ brctl addif lo ens16
       BUG: kernel NULL pointer dereference, address: 0000000000000598
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP NOPTI
       CPU: 2 PID: 1376 Comm: brctl Kdump: loaded Tainted: G        W         5.14.0-rc3+ #405
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
       RIP: 0010:add_del_if+0x1f/0x7c [bridge]
       Code: 80 bf 1b a0 41 5c e9 c0 3c 03 e1 0f 1f 44 00 00 41 55 41 54 41 89 f4 be 0c 00 00 00 55 48 89 fd 53 48 8b 87 88 00 00 00 89 d3 <4c> 8b a8 98 05 00 00 49 8b bd d0 00 00 00 e8 17 d7 f3 e0 84 c0 74
       RSP: 0018:ffff888109d97cb0 EFLAGS: 00010202
       RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff888101239bc0
       RBP: ffff888101239bc0 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff888109d97cd8 R11: 00000000000000a3 R12: 0000000000000012
       R13: 0000000000000000 R14: ffff888101239bc0 R15: ffff888109d97e10
       FS:  00007fc1e365b540(0000) GS:ffff88822be80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000598 CR3: 0000000106506000 CR4: 00000000000006e0
       Call Trace:
        br_ioctl_stub+0x7c/0x441 [bridge]
        br_ioctl_call+0x6d/0x8a
        dev_ifsioc+0x325/0x4e8
        dev_ioctl+0x46b/0x4e1
        sock_do_ioctl+0x7b/0xad
        sock_ioctl+0x2de/0x2f2
        vfs_ioctl+0x1e/0x2b
        __do_sys_ioctl+0x63/0x86
        do_syscall_64+0xcb/0xf2
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7fc1e3589427
       Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
       RSP: 002b:00007ffc8d501d38 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007fc1e3589427
       RDX: 00007ffc8d501d60 RSI: 00000000000089a3 RDI: 0000000000000003
       RBP: 00007ffc8d501d60 R08: 0000000000000000 R09: fefefeff77686d74
       R10: fffffffffffff8f9 R11: 0000000000000202 R12: 00007ffc8d502e06
       R13: 00007ffc8d502e06 R14: 0000000000000000 R15: 0000000000000000
       Modules linked in: bridge stp llc bonding ipv6 virtio_net [last unloaded: llc]
       CR2: 0000000000000598
      
      Reported-by: syzbot+79f4a8692e267bdb7227@syzkaller.appspotmail.com
      Fixes: ad2f99ae ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fix ioctl old_deviceless bridge argument · cbd7ad29
      Nikolay Aleksandrov authored
      Commit ad2f99ae ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
      changed the source of the argument copy in bridge's old_deviceless() from
      args[1] (user ptr to device name) to uarg (ptr to ioctl arguments), causing
      the wrong device name to be used.
      
      Example (broken, bridge exists but is up):
      $ brctl delbr bridge
      bridge bridge doesn't exist; can't delete it
      
      Example (working):
      $ brctl delbr bridge
      bridge bridge is still up; can't delete it
      
      Fixes: ad2f99ae ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fix ioctl locking · 893b1958
      Nikolay Aleksandrov authored
      Before commit ad2f99ae ("net: bridge: move bridge ioctls out of
      .ndo_do_ioctl") the bridge ioctl calls were divided in two parts:
      the deviceless one was called by sock_ioctl and didn't expect rtnl to be
      held, while the one with a device was called by dev_ifsioc() and expected
      rtnl to be held. After the commit above they were united in a single ioctl
      stub, but it didn't take care of the locking expectations, so the two
      paths now take the same two locks in opposite order, which can deadlock:
      For sock_ioctl now we acquire  (1) br_ioctl_mutex, (2) rtnl
      and for dev_ifsioc we acquire  (1) rtnl,           (2) br_ioctl_mutex
      
      The fix is to take a reference on the netdev for dev_ifsioc() calls, drop
      rtnl, and then reacquire it in the bridge ioctl stub after br_ioctl_mutex
      has been acquired. This avoids playing locking games and makes the rule
      straightforward: we always take br_ioctl_mutex first, and then rtnl.
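A hypothetical single-threaded model of the rule (not kernel code): each lock is a flag, and demo_lock() asserts that br_ioctl_mutex is never taken while rtnl is held. The device path pins the device, drops rtnl so the stub can take both locks in the usual order, and then retakes rtnl because its caller is assumed to expect it held on return.

```c
#include <assert.h>

enum demo_lock { DEMO_BR_IOCTL_MUTEX, DEMO_RTNL, DEMO_NLOCKS };

struct demo_ctx {
	int held[DEMO_NLOCKS];	/* 1 while the lock is held */
	int dev_refcnt;		/* stands in for the netdev reference */
};

static void demo_lock(struct demo_ctx *c, enum demo_lock l)
{
	assert(!c->held[l]);
	/* ordering rule: never take br_ioctl_mutex while holding rtnl */
	assert(!(l == DEMO_BR_IOCTL_MUTEX && c->held[DEMO_RTNL]));
	c->held[l] = 1;
}

static void demo_unlock(struct demo_ctx *c, enum demo_lock l)
{
	assert(c->held[l]);
	c->held[l] = 0;
}

/* Bridge ioctl stub: br_ioctl_mutex first, then rtnl. */
static void demo_br_ioctl_stub(struct demo_ctx *c)
{
	demo_lock(c, DEMO_BR_IOCTL_MUTEX);
	demo_lock(c, DEMO_RTNL);
	/* bridge work would run here with both locks held */
	demo_unlock(c, DEMO_RTNL);
	demo_unlock(c, DEMO_BR_IOCTL_MUTEX);
}

/* Deviceless path (sock_ioctl): entered without rtnl. */
static void demo_sock_ioctl(struct demo_ctx *c)
{
	demo_br_ioctl_stub(c);
}

/* Device path (dev_ifsioc): entered with rtnl held. */
static void demo_dev_ifsioc(struct demo_ctx *c)
{
	assert(c->held[DEMO_RTNL]);
	c->dev_refcnt++;		/* like dev_hold() */
	demo_unlock(c, DEMO_RTNL);
	demo_br_ioctl_stub(c);
	demo_lock(c, DEMO_RTNL);	/* caller expects rtnl on return */
	c->dev_refcnt--;		/* like dev_put() */
}
```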
      
      Reported-by: syzbot+34fe5894623c4ab1b379@syzkaller.appspotmail.com
      Fixes: ad2f99ae ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: Revert use of struct_size() helper · 4167a960
      Gustavo A. R. Silva authored
      Revert the use of struct_size() and stay with IP_MSFILTER_SIZE() for
      now, as in this case the size of struct ip_msfilter didn't change with
      the addition of the flexible array imsf_slist_flex[]. So, if we use
      struct_size(), we will be allocating and calculating the size of
      struct ip_msfilter with one too many items for imsf_slist_flex[].
      
      We might use struct_size() in the future, but for now let's stay
      with IP_MSFILTER_SIZE().
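The off-by-one can be shown with plain arithmetic. In this sketch, struct demo_msfilter is a hypothetical stand-in that keeps a legacy one-element array, so sizeof() already counts one element. demo_msfilter_size() follows the assumed shape of IP_MSFILTER_SIZE() (subtract the built-in element, add numsrc elements), while demo_struct_size_like() mimics what struct_size() computes when it assumes sizeof() excludes the array (overflow handling omitted).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: sizeof() includes one slist element. */
struct demo_msfilter {
	uint32_t multiaddr;
	uint32_t interface;
	uint32_t fmode;
	uint32_t numsrc;
	uint32_t slist[1];
};

/* Legacy-style size: drop the built-in element, then add numsrc. */
static size_t demo_msfilter_size(size_t numsrc)
{
	return sizeof(struct demo_msfilter) - sizeof(uint32_t) +
	       numsrc * sizeof(uint32_t);
}

/* struct_size()-style size: no subtraction, so one element too many here. */
static size_t demo_struct_size_like(size_t numsrc)
{
	return sizeof(struct demo_msfilter) + numsrc * sizeof(uint32_t);
}
```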
      
      Fixes: 2d3e5caf ("net/ipv4: Replace one-element array with flexible-array member")
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix GRO skb truesize update · af352460
      Paolo Abeni authored
      Commit 5e10da53 ("skbuff: allow 'slow_gro' for skb carring sock
      reference") introduced a serious regression in the GRO layer by setting
      the wrong truesize for stolen-head skbs.
      
      Restore the correct truesize: SKB_DATA_ALIGN(...) instead of
      SKB_TRUESIZE(...)
      Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Fixes: 5e10da53 ("skbuff: allow 'slow_gro' for skb carring sock reference")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipa-runtime-pm' · 83945480
      David S. Miller authored
      Alex Elder says:
      
      ====================
      net: ipa: more work toward runtime PM
      
      The first two patches in this series are basically bug fixes, but in
      practice I don't think we've seen the problems they might cause.
      
      The third patch moves clock and interconnect related error messages
      around a bit, reporting better information and doing so in the
      functions where they are enabled or disabled (rather than those
      functions' callers).
      
      The last three patches move power-related code into "ipa_clock.c",
      as a step toward generalizing the purpose of that source file.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: move IPA flags field · afb08b7e
      Alex Elder authored
      The ipa->flags field is only ever used in "ipa_clock.c", related to
      suspend/resume activity.
      
      Move the definition of the ipa_flag enumerated type to "ipa_clock.c",
      and move the flags field from the ipa structure to the ipa_clock
      structure.  Rename the type and its values to include "power" or
      "POWER" in the name.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: move ipa_suspend_handler() · afe1baa8
      Alex Elder authored
      Move ipa_suspend_handler() into "ipa_clock.c" from "ipa_main.c", to
      group it with the rest of the suspend/resume code.  This IPA interrupt
      is triggered if an IPA RX endpoint is suspended but has a packet to
      be delivered.
      
      Introduce ipa_power_setup() and ipa_power_teardown() to add and
      remove the handler for the IPA SUSPEND interrupt at the same place
      as before, while allowing the handler to remain private.
      
      The "power" naming convention will be adopted elsewhere in this
      file as well (soon).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: move IPA power operations to ipa_clock.c · 73ff316d
      Alex Elder authored
      Move ipa_suspend() and ipa_resume(), as well as the definition of
      the ipa_pm_ops structure into "ipa_clock.c".  Make ipa_pm_ops public
      and declare it as extern in "ipa_clock.h".
      
      This is part of centralizing IPA power management functionality into
      "ipa_clock.c" (the file will eventually get a name change).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: improve IPA clock error messages · 8ee7c40a
      Alex Elder authored
      Rearrange messages reported when errors occur in the IPA clock code,
      so that the specific interconnect is identified when an error occurs
      enabling or disabling it, or the core clock is indicated when an
      error occurs enabling it.
      
      Have ipa_interconnect_disable() return zero or the negative error
      value returned by the first interconnect that produced an error
      when disabled.  For now, the callers ignore the returned value.
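A sketch of the "first error wins" pattern, using a hypothetical struct demo_icc whose disable_ret field simulates the result of disabling each path. It assumes the loop keeps disabling the remaining interconnects after a failure; the commit text only states that the first error is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical interconnect path with a simulated disable result. */
struct demo_icc {
	int disable_ret;	/* 0 or a negative errno */
	int enabled;
};

/* Try to disable every path; return 0 or the first error seen. */
static int demo_interconnect_disable(struct demo_icc *icc, size_t count)
{
	int result = 0;

	for (size_t i = 0; i < count; i++) {
		int ret = icc[i].disable_ret;	/* stands in for the real call */

		if (ret) {
			if (!result)
				result = ret;	/* remember only the first error */
			continue;		/* this path stays enabled */
		}
		icc[i].enabled = 0;
	}
	return result;
}
```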
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: reorder netdev pointer assignments · 10cc73c4
      Alex Elder authored
      Assign the ipa->modem_netdev and endpoint->netdev pointers *before*
      registering the network device.  As soon as the device is
      registered it can be opened, and by that time we'll want those
      pointers valid.
      
      Similarly, don't make those pointers NULL until *after* the modem
      network device is unregistered in ipa_modem_stop().
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: don't suspend/resume modem if not up · 30c2515b
      Alex Elder authored
      The modem network device is set up by ipa_modem_start().  But its
      TX queue is not actually started and endpoints enabled until it is
      opened.
      
      So avoid stopping the modem network device TX queue and disabling
      endpoints on suspend or stop unless the netdev is marked UP.  And
      skip attempting to resume unless it is UP.
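A minimal sketch of the guard, using a hypothetical struct demo_modem whose up field stands in for the netdev being marked UP. Suspend and resume only touch the TX queue and endpoints if the device was opened, since otherwise they were never started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical modem state. */
struct demo_modem {
	bool up;		/* netdev has been opened (UP) */
	bool tx_queue_running;
	bool endpoints_enabled;
};

static void demo_modem_suspend(struct demo_modem *m)
{
	if (!m->up)
		return;	/* never opened: nothing was started */
	m->tx_queue_running = false;
	m->endpoints_enabled = false;
}

static void demo_modem_resume(struct demo_modem *m)
{
	if (!m->up)
		return;	/* do not start what open() never started */
	m->endpoints_enabled = true;
	m->tx_queue_running = true;
}
```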
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sja1105-H' · 1f52247e
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      NXP SJA1105 driver support for "H" switch topologies
      
      Changes in v3:
      Preserve the behavior of dsa_tree_setup_default_cpu() which is to pick
      the first CPU port and not the last.
      
      Changes in v2:
      Send as non-RFC, drop the patches for discarding DSA-tagged packets on
      user ports and DSA-untagged packets on DSA and CPU ports for now.
      
      NXP builds boards like the Bluebox 3 where there are multiple SJA1110
      switches connected to an LX2160A, but they are also connected to each
      other. I call this topology an "H" tree because of the lateral
      connection between switches. A piece extracted from a non-upstream
      device tree looks like this:
      
      &spi_bridge {
              /* SW1 */
              ethernet-switch@0 {
                      compatible = "nxp,sja1110a";
                      reg = <0>;
                      dsa,member = <0 0>;
      
                      ethernet-ports {
                              #address-cells = <1>;
                              #size-cells = <0>;
      
                              /* SW1_P1 */
                              port@1 {
                                      reg = <1>;
                                      label = "con_2x20";
                                      phy-mode = "sgmii";
      
                                      fixed-link {
                                              speed = <1000>;
                                              full-duplex;
                                      };
                              };
      
                              port@2 {
                                      reg = <2>;
                                      ethernet = <&dpmac17>;
                                      phy-mode = "rgmii-id";
      
                                      fixed-link {
                                              speed = <1000>;
                                              full-duplex;
                                      };
                              };
      
                              port@3 {
                                      reg = <3>;
                                      label = "1ge_p1";
                                      phy-mode = "rgmii-id";
                                      phy-handle = <&sw1_mii3_phy>;
                              };
      
                              sw1p4: port@4 {
                                      reg = <4>;
                                      link = <&sw2p1>;
                                      phy-mode = "sgmii";
      
                                      fixed-link {
                                              speed = <1000>;
                                              full-duplex;
                                      };
                              };
      
                              port@5 {
                                      reg = <5>;
                                      label = "trx1";
                                      phy-mode = "internal";
                                      phy-handle = <&sw1_port5_base_t1_phy>;
                              };
      
                              port@6 {
                                      reg = <6>;
                                      label = "trx2";
                                      phy-mode = "internal";
                                      phy-handle = <&sw1_port6_base_t1_phy>;
                              };
      
                              port@7 {
                                      reg = <7>;
                                      label = "trx3";
                                      phy-mode = "internal";
                                      phy-handle = <&sw1_port7_base_t1_phy>;
                              };
      
                              port@8 {
                                      reg = <8>;
                                      label = "trx4";
                                      phy-mode = "internal";
                                      phy-handle = <&sw1_port8_base_t1_phy>;
                              };
      
                              port@9 {
                                      reg = <9>;
                                      label = "trx5";
                                      phy-mode = "internal";
                                      phy-handle = <&sw1_port9_base_t1_phy>;
                              };
      
                              port@a {
                                      reg = <10>;
                                      label = "trx6";
                                      phy-mode = "internal";
                                      phy-handle = <&sw1_port10_base_t1_phy>;
                              };
                      };
              };
      
              /* SW2 */
              ethernet-switch@2 {
                      compatible = "nxp,sja1110a";
                      reg = <2>;
                      dsa,member = <0 1>;
      
                      ethernet-ports {
                              #address-cells = <1>;
                              #size-cells = <0>;
      
                              sw2p1: port@1 {
                                      reg = <1>;
                                      link = <&sw1p4>;
                                      phy-mode = "sgmii";
      
                                      fixed-link {
                                              speed = <1000>;
                                              full-duplex;
                                      };
                              };
      
                              port@2 {
                                      reg = <2>;
                                      ethernet = <&dpmac18>;
                                      phy-mode = "rgmii-id";
      
                                      fixed-link {
                                              speed = <1000>;
                                              full-duplex;
                                      };
                              };
      
                              port@3 {
                                      reg = <3>;
                                      label = "1ge_p2";
                                      phy-mode = "rgmii-id";
                                      phy-handle = <&sw2_mii3_phy>;
                              };
      
                              port@4 {
                                      reg = <4>;
                                      label = "to_sw3";
                                      phy-mode = "2500base-x";
      
                                      fixed-link {
                                              speed = <2500>;
                                              full-duplex;
                                      };
                              };
      
                              port@5 {
                                      reg = <5>;
                                      label = "trx7";
                                      phy-mode = "internal";
                                      phy-handle = <&sw2_port5_base_t1_phy>;
                              };
      
                              port@6 {
                                      reg = <6>;
                                      label = "trx8";
                                      phy-mode = "internal";
                                      phy-handle = <&sw2_port6_base_t1_phy>;
                              };
      
                              port@7 {
                                      reg = <7>;
                                      label = "trx9";
                                      phy-mode = "internal";
                                      phy-handle = <&sw2_port7_base_t1_phy>;
                              };
      
                              port@8 {
                                      reg = <8>;
                                      label = "trx10";
                                      phy-mode = "internal";
                                      phy-handle = <&sw2_port8_base_t1_phy>;
                              };
      
                              port@9 {
                                      reg = <9>;
                                      label = "trx11";
                                      phy-mode = "internal";
                                      phy-handle = <&sw2_port9_base_t1_phy>;
                              };
      
                              port@a {
                                      reg = <10>;
                                      label = "trx12";
                                      phy-mode = "internal";
                                      phy-handle = <&sw2_port10_base_t1_phy>;
                              };
                      };
              };
      };
      
      Basically it is a single DSA tree with 2 "ethernet" properties, i.e. a
      multi-CPU-port system. There is also a DSA link between the switches,
      but it is not a daisy chain topology, i.e. there is no "upstream" and
      "downstream" switch, the DSA link is only to be used for the bridge data
      plane (autonomous forwarding between switches, between the RJ-45 ports
      and the automotive Ethernet ports), otherwise all traffic that should
      reach the host should do so through the dedicated CPU port of the switch.
      
      Of course, plain forwarding in this topology is bound to create packet
      loops. I have thought long and hard about strategies to cut forwarding
      in such a way as to prevent loops but also not impede normal operation
      of the network on such a system, and I believe I have found a solution
      that does work as expected. This relies heavily on DSA's recent ability
      to perform RX filtering towards the host by installing MAC addresses as
      static FDB entries. Since we have 2 distinct DSA masters, we have 2
      distinct MAC addresses, and if the bridge is configured to have its own
      MAC address that makes it 3 distinct MAC addresses. The bridge core,
      plus the switchdev_handle_fdb_add_to_device() extension, handle each MAC
      address by replicating it to each port of the DSA switch tree. So the
      end result is that both switch 1 and switch 2 will have static FDB
      entries towards their respective CPU ports for the 3 MAC addresses
      corresponding to the DSA masters and to the bridge net device (and of
      course, towards any station learned on a foreign interface).
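      To make the FDB replication above concrete, here is a minimal C sketch
      under simplifying assumptions: struct sw, struct fdb_entry and
      install_host_addrs() are hypothetical names, not DSA or switchdev APIs,
      and each switch is assumed to have exactly one local CPU port. It only
      models the end state described above: every switch holds one static
      entry per host MAC address, pointing at its own CPU port.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_FDB 16

/* Hypothetical, simplified model; not the kernel's struct dsa_switch. */
struct fdb_entry {
	uint8_t mac[6];
	int port;
	int is_static;
};

struct sw {
	int cpu_port;                  /* this switch's own CPU port */
	struct fdb_entry fdb[MAX_FDB];
	int n_fdb;
};

/*
 * Replicate every host address (DSA masters, bridge) to every switch in the
 * tree, each entry pointing at that switch's local CPU port. The entries are
 * static, so hardware learning elsewhere cannot migrate them.
 */
static void install_host_addrs(struct sw *tree, int n_sw,
			       const uint8_t (*addrs)[6], int n_addrs)
{
	for (int s = 0; s < n_sw; s++) {
		for (int a = 0; a < n_addrs; a++) {
			struct fdb_entry *e;

			assert(tree[s].n_fdb < MAX_FDB);
			e = &tree[s].fdb[tree[s].n_fdb++];
			memcpy(e->mac, addrs[a], sizeof(e->mac));
			e->port = tree[s].cpu_port;
			e->is_static = 1;
		}
	}
}
```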
      
      So I think the basic design works, and it is just as fragile as any
      other multi-CPU-port system is bound to be in terms of its reliance on
      static FDB entries towards the host (if hardware address learning were
      used on the CPU ports instead, MAC addresses would randomly bounce
      between one CPU port and the other). In fact, I think it is even
      better to start DSA's support of multi-CPU-port systems with something
      small like the NXP Bluebox 3, because that gives code paths which were
      specifically designed for it, like dsa_switch_host_address_match(),
      some time to break in, and this board needs no user space
      configuration of CPU ports, such as static assignments between user and
      CPU ports, or bonding between the CPU ports/DSA masters.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: enable address learning on cascade ports · 81d45898
      Vladimir Oltean authored
      Right now, address learning is disabled on DSA ports, which means that a
      packet received over a DSA port from a cross-chip switch will be flooded
      to unrelated ports.
      
      It is desirable to eliminate that, but for that we need a breakdown of
      the possibilities for the sja1105 driver. A DSA port can be:
      
      - a downstream-facing cascade port. This is simple because it will
        always receive packets from a downstream switch, and there should be
        no other route to reach that downstream switch in the first place,
        which means it should be safe to learn the source MAC addresses of
        these packets towards that switch.
      
      - an upstream-facing cascade port. This receives packets either:
        * autonomously forwarded by an upstream switch (and therefore these
          packets belong to the data plane of a bridge, so address learning
          should be ok), or
        * injected from the CPU. This deserves further discussion, as normally,
          an upstream-facing cascade port is no different from the CPU port
          itself. But with "H" topologies (a DSA link towards a switch that
          has its own CPU port), these are more "laterally-facing" cascade
          ports than they are "upstream-facing". Here, there is a risk that
          the port might learn the host addresses on the wrong port (on the
          DSA port instead of on its own CPU port), but this is solved by
          DSA's RX filtering infrastructure, which installs the host addresses
          as static FDB entries on the CPU port of all switches in an "H" tree.
          So even if the switch attempts to migrate the FDB entry from the
          CPU port to the laterally-facing cascade port, it will fail,
          because the FDB entry that already exists is static and cannot
          migrate. So address learning should be safe for this configuration
          too.
      
      OK, so what about other MAC addresses coming from the host, not
      necessarily the bridge local FDB entries? What about MAC addresses
      dynamically learned on foreign interfaces, isn't there a risk that
      cascade ports will learn these entries dynamically when they are
      supposed to be delivered towards the CPU port? Yes, there is, and this
      is why we also need to enable the assisted learning feature: to snoop
      for these addresses and write them to hardware as static FDB entries
      towards the CPU, which makes the switch's learning process on the
      cascade ports ineffective for them. With assisted learning enabled, the
      hardware learning on the CPU port must be disabled.
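      The resulting per-port policy can be summarised in a small C sketch.
      The enum and hw_learning() are hypothetical and only model the
      configuration described here (assisted learning enabled, user ports
      assumed to be bridged); they are not the sja1105 driver's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical port classification following the cases listed above. */
enum port_kind {
	PORT_USER,               /* bridged user port */
	PORT_CPU,
	PORT_CASCADE_DOWNSTREAM, /* DSA link towards a downstream switch */
	PORT_CASCADE_UPSTREAM,   /* upstream- or "laterally-facing" DSA link */
};

/* Should hardware address learning be enabled on a port of this kind? */
static bool hw_learning(enum port_kind kind)
{
	switch (kind) {
	case PORT_CPU:
		/* Host addresses come from assisted learning as static entries. */
		return false;
	case PORT_USER:
	case PORT_CASCADE_DOWNSTREAM:
	case PORT_CASCADE_UPSTREAM:
		/* Safe on cascade ports because host addresses are static. */
		return true;
	}
	assert(!"unknown port kind");
	return false;
}
```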
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: suppress TX packets from looping back in "H" topologies · 0f9b762c
      Vladimir Oltean authored
      H topologies like this one have a problem:
      
               eth0                                                     eth1
                |                                                        |
             CPU port                                                CPU port
                |                        DSA link                        |
       sw0p0  sw0p1  sw0p2  sw0p3  sw0p4 -------- sw1p4  sw1p3  sw1p2  sw1p1  sw1p0
         |             |      |                            |      |             |
       user          user   user                         user   user          user
       port          port   port                         port   port          port
      
      Basically, any packet sent by the eth0 DSA master can be flooded on the
      interconnecting DSA link sw0p4 <-> sw1p4, and it will be received by the
      eth1 DSA master too. In effect, we are talking to ourselves.
      
      In VLAN-unaware mode, these packets are encoded using a tag_8021q TX
      VLAN, which dsa_8021q_rcv() rightfully cannot decode, so it complains.
      Whereas in VLAN-aware mode, the packets are encoded with a bridge VLAN
      which _can_ be decoded by the tagger running on eth1, so it will attempt
      to reinject that packet into the network stack (the bridge, if there is
      any port under eth1 that is under a bridge). In the case where the ports
      under eth1 are under the same cross-chip bridge as the ports under eth0,
      the TX packets will even be learned as RX packets. The only thing that
      will prevent loops with the software bridging path, and therefore
      disaster, is that the source port and the destination port are in the
      same hardware domain, and the bridge will receive packets from the
      driver with skb->offload_fwd_mark = true and will not forward between
      the two.
      
      The proper solution to this problem is to detect H topologies and
      enforce that all packets are received through the local switch, i.e.
      that we do not attempt to receive packets on our CPU port from switches
      that have a CPU port of their own. This solution is viable because the
      MAC addresses which should be filtered towards the host are installed by
      DSA as static FDB entries towards the CPU port of each switch.
      
      TX from a CPU port towards the DSA port continues to be allowed. This is
      because sja1105 supports bridge TX forwarding offload, and the skb->dev
      used initially for xmit does not have any direct correlation with where
      the station that will respond to that packet is connected. It may very
      well happen that when we send a ping through a br0 interface that spans
      all switch ports, the xmit packet will exit the system through a DSA
      switch interface under eth1 (say sw1p2), but the destination station is
      connected to a switch port under eth0, like sw0p0. So the switch under
      eth1 needs to communicate on TX with the switch under eth0. The
      response, however, will not follow the same path, but instead, this
      patch enforces that the response is sent by the first switch directly to
      its DSA master which is eth0.
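      A minimal C sketch of the forwarding rule, assuming a hypothetical
      per-port view of one switch (struct port_info, may_forward() and the
      peer_has_cpu flag are illustrative, not sja1105 or DSA API). It models
      only the direction-dependent check described above; bridging domains
      and VLANs are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PORTS 8

/* Hypothetical per-port view of one switch in an "H" tree. */
struct port_info {
	bool is_cpu;
	bool is_dsa;
	bool peer_has_cpu; /* DSA ports only: the remote switch has its own CPU port */
};

/* May a frame that ingresses on port @from be forwarded to port @to? */
static bool may_forward(const struct port_info *ports, int from, int to)
{
	assert(from >= 0 && from < MAX_PORTS);
	assert(to >= 0 && to < MAX_PORTS);

	/* RX: never deliver to our CPU port what came from a switch that has its own. */
	if (ports[from].is_dsa && ports[from].peer_has_cpu && ports[to].is_cpu)
		return false;

	/* Everything else, including TX from the CPU port to the DSA port. */
	return true;
}
```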
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: increase MTU to account for VLAN header on DSA ports · 777e55e3
      Vladimir Oltean authored
      Since all packets are transmitted as VLAN-tagged over a DSA link (this
      VLAN tag represents the tag_8021q header), we need to increase the MTU
      of these interfaces to account for the possibility that we are already
      transporting a user-visible VLAN header.
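      The arithmetic is one extra 802.1Q tag (4 bytes) on top of the usual
      MTU, e.g. 1500 becomes 1504 on a DSA port. A hypothetical helper
      illustrating this, not the driver's code:

```c
#include <assert.h>
#include <stdbool.h>

#define VLAN_HLEN 4 /* bytes in one 802.1Q tag, same value as the kernel's VLAN_HLEN */

/*
 * Hypothetical helper: the MTU to program on a port so that a frame which
 * already carries a user-visible VLAN tag still fits once the tag_8021q
 * VLAN is added on a DSA link.
 */
static int port_mtu(int user_mtu, bool is_dsa_port)
{
	assert(user_mtu > 0);
	return is_dsa_port ? user_mtu + VLAN_HLEN : user_mtu;
}
```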
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: manage VLANs on cascade ports · c5130029
      Vladimir Oltean authored
      Since commit ed040abc ("net: dsa: sja1105: use 4095 as the private
      VLAN for untagged traffic"), this driver uses a reserved value as pvid
      for the host port (DSA CPU port). Control packets which are sent as
      untagged get classified to this VLAN, and all ports are members of it
      (this is to be expected for control packets).
      
      Manage all cascade ports in the same way and allow control packets to
      egress everywhere.
      
      Also, all VLANs need to be sent as egress-tagged on all cascade ports.
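      A sketch of the resulting VLAN table entries, assuming a hypothetical
      representation with per-port bitmasks (struct vlan_entry,
      add_cascade_ports() and private_vlan() are illustrative names, not the
      sja1105 static config layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRIVATE_VID 4095 /* reserved pvid for untagged control traffic */

/* Hypothetical VLAN table entry with per-port bitmasks (bit n = port n). */
struct vlan_entry {
	uint16_t vid;
	uint32_t members;
	uint32_t tagged; /* ports on which this VLAN egresses tagged */
};

/* Every VLAN is carried on the cascade ports, always egress-tagged. */
static void add_cascade_ports(struct vlan_entry *v, uint32_t cascade_mask)
{
	assert(v != NULL);
	v->members |= cascade_mask;
	v->tagged |= cascade_mask;
}

/* The private VLAN: all ports are members, so control packets egress anywhere. */
static struct vlan_entry private_vlan(uint32_t all_ports, uint32_t cascade_mask)
{
	struct vlan_entry v = { .vid = PRIVATE_VID, .members = all_ports, .tagged = 0 };

	add_cascade_ports(&v, cascade_mask);
	return v;
}
```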
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: manage the forwarding domain towards DSA ports · 3fa21270
      Vladimir Oltean authored
      Manage DSA links towards other switches, be they host ports or cascade
      ports, the same as the CPU port, i.e. allow forwarding and flooding
      unconditionally from all user ports.
      
      We always send packets VLAN-tagged on a DSA port, and we rely on the
      cross-chip notifiers from tag_8021q to install the RX VLAN of a switch
      port only on the proper remote ports of another switch (the ports that
      are in the same bridging domain). So if there is no cross-chip bridging
      in the system, the flooded packets will be sent on the DSA ports too,
      but they will be dropped by the remote switches due to either
      (a) a lack of the RX VLAN in the VLAN table of the ingress DSA port, or
      (b) a lack of valid destinations for those packets, due to a lack of the
          RX VLAN on the user ports of the switch.
      
      Note that switches which only transport packets in a cross-chip bridge,
      but have no user ports of their own as part of that bridge, such as
      switch 1 in this case:
      
                          DSA link                   DSA link
        sw0p0 sw0p1 sw0p2 -------- sw1p0 sw1p2 sw1p3 -------- sw2p0 sw2p2 sw2p3
      
      ip link set sw0p0 master br0
      ip link set sw2p3 master br0
      
      will still work, because the tag_8021q cross-chip notifiers keep the RX
      VLANs installed on all DSA ports.
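      Both halves of this scheme can be sketched in C with hypothetical
      per-switch port bitmasks (bit n = port n); user_port_reach() and
      remote_accepts() are illustrative names only. As a simplification,
      remote_accepts() counts any other member port, user or DSA, as a valid
      destination, which is what lets a transit switch like switch 1 above
      pass the frame on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A user port may always forward and flood towards CPU and DSA ports. */
static uint32_t user_port_reach(uint32_t bridge_peers, uint32_t cpu_mask,
				uint32_t dsa_mask)
{
	return bridge_peers | cpu_mask | dsa_mask;
}

/*
 * On the remote switch, a frame arriving on DSA port @ingress, tagged with a
 * tag_8021q RX VLAN whose member ports are @rx_vlan_members, survives only
 * if (a) the ingress port is a member and (b) some other port is a member.
 */
static bool remote_accepts(uint32_t rx_vlan_members, int ingress)
{
	uint32_t bit;

	assert(ingress >= 0 && ingress < 32);
	bit = UINT32_C(1) << ingress;
	if (!(rx_vlan_members & bit))
		return false;                     /* case (a) */
	return (rx_vlan_members & ~bit) != 0; /* case (b) */
}
```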
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: configure the cascade ports based on topology · 30a100e6
      Vladimir Oltean authored
      The sja1105 switch family has a feature called "cascade ports" which can
      be used in topologies where multiple SJA1105/SJA1110 switches are daisy
      chained. Upstream switches set this bit for the DSA link towards the
      downstream switches. This is used when the upstream switch receives a
      control packet (PTP, STP) from a downstream switch: if the source port
      of a control packet is marked as a cascade port, then the source port,
      switch ID and RX timestamp will not be taken again on the upstream
      switch, since it is assumed that this has already been done by the
      downstream switch (the leaf port in the tree) and that the CPU has
      everything it needs to decode the information from this packet.
      
      We need to distinguish between an upstream-facing DSA link and a
      downstream-facing DSA link, because the upstream-facing DSA links are
      "host ports" for the SJA1105/SJA1110 switches, and the downstream-facing
      DSA links are "cascade ports".
      
      Note that SJA1105 supports a single cascade port, so only daisy chain
      topologies work. With SJA1110, there can be more complex topologies such
      as:
      
                          eth0
                           |
                       host port
                           |
       sw0p0    sw0p1    sw0p2    sw0p3    sw0p4
         |        |                 |        |
       cascade  cascade            user     user
        port     port              port     port
         |        |
         |        |
         |        |
         |       host
         |       port
         |        |
         |      sw1p0    sw1p1    sw1p2    sw1p3    sw1p4
         |                 |        |        |        |
         |                user     user     user     user
        host              port     port     port     port
        port
         |
       sw2p0    sw2p1    sw2p2    sw2p3    sw2p4
                  |        |        |        |
                 user     user     user     user
                 port     port     port     port
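      The host/cascade distinction can be sketched as follows, assuming each
      switch knows which local port leads towards its CPU (the @upstream
      parameter); classify() and enum port_role are hypothetical, not driver
      code. For the SJA1110 topology above, sw0p2 comes out as a host port,
      sw0p0 and sw0p1 as cascade ports, and sw1p0 and sw2p0 as host ports.

```c
#include <assert.h>
#include <stdbool.h>

enum port_role { ROLE_USER, ROLE_HOST, ROLE_CASCADE };

/*
 * Hypothetical classification for one switch. @upstream is the local port
 * leading towards the CPU: the CPU port itself on the top switch, an
 * upstream-facing DSA link on the others.
 */
static enum port_role classify(int port, bool is_cpu, bool is_dsa, int upstream)
{
	assert(port >= 0 && upstream >= 0);
	if (is_cpu)
		return ROLE_HOST;
	if (is_dsa)
		return port == upstream ? ROLE_HOST : ROLE_CASCADE;
	return ROLE_USER;
}
```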
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: give preference to local CPU ports · 2c0b0325
      Vladimir Oltean authored
      Consider an "H" switch topology, where 2 switches are connected as
      follows:
      
               eth0                                                     eth1
                |                                                        |
             CPU port                                                CPU port
                |                        DSA link                        |
       sw0p0  sw0p1  sw0p2  sw0p3  sw0p4 -------- sw1p4  sw1p3  sw1p2  sw1p1  sw1p0
         |             |      |                            |      |             |
       user          user   user                         user   user          user
       port          port   port                         port   port          port
      
      basically one where each switch has its own CPU port for termination,
      but there is also a DSA link in case packets need to be forwarded in
      hardware between one switch and another.
      
      DSA insists on seeing this as a daisy chain topology, basically
      registering all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and
      disregarding eth1 as a valid DSA master.
      
      This is only half the story, since when asked using dsa_port_is_cpu(),
      DSA will respond that sw1p1 is a CPU port, however one which has no
      dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
      
      Furthermore, consider a driver for switches which support only one
      upstream port. This driver iterates through its ports and checks using
      dsa_is_upstream_port() whether the current port is an upstream one.
      For switch 1, two ports pass the "is upstream port" checks:
      
      - sw1p4 is an upstream port because it is a routing port towards the
        dedicated CPU port assigned using dsa_tree_setup_default_cpu()
      
      - sw1p1 is also an upstream port because it is a CPU port, albeit one
        that is disabled. This is because dsa_upstream_port() returns:
      
      	if (!cpu_dp)
      		return port;
      
        which means that if @dp does not have a ->cpu_dp pointer (which is a
        characteristic of CPU ports themselves as well as unused ports), then
        @dp is its own upstream port.
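      The double match can be reproduced with a simplified model of the
      fallback quoted above. struct port, upstream_port() and
      count_upstream_ports() are hypothetical stand-ins for struct dsa_port,
      dsa_upstream_port() and the driver's loop, not the kernel code. With
      every port of switch 1 assigned to sw0p1 through sw1p4, both sw1p4 and
      the unused CPU port sw1p1 count as upstream; if the ports are instead
      assigned to sw1p1, only sw1p1 remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct dsa_port. */
struct port {
	int index;
	const struct port *cpu_dp; /* dedicated CPU port, NULL if none */
	int towards_cpu;           /* local port on the route to cpu_dp */
};

/* Same fallback as quoted above: no ->cpu_dp means "I am my own upstream". */
static int upstream_port(const struct port *dp)
{
	if (!dp->cpu_dp)
		return dp->index;
	return dp->towards_cpu;
}

/* How many ports of one switch would a driver see as upstream ports? */
static int count_upstream_ports(const struct port *ports, int n)
{
	int count = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (upstream_port(&ports[i]) == ports[i].index)
			count++;
	return count;
}
```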
      
      So the driver for switch 1 rightfully says: I have two upstream ports,
      but I don't support multiple upstream ports! So let me error out, I
      don't know which one to choose and what to do with the other one.
      
      Generally I am against enforcing any default policy in the kernel in
      terms of user to CPU port assignment (like round robin or such), but this
      case is different. To solve the conundrum, one would have to:
      
      - Disable sw1p1 in the device tree or mark it as "not a CPU port" in
        order to comply with DSA's view of this topology as a daisy chain,
        where the termination traffic from switch 1 must pass through switch 0.
        This is counter-productive because it wastes 1Gbps of termination
        throughput in switch 1.
      - Disable the DSA link between sw0p4 and sw1p4 and do software
        forwarding between switch 0 and 1, and basically treat the switches as
        part of disjoint switch trees. This is counter-productive because it
        wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
      - Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
        work, but it makes cross-chip bridging impossible. In this setup we
        would need to have 2 separate bridges, br0 spanning the ports of
        switch 0, and br1 spanning the ports of switch 1, and the "DSA links
        treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
        the gateway ports between one bridge and another. This is hard to
        manage from the perspective of a user, who wants to have a unified view of
        the switching fabric and the ability to transparently add ports to the
        same bridge. VLANs would also need to be explicitly managed by the
        user on these gateway ports.
      
      So it seems that the only reasonable thing to do is to make DSA prefer
      CPU ports that are local to the switch. Meaning that by default, the
      user and DSA ports of switch 0 will get assigned to the CPU port from
      switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
      assigned to the CPU port from switch 1.
      
      The way this solves the problem is that sw1p4 is no longer an upstream
      port as far as switch 1 is concerned (it no longer views sw0p1 as its
      dedicated CPU port).
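      A minimal sketch of the "prefer a local CPU port" policy, including the
      fallback to the first CPU port of the tree for switches that have no
      CPU port of their own. The flattened struct tport array and the
      function names are hypothetical, not the kernel's data structures.

```c
#include <assert.h>

/* Hypothetical flattened view of a DSA tree: each port records its switch. */
struct tport {
	int sw;     /* switch index */
	int is_cpu;
	int cpu;    /* index of the assigned CPU port, -1 if none */
};

/* First CPU port on switch @sw, or on any switch if @any_switch is set. */
static int find_cpu(const struct tport *t, int n, int sw, int any_switch)
{
	for (int i = 0; i < n; i++)
		if (t[i].is_cpu && (any_switch || t[i].sw == sw))
			return i;
	return -1;
}

/* Prefer a CPU port on the same switch; CPU ports themselves get none. */
static void assign_cpu_ports(struct tport *t, int n)
{
	int dflt = find_cpu(t, n, 0, 1);

	assert(dflt >= 0); /* a tree needs at least one CPU port */
	for (int i = 0; i < n; i++) {
		int local;

		if (t[i].is_cpu) {
			t[i].cpu = -1;
			continue;
		}
		local = find_cpu(t, n, t[i].sw, 0);
		t[i].cpu = local >= 0 ? local : dflt;
	}
}
```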
      
      So here we are: the first multi-CPU-port system that DSA supports is
      also perhaps the most uneventful one. The individual switches don't
      support multiple CPUs, however the DSA switch tree as a whole does have
      multiple CPU ports. No user space assignment of user ports to CPU ports
      is desirable, necessary, or possible.
      
      Ports that do not have a local CPU port (say there was an extra switch
      hanging off of sw0p0) default to the standard implementation of getting
      assigned to the first CPU port of the DSA switch tree. Is that good
      enough? Probably not (if the downstream switch was hanging off of switch
      1, we would most certainly prefer its CPU port to be sw1p1), but in
      order to support that use case too, we would need to traverse the
      dst->rtable in search of an optimum dedicated CPU port, one that has the
      smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
      moment, the DSA routing table structure does not keep the number of hops
      between dl->dp and dl->link_dp, and while it is probably deducible,
      there is zero justification to write that code now. Let's hope DSA will
      never have to support that use case.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: rename teardown_default_cpu to teardown_cpu_ports · 0e8eb9a1
      Vladimir Oltean authored
      Nothing that dsa_tree_teardown_default_cpu() does is specific to having
      a default CPU port. Even with multiple CPU ports,
      it would do the same thing: iterate through the ports of this switch
      tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly.
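      In a simplified model (struct dport and teardown_cpu_ports() here are
      hypothetical, reduced to the one field that matters), the teardown
      amounts to the following, independent of how many CPU ports exist:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dsa_port, reduced to ->cpu_dp. */
struct dport {
	const struct dport *cpu_dp;
};

/* Reset ->cpu_dp on every port of the tree. */
static void teardown_cpu_ports(struct dport *ports, size_t n)
{
	assert(ports != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		ports[i].cpu_dp = NULL;
}
```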
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: fix IPA v4.9 interconnects · 0fd75f57
      Alex Elder authored
      Three interconnects are defined for IPA version 4.9, but there
      should only be two.  They should also use names that match what's
      used for other platforms (and specified in the Device Tree binding).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mctp: remove duplicated assignment of pointer hdr · df7ba0eb
      Colin Ian King authored
      The pointer hdr is being initialized and also re-assigned with the
      same value from the call to function mctp_hdr. Static analysis reports
      that the initialized value is unused. The second assignment is
      duplicated and can be removed.
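      For illustration only, with hypothetical stand-ins for the MCTP header
      and for mctp_hdr() (struct hdr, struct buf and get_hdr() are not kernel
      names), the pattern after the cleanup computes the pointer once:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the header type and the mctp_hdr() accessor. */
struct hdr {
	int ver;
};

struct buf {
	struct hdr h;
};

static struct hdr *get_hdr(struct buf *b)
{
	return &b->h;
}

static int header_version(struct buf *b)
{
	/* One assignment; repeating "hdr = get_hdr(b);" would recompute the same pointer. */
	struct hdr *hdr = get_hdr(b);

	assert(hdr != NULL);
	return hdr->ver;
}
```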
      
      Addresses-Coverity: ("Unused value").
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Aug, 2021 3 commits