1. 20 Sep, 2022 40 commits
    • tcp: Introduce optional per-netns ehash. · d1e5e640
      Kuniyuki Iwashima authored
      The more sockets we have in the hash table, the longer we spend looking
      up the socket.  While running a number of small workloads on the same
      host, they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in different netns, creating 24Mi sockets
      without data transfer in the root netns causes about 10% performance
      regression for the iperf3's connection.
      
       thash_entries		sockets		length		Gbps
      	524288		      1		     1		50.7
      			   24Mi		    48		45.1
      
      It is basically related to the length of the list of each hash bucket.
      For testing purposes to see how performance drops along the length,
      I set 131072 (1Mi / 8) to thash_entries, and here's the result.
      
       thash_entries		sockets		length		Gbps
              131072		      1		     1		50.7
      			    1Mi		     8		49.9
      			    2Mi		    16		48.9
      			    4Mi		    32		47.3
      			    8Mi		    64		44.6
      			   16Mi		   128		40.6
      			   24Mi		   192		36.3
      			   32Mi		   256		32.5
      			   40Mi		   320		27.0
      			   48Mi		   384		25.0
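
The "length" column above is simply the number of sockets divided by the number of ehash buckets. A minimal sketch of that arithmetic, assuming sockets are spread uniformly over the buckets (the helper name is ours and is not part of the patch):

```python
MI = 1 << 20  # 1Mi = 2**20


def average_chain_length(sockets: int, ehash_entries: int) -> float:
    """Average number of sockets per ehash bucket under uniform hashing."""
    return sockets / ehash_entries


# (thash_entries, sockets) pairs taken from the tables above.
for entries, sockets in [(524288, 24 * MI), (131072, 8 * MI), (131072, 48 * MI)]:
    print(entries, sockets // MI, average_chain_length(sockets, entries))
```

The same division gives the average length of 3 for 48Mi sockets in a 16Mi-entry ehash.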
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP, but it's just ehash, and we still share
      the global bhash, bhash2 and lhash2.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  In addition, we can reduce lock contention.
      
      We can control the ehash size by a new sysctl knob.  However, depending
      on workloads, it will require very sensitive tuning, so we disable the
      feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
      we can fall back to using the global ehash in case we fail to allocate
      enough memory for a new ehash.  The maximum size is 16Mi, which is large
      enough that even if we have 48Mi sockets, the average list length is 3,
      and regression would be less than 1%.
      
      We can check the current ehash size by another read-only sysctl knob,
      net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
      the global ehash (per-netns ehash is disabled or failed to allocate
      memory).
      
        # dmesg | cut -d ' ' -f 5- | grep "established hash"
        TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
      
        # sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
      
        # sysctl net.ipv4.tcp_child_ehash_entries
        net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
      
        # sysctl -w net.ipv4.tcp_child_ehash_entries=100
        net.ipv4.tcp_child_ehash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
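
The transcript above suggests two rules: the requested size is rounded to a power of two (100 became 128), and a negative tcp_ehash_entries means the netns shares the global ehash. The following sketch models both. Rounding up to the next power of two is an assumption inferred from the 100 -> 128 example, and both helper names are hypothetical.

```python
def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two >= n (assumed rounding rule, inferred from 100 -> 128)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def describe_ehash(tcp_ehash_entries: int) -> str:
    """Interpret the read-only net.ipv4.tcp_ehash_entries value."""
    if tcp_ehash_entries < 0:
        return f"shares the global ehash ({-tcp_ehash_entries} buckets)"
    return f"owns a per-netns ehash ({tcp_ehash_entries} buckets)"


print(roundup_pow_of_two(100))
print(describe_ehash(-524288))
print(describe_ehash(128))
```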
      
      When two or more processes in the same netns create per-netns ehash
      concurrently with different sizes, we need to guarantee the size in
      one of the following ways:
      
        1) Share the global ehash and create per-netns ehash
      
        First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
        netns sysctl knobs where we can safely change tcp_child_ehash_entries
        and clone()/unshare() to create a per-netns ehash.
      
        2) Control write on sysctl by BPF
      
        We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
        sysctl knobs.
      
      Note that the global ehash allocated at the boot time is spread over
      available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
      pages for each per-netns ehash depending on the current process's NUMA
      policy.  By default, the allocation is done in the local node only, so
      the per-netns hash table could fully reside on a random node.  Thus,
      depending on the NUMA policy the netns is created with and the CPU the
      current thread is running on, we could see some performance differences
      for highly optimised networking applications.
      
      Note also that the default values of two sysctl knobs depend on the ehash
      size and should be tuned carefully:
      
        tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
        tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
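
As a worked example of the two formulas above, the sketch below computes both defaults for a given ehash size. Integer division is assumed, and whether the kernel applies the formulas before or after rounding the size to a power of two is not stated here, so the argument is simply "the ehash size". The function name is hypothetical.

```python
def default_limits(ehash_entries: int) -> dict:
    """Defaults derived from the ehash size, per the formulas above."""
    return {
        "tcp_max_tw_buckets": ehash_entries // 2,
        "tcp_max_syn_backlog": max(128, ehash_entries // 128),
    }


print(default_limits(128))
print(default_limits(524288))
```

With a small per-netns ehash, tcp_max_tw_buckets shrinks accordingly, which is why both knobs should be reviewed after changing tcp_child_ehash_entries.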
      
      As a bonus, we can dismantle netns faster.  Currently, while destroying
      netns, we call inet_twsk_purge(), which walks through the global ehash.
      It can be potentially big because it can have many sockets other than
      TIME_WAIT in all netns.  Splitting ehash changes that situation, where
      it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
      in each netns.
      
      With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
      to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
      Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
      keep it protocol-family-independent.
      
      In the future, we could optimise ehash lookup/iteration further by removing
      netns comparison for the per-netns ehash.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Save unnecessary inet_twsk_purge() calls. · edc12f03
      Kuniyuki Iwashima authored
      While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
      and tcpv6_net_exit_batch() for AF_INET and AF_INET6.  These commands
      trigger the kernel to walk through the potentially big ehash twice even
      though the netns has no TIME_WAIT sockets.
      
        # ip netns add test
        # ip netns del test
      
        or
      
        # unshare -n /bin/true >/dev/null
      
      When tw_refcount is 1, we need not call inet_twsk_purge() at least
      for the net.  We can save such unneeded iterations if all netns in
      net_exit_list have no TIME_WAIT sockets.  This change eliminates
      the tax by the additional unshare() described in the next patch to
      guarantee the per-netns ehash size.
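
The skip condition described above can be modelled roughly as follows. This is a toy Python model rather than the kernel code: the Netns class and the function name are hypothetical, and tw_refcount == 1 is taken to mean "no TIME_WAIT sockets", as stated above.

```python
from dataclasses import dataclass


@dataclass
class Netns:
    name: str
    tw_refcount: int = 1  # 1 means the netns holds no TIME_WAIT sockets


def needs_twsk_purge(net_exit_list) -> bool:
    """Walk the ehash only if some dying netns may still own TIME_WAIT sockets."""
    return any(net.tw_refcount != 1 for net in net_exit_list)


print(needs_twsk_purge([Netns("a"), Netns("b")]))
print(needs_twsk_purge([Netns("a"), Netns("b", tw_refcount=3)]))
```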
      
      Tested:
      
        # mount -t debugfs none /sys/kernel/debug/
        # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
        # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
        # echo function > /sys/kernel/debug/tracing/current_tracer
        # cat ./add_del_unshare.sh
        for i in `seq 1 40`
        do
            (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
        done
        wait;
        # ./add_del_unshare.sh
      
      Before the patch:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [031] ...1.   174.162765: cleanup_net <-process_one_work
          kworker/u128:0-8       [031] ...1.   174.240796: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [032] ...1.   174.244759: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [034] ...1.   174.290861: cleanup_net <-process_one_work
          kworker/u128:0-8       [039] ...1.   175.245027: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [046] ...1.   175.290541: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [037] ...1.   175.321046: cleanup_net <-process_one_work
          kworker/u128:0-8       [024] ...1.   175.941633: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [025] ...1.   176.242539: inet_twsk_purge <-tcp_sk_exit_batch
      
      After:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [038] ...1.   428.116174: cleanup_net <-process_one_work
          kworker/u128:0-8       [038] ...1.   428.262532: cleanup_net <-process_one_work
          kworker/u128:0-8       [030] ...1.   429.292645: cleanup_net <-process_one_work

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Access &tcp_hashinfo via net. · 4461568a
      Kuniyuki Iwashima authored
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use tcp_hashinfo directly in most places.
      
      Instead, access it via net->ipv4.tcp_death_row.hashinfo.
      
      Direct access to tcp_hashinfo will be valid only while initialising
      tcp_hashinfo itself and creating/destroying each netns.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Set NULL to sk->sk_prot->h.hashinfo. · 429e42c1
      Kuniyuki Iwashima authored
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use the global sk->sk_prot->h.hashinfo
      to fetch a TCP hashinfo.
      
      Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
      a proper hashinfo from net->ipv4.tcp_death_row.hashinfo.
      
      Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
      disabled.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Don't allocate tcp_death_row outside of struct netns_ipv4. · e9bd0cca
      Kuniyuki Iwashima authored
      We will soon introduce an optional per-netns ehash and access hash
      tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo
      in most places.
      
      It could harm the fast path because dereferences of two fields in net
      and tcp_death_row might incur two extra cache line misses.  To save one
      dereference, let's place tcp_death_row back in netns_ipv4 and fetch
      hashinfo via net->ipv4.tcp_death_row"."hashinfo.
      
      Note tcp_death_row was initially placed in netns_ipv4, and commit
      fbb82952 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
      changed it to a pointer so that we can fire TIME_WAIT timers after freeing
      net.  However, we don't do so after commit 04c494e6 ("Revert "tcp/dccp:
      get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
      pointer.
      
      Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to
      tcp_sk_exit_batch() as a debug check.

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Clean up some functions. · 08eaef90
      Kuniyuki Iwashima authored
      This patch adds no functional change and cleans up some functions
      that the following patches touch around so that we make them tidy
      and easy to review/revert.  The changes are
      
        - Keep reverse christmas tree order
        - Remove unnecessary init of port in inet_csk_find_open_port()
        - Use req_to_sk() once in reqsk_queue_unlink()
        - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect()

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • headers: Remove some left-over license text · 17df341d
      Christophe JAILLET authored
      Remove a left-over from commit 2874c5fd ("treewide: Replace GPLv2
      boilerplate/reference with SPDX - rule 152")
      
      There is no need for an empty "License:".

      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Link: https://lore.kernel.org/r/0e5ff727626b748238f4b78932f81572143d8f0b.1662896317.git.christophe.jaillet@wanadoo.fr
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/bonding: add a test for bonding lladdr target · 152e8ec7
      Hangbin Liu authored
      This is a regression test for commit 592335a4 ("bonding: accept
      unsolicited NA message") and commit b7f14132 ("bonding: use unspecified
      address if no available link local address"). When the bond interface
      is up and has no available link-local address, the unspecified address
      (::) is used to send the NS message. The unsolicited NA message should
      also be accepted for validation.

      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Jonathan Toppins <jtoppins@redhat.com>
      Link: https://lore.kernel.org/r/20220920033047.173244-1-liuhangbin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mdio: mux-multiplexer: Switch to use dev_err_probe() helper · 4633b391
      Yang Yingliang authored
      dev_err() can be replaced with dev_err_probe(), which checks whether the
      error code is -EPROBE_DEFER.

      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220915065043.665138-3-yangyingliang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mdio: mux-mmioreg: Switch to use dev_err_probe() helper · 770aac8d
      Yang Yingliang authored
      dev_err() can be replaced with dev_err_probe(), which checks whether the
      error code is -EPROBE_DEFER.

      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220915065043.665138-2-yangyingliang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mdio: mux-meson-g12a: Switch to use dev_err_probe() helper · de0665c8
      Yang Yingliang authored
      dev_err() can be replaced with dev_err_probe(), which checks whether the
      error code is -EPROBE_DEFER.

      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220915065043.665138-1-yangyingliang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ravb: Add RZ/G2L MII interface support · 1089877a
      Biju Das authored
      The EMAC IP found on RZ/G2L Gb Ethernet supports the MII interface.
      This patch adds support for selecting the MII interface mode.

      Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
      Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
      Link: https://lore.kernel.org/r/20220914192604.265859-1-biju.das.jz@bp.renesas.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: rtnetlink: Enslave device before bringing it up · a4abfa62
      Phil Sutter authored
      Unlike with bridges, one can't add an interface to a bond and set it up
      at the same time:
      
      | # ip link set dummy0 down
      | # ip link set dummy0 master bond0 up
      | Error: Device can not be enslaved while up.
      
      Of all drivers with ndo_add_slave callback, bond and team decline if
      IFF_UP flag is set, vrf cycles the interface (i.e., sets it down and
      immediately up again) and the others just don't care.
      
      Support the common notion of setting the interface up after enslaving it
      by sorting the operations accordingly.
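
The effect of the reordering can be illustrated with a toy model. This is a sketch only: Device, Bond and set_link() are hypothetical stand-ins and not the rtnetlink code, and the bond's refusal to enslave an "up" device mirrors the error message quoted above.

```python
class Device:
    def __init__(self, name: str):
        self.name = name
        self.up = False
        self.master = None


class Bond:
    def __init__(self, name: str):
        self.name = name
        self.slaves = []

    def enslave(self, dev: Device) -> None:
        # Like bond and team, decline if the device is already up.
        if dev.up:
            raise RuntimeError("Device can not be enslaved while up.")
        dev.master = self
        self.slaves.append(dev)


def set_link(dev: Device, master: Bond, up: bool, enslave_first: bool = True) -> None:
    """Apply 'master' and 'up' from one request in the chosen order."""
    steps = [lambda: master.enslave(dev), lambda: setattr(dev, "up", up)]
    if not enslave_first:
        steps.reverse()  # old order: change flags first, then enslave
    for step in steps:
        step()


bond0 = Bond("bond0")
dummy0 = Device("dummy0")
set_link(dummy0, bond0, up=True)  # new order: succeeds
print(dummy0.master.name, dummy0.up)
```

Calling set_link() with enslave_first=False on a fresh device reproduces the failure, because the device is already up when the bond is asked to enslave it.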

      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220914150623.24152-1-phil@nwl.cc
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'macb-add-zynqmp-sgmii-dynamic-configuration-support' · 5f4e2564
      Jakub Kicinski authored
      Radhey Shyam Pandey says:
      
      ====================
      macb: add zynqmp SGMII dynamic configuration support
      
      This patchset adds firmware and driver support for SD/GEM dynamic
      configuration. In the traditional flow, GEM secure space configuration
      is done by the FSBL. However, in specific use cases such as dynamic
      designs where GEM is not enabled in the base Vivado design, the FSBL
      skips GEM initialization and we need a mechanism to configure the GEM
      secure space from Linux at runtime.
      ====================
      
      Link: https://lore.kernel.org/r/1663158796-14869-1-git-send-email-radhey.shyam.pandey@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: macb: Add zynqmp SGMII dynamic configuration support · 32cee781
      Radhey Shyam Pandey authored
      Add support for the dynamic configuration which takes care of
      configuring the GEM secure space configuration registers
      using EEMI APIs.
      The high-level sequence is to:
      - Check for PM dynamic configuration support. If there is no error,
        proceed with the GEM dynamic configuration (next steps); otherwise,
        skip the dynamic configuration.
      - Configure GEM Fixed configurations.
      - Configure GEM_CLK_CTRL (gemX_sgmii_mode).
      - Trigger GEM reset.

      Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Tested-by: Conor Dooley <conor.dooley@microchip.com> (for MPFS)
      Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • firmware: xilinx: add support for sd/gem config · 256dea91
      Ronak Jain authored
      Add new APIs in firmware to configure SD/GEM registers. Internally,
      they call the PM IOCTL for the following SD/GEM register configurations:
      - SD/EMMC select
      - SD slot type
      - SD base clock
      - SD 8 bit support
      - SD fixed config
      - GEM SGMII Mode
      - GEM fixed config

      Signed-off-by: Ronak Jain <ronak.jain@xilinx.com>
      Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
      Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • xen-netfront: make bounce_skb static · 53ff2517
      ruanjinjie authored
      The symbol is not used outside of the file, so mark it static.
      
      Fixes the following warning:
      
      ./drivers/net/xen-netfront.c:676:16: warning: symbol 'bounce_skb' was not declared. Should it be static?

      Signed-off-by: ruanjinjie <ruanjinjie@huawei.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lore.kernel.org/r/20220914064339.49841-1-ruanjinjie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: micrel: Add interrupts support for LAN8804 PHY · b324c6e5
      Horatiu Vultur authored
      Add support for interrupts for LAN8804 PHY.
      
      Tested-by: Michael Walle <michael@walle.cc> # on kontron-kswitch-d10
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Link: https://lore.kernel.org/r/20220913142926.816746-1-horatiu.vultur@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'sfp-add-support-for-halny-gpon-module' · c3188dba
      Jakub Kicinski authored
      Russell King says:
      
      ====================
      sfp: add support for HALNy GPON module
      
      This series adds support for the HALNy GPON SFP module. In order to do
      this sensibly, we need a more flexible quirk system, since we need to
      change the behaviour of the SFP cage driver to ignore the LOS and
      TX_FAULT signals after module detection.
      
      Since we move the SFP quirks into the SFP cage driver, we can use it
      for the MA5671A and 3FE46541AA modules as well.
      ====================
      
      Link: https://lore.kernel.org/r/YyDUnvM1b0dZPmmd@shell.armlinux.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: add support for HALNy GPON SFP · 73472c83
      Russell King (Oracle) authored
      Add a quirk for the HALNy HL-GSFP module, which appears to have an
      inverted RX_LOS signal, and maybe uses TX_FAULT as a serial port
      transmit pin. Rather than use these hardware signals, switch to
      using software polling for these status signals.

      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: move Huawei MA5671A fixup · 5029be76
      Russell King (Oracle) authored
      Move this module over to the new fixup mechanism.

      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: move Alcatel Lucent 3FE46541AA fixup · 27541675
      Russell King (Oracle) authored
      Add a new fixup mechanism to the SFP quirks, and use it for this
      module.

      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: move quirk handling into sfp.c · 23571c7b
      Russell King (Oracle) authored
      We need to handle more quirks than just those which affect the link
      modes of the module. Move the quirk lookup into sfp.c, and pass the
      quirk to sfp-bus.c.

      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sfp: re-implement soft state polling setup · 8475c4b7
      Russell King (Oracle) authored
      Re-implement the decision making for soft state polling. Instead of
      generating the soft state mask in sfp_soft_start_poll() by looking at
      which GPIOs are available, record their availability in
      sfp_sm_mod_probe() in sfp->state_hw_mask.
      
      This will then allow us to clear bits in sfp->state_hw_mask in module
      specific quirks when the hardware signals should not be used, thereby
      allowing us to switch to using the software state polling.
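
The mask handling described above can be sketched as plain bit arithmetic. The bit values, the soft_capable argument and all function names below are illustrative assumptions, not the definitions used in sfp.c; only the idea (record hardware availability, let a quirk clear bits, poll in software whatever is left) comes from the text.

```python
# Illustrative bit assignments; the real values live in the driver.
SFP_F_LOS = 1 << 0
SFP_F_TX_FAULT = 1 << 1
SFP_F_TX_DISABLE = 1 << 2


def probe_hw_mask(available_gpios) -> int:
    """Record which signals are backed by a GPIO at probe time."""
    mask = 0
    for bit in available_gpios:
        mask |= bit
    return mask


def quirk_ignore_los_tx_fault(state_hw_mask: int) -> int:
    """Module quirk: stop trusting the hardware LOS and TX_FAULT signals."""
    return state_hw_mask & ~(SFP_F_LOS | SFP_F_TX_FAULT)


def soft_poll_mask(soft_capable: int, state_hw_mask: int) -> int:
    """Signals to poll in software: supported by the module, not covered by hardware."""
    return soft_capable & ~state_hw_mask


hw = probe_hw_mask([SFP_F_LOS, SFP_F_TX_FAULT, SFP_F_TX_DISABLE])
soft = SFP_F_LOS | SFP_F_TX_FAULT
print(bin(soft_poll_mask(soft, hw)))
print(bin(soft_poll_mask(soft, quirk_ignore_los_tx_fault(hw))))
```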

      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: dsa: convert ocelot.txt to dt-schema · 7f32974b
      Vladimir Oltean authored
      Replace the free-form description of device tree bindings for VSC9959
      and VSC9953 with a YAML formatted dt-schema description. This contains
      more or less the same information, but reworded to be a bit more
      succinct.

      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Maxim Kochetkov <fido_max@inbox.ru>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Link: https://lore.kernel.org/r/20220913125806.524314-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-ipa-a-mix-of-cleanups' · 93ece9a6
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: a mix of cleanups
      
      This series contains a set of cleanups done in preparation for a
      more substantive upcoming series that reworks how IPA registers
      and their fields are defined.
      
      The first eliminates about half of the possible GSI register
      constant symbols by removing offset definitions that are not
      currently required.
      
      The next two mainly rearrange code for some common enumerated types.
      
      The next one fixes two spots that reuse local variable names in
      inner scopes when defining offsets.
      
      The next adds some additional restrictions on the value held in a
      register.
      
      And the last one just fixes two field mask symbol names so they
      adhere to the common naming convention.
      ====================
      
      Link: https://lore.kernel.org/r/20220910011131.1431934-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: fix two symbol names · dae4af6b
      Alex Elder authored
      All field mask symbols are defined with a "_FMASK" suffix, but
      EOT_COAL_GRANULARITY and DRBIP_ACL_ENABLE are defined without one.
      Fix that.

      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: update sequencer definition constraints · a14d5937
      Alex Elder authored
      Starting with IPA v4.5, replication is done differently from before,
      and as a result the "replication" portion of how the sequencer
      is specified must be zero.
      
      Add a check for the configuration data failing that requirement, and
      only update the sequencer type value when it's supported.

      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: don't reuse variable names · 9eefd2fb
      Alex Elder authored
      In ipa_endpoint_init_hdr(), as well as ipa_endpoint_init_hdr_ext(),
      a top-level automatic variable named "offset" is used to represent
      the offset of a register.
      
      However, deeper within each of those functions is *another*
      definition of a local variable with the same name, representing
      something else.  Scoping rules ensure the result is what was
      intended, but this variable name reuse is bad practice and makes
      the code confusing.
      
      Fix this by naming the inner variable "off".  Use "off" instead of
      "checksum_offset" in ipa_endpoint_init_cfg() for consistency.

      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: move and redefine ipa_version_valid() · 8b3cb084
      Alex Elder authored
      Move the definition of ipa_version_valid(), making it a static
      inline function defined together with the enumerated type in
      "ipa_version.h".  Define a new count value in the type.
      
      Rename the function to be ipa_version_supported(), and have it
      return true only if the IPA version supplied is explicitly supported
      by the driver.

      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: move the definition of gsi_ee_id · bb788de3
      Alex Elder authored
      Move the definition of the gsi_ee_id enumerated type out of "gsi.h"
      and into "ipa_version.h".  That latter header file isolates the
      definition of the ipa_version enumerated type, allowing it to be
      included in both IPA and GSI code.  We have the same requirement for
      gsi_ee_id, and moving it here makes it easier to get only that
      definition without everything else defined in "gsi.h".

      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: don't define unneeded GSI register offsets · 5ea42858
      Alex Elder authored
      Each GSI execution environment (EE) is able to access many of the
      GSI registers associated with the other EEs.  A block of GSI
      registers is contained within a region of memory, and an EE's
      register offset can be determined by adding the register's base
      offset to the product of the EE ID and a fixed constant.
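
The offset computation described above amounts to one multiply and one add. A minimal sketch, where the base offset, the EE IDs and the per-EE constant are made-up example numbers rather than values from the driver:

```python
def gsi_ee_reg_offset(base_offset: int, ee_id: int, ee_stride: int) -> int:
    """Offset of a per-EE register: base offset + EE ID * fixed constant."""
    return base_offset + ee_id * ee_stride


# Hypothetical numbers, for illustration only.
EXAMPLE_BASE = 0x100
EXAMPLE_STRIDE = 0x4000

for ee_id in range(3):
    print(ee_id, hex(gsi_ee_reg_offset(EXAMPLE_BASE, ee_id, EXAMPLE_STRIDE)))
```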
      
      Despite this possibility, the AP IPA code *never* accesses any GSI
      registers other than its own.  So there's no need to define the
      macros that compute register offsets for other EEs.
      
      Redefine the AP access macros to compute the offset the way the more
      general "any EE" macro would, and get rid of the unneeded macros.

      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-ethernet-adi-add-adin1110-support' · 01544a27
      Paolo Abeni authored
      Alexandru Tachici says:
      
      ====================
      net: ethernet: adi: Add ADIN1110 support
      
      The ADIN1110 is a low power single port 10BASE-T1L MAC-PHY
      designed for industrial Ethernet applications. It integrates
      an Ethernet PHY core with a MAC and all the associated analog
      circuitry, input and output clock buffering.
      
      ADIN1110 MAC-PHY encapsulates the ADIN1100 PHY. The PHY registers
      can be accessed through the MDIO MAC registers.
      We are registering an MDIO bus with custom read/write in order
      to let the PHY be discovered by the PAL. This will let
      the ADIN1100 Linux driver probe and take control of
      the PHY.
      
      The ADIN2111 is a low power, low complexity, two-Ethernet ports
      switch with integrated 10BASE-T1L PHYs and one serial peripheral
      interface (SPI) port.
      
      The device is designed for industrial Ethernet applications using
      low power constrained nodes and is compliant with the IEEE 802.3cg-2019
      Ethernet standard for long reach 10 Mbps single pair Ethernet (SPE).
      The switch supports various routing configurations between
      the two Ethernet ports and the SPI host port providing a flexible
      solution for line, daisy-chain, or ring network topologies.
      
      The ADIN2111 supports cable reach of up to 1700 meters with ultra
      low power consumption of 77 mW. The two PHY cores support the
      1.0 V p-p operating mode and the 2.4 V p-p operating mode defined
      in the IEEE 802.3cg standard.
      
      The device integrates the switch, two Ethernet physical layer (PHY)
      cores with a media access control (MAC) interface and all the
      associated analog circuitry, and input and output clock buffering.
      
      The device also includes internal buffer queues, the SPI and
      subsystem registers, as well as the control logic to manage the reset
      and clock control and hardware pin configuration.
      
      Access to the PHYs is exposed via an internal MDIO bus. Writes/reads
      can be performed by reading/writing to the ADIN2111 MDIO registers
      via SPI.
      
      On probe, for each port, a struct net_device is allocated and
      registered. When both ports are added to the same bridge, the driver
      will enable offloading of frame forwarding at the hardware level.
      
      The driver offers STP support. In the forwarding state a port
      operates normally. In any other state, only frames with the
      802.1D destination address (DA) are passed to the host.
      
      When both ports of the ADIN2111 belong to the same SW bridge, a maximum
      of 12 FDB entries will be offloaded by the hardware and marked as such.
      ====================
      
      Link: https://lore.kernel.org/r/20220913122629.124546-1-andrei.tachici@stud.acs.upb.ro
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      01544a27
    • Alexandru Tachici's avatar
      dt-bindings: net: adin1110: Add docs · 9fd12e86
      Alexandru Tachici authored
      Add bindings for the ADIN1110/2111 MAC-PHY/SWITCH.
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Alexandru Tachici <alexandru.tachici@analog.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      9fd12e86
    • Alexandru Tachici's avatar
      net: ethernet: adi: Add ADIN1110 support · bc93e19d
      Alexandru Tachici authored
      The ADIN1110 is a low power single port 10BASE-T1L MAC-PHY
      designed for industrial Ethernet applications. It integrates
      an Ethernet PHY core with a MAC and all the associated analog
      circuitry, input and output clock buffering.
      
      ADIN1110 MAC-PHY encapsulates the ADIN1100 PHY. The PHY registers
      can be accessed through the MDIO MAC registers.
      We register an MDIO bus with custom read/write operations so that
      the PHY can be discovered by the PAL. This lets the ADIN1100 Linux
      driver probe and take control of the PHY.
      
      The ADIN2111 is a low power, low complexity, two-port Ethernet
      switch with integrated 10BASE-T1L PHYs and one serial peripheral
      interface (SPI) port.
      
      The device is designed for industrial Ethernet applications using
      low power constrained nodes and is compliant with the IEEE 802.3cg-2019
      Ethernet standard for long reach 10 Mbps single pair Ethernet (SPE).
      The switch supports various routing configurations between
      the two Ethernet ports and the SPI host port providing a flexible
      solution for line, daisy-chain, or ring network topologies.
      
      The ADIN2111 supports cable reach of up to 1700 meters with ultra
      low power consumption of 77 mW. The two PHY cores support the
      1.0 V p-p operating mode and the 2.4 V p-p operating mode defined
      in the IEEE 802.3cg standard.
      
      The device integrates the switch, two Ethernet physical layer (PHY)
      cores with a media access control (MAC) interface and all the
      associated analog circuitry, and input and output clock buffering.
      
      The device also includes internal buffer queues, the SPI and
      subsystem registers, as well as the control logic to manage the reset
      and clock control and hardware pin configuration.
      
      Access to the PHYs is exposed via an internal MDIO bus. Writes/reads
      can be performed by reading/writing to the ADIN2111 MDIO registers
      via SPI.
      
      On probe, for each port, a struct net_device is allocated and
      registered. When both ports are added to the same bridge, the driver
      will enable offloading of frame forwarding at the hardware level.
      
      The driver offers STP support. In the forwarding state a port
      operates normally. In any other state, only frames with the
      802.1D destination address (DA) are passed to the host.
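
      The STP rule above can be modelled in a few lines. This is only an
      illustrative sketch, not the driver's code: the function name is
      hypothetical, the state values mirror the kernel's BR_STATE_* constants,
      and it assumes "the 802.1d DA" means the bridge group address
      01:80:C2:00:00:00.

```python
from enum import Enum

# Assumption: the "802.1d DA" is the bridge group address used by STP BPDUs.
BRIDGE_GROUP_ADDR = bytes.fromhex("0180c2000000")


class StpState(Enum):
    DISABLED = 0
    LISTENING = 1
    LEARNING = 2
    FORWARDING = 3
    BLOCKING = 4


def stp_filter_allows(state: StpState, dst_mac: bytes) -> bool:
    """Hypothetical model: may a frame with this DA reach the host?"""
    if state is StpState.FORWARDING:
        # Normal operation: the STP rule does not restrict the frame.
        return True
    # Any other state: only frames addressed to the 802.1D DA get through.
    return dst_mac == BRIDGE_GROUP_ADDR
```

      In this model a unicast frame is dropped on a port in the learning
      state, while a BPDU still reaches the host so the bridge can keep
      running STP.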
      
      When both ports of the ADIN2111 belong to the same SW bridge, a maximum
      of 12 FDB entries will be offloaded by the hardware and marked as such.
      Co-developed-by: Lennart Franzen <lennart@lfdomain.com>
      Signed-off-by: Lennart Franzen <lennart@lfdomain.com>
      Signed-off-by: Alexandru Tachici <alexandru.tachici@analog.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      bc93e19d
    • Alexandru Tachici's avatar
      net: phy: adin1100: add PHY IDs of adin1110/adin2111 · 875b718a
      Alexandru Tachici authored
      Add additional PHY IDs for the internal PHYs of adin1110 and adin2111.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Alexandru Tachici <alexandru.tachici@analog.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      875b718a
    • Paolo Abeni's avatar
      Merge branch 'seg6-add-next-c-sid-support-for-srv6-end-behavior' · cec9d59e
      Paolo Abeni authored
      Andrea Mayer says:
      
      ====================
      seg6: add NEXT-C-SID support for SRv6 End behavior
      
      The Segment Routing (SR) architecture is based on loose source routing.
      A list of instructions, called segments, can be added to the packet headers to
      influence the forwarding and processing of the packets in an SR enabled
      network.
      In SRv6 (Segment Routing over IPv6 data plane) [1], the segment identifiers
      (SIDs) are IPv6 addresses (128 bits) and the segment list (SID List) is carried
      in the Segment Routing Header (SRH). A segment may correspond to a "behavior"
      that is executed by a node when the packet is received.
      The Linux kernel currently supports a large subset of the behaviors described
      in [2] (e.g., End, End.X, End.T and so on).
      
      Some SRv6 scenarios (e.g., traffic-engineering, fast-rerouting, VPN, mobile
      network backhaul) may require a large number of segments (i.e., up to 15).
      Therefore, reducing the size of the SID List is useful to minimize the impact
      on MTU (Maximum Transmission Unit) and to enable SRv6 on legacy hardware devices
      with limited processing power that can suffer from long IPv6 headers.
      
      Draft-ietf-spring-srv6-srh-compression [3] extends the SRv6 architecture by
      providing different mechanisms for the efficient representation (i.e.
      compression) of the SID List.
      
      The NEXT-C-SID mechanism described in [3] offers the possibility of encoding
      several SRv6 segments within a single 128 bit SID address. Such a SID address
      is called a Compressed SID Container. In this way, the length of the SID List
      can be drastically reduced. In some cases, the SRH can be omitted, as the IPv6
      Destination Address can carry the whole Segment List, using its compressed
      representation.
      
      The NEXT-C-SID mechanism relies on the "flavors" framework defined in [2].
      The flavors represent additional operations that can modify or extend a subset
      of the existing behaviors.
      
      In this patchset we extend the SRv6 Subsystem in order to support the
      NEXT-C-SID mechanism.
      
      In detail, the patchset consists of:
       - patch 1/3: add netlink_ext_ack support in parsing SRv6 behavior attributes;
       - patch 2/3: add NEXT-C-SID support for SRv6 End behavior;
       - patch 3/3: add selftest for NEXT-C-SID in SRv6 End behavior.
      
      The corresponding iproute2 patch for supporting the NEXT-C-SID in SRv6 End
      behavior is provided in a separate patchset.
      
      Comments, improvements and suggestions are always appreciated.
      
      [1] - https://datatracker.ietf.org/doc/html/rfc8754
      [2] - https://datatracker.ietf.org/doc/html/rfc8986
      [3] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
      
      ====================
      
      Link: https://lore.kernel.org/r/20220912171619.16943-1-andrea.mayer@uniroma2.it
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      cec9d59e
    • Andrea Mayer's avatar
      selftests: seg6: add selftest for NEXT-C-SID flavor in SRv6 End behavior · 19d6356a
      Andrea Mayer authored
      This selftest is designed for testing the support of NEXT-C-SID flavor
      for SRv6 End behavior. It instantiates a virtual network composed of
      several nodes: hosts and SRv6 routers. Each node is realized using a
      network namespace that is properly interconnected to others through veth
      pairs.
      The test considers SRv6 routers implementing IPv4/IPv6 L3 VPNs leveraged
      by hosts for communicating with each other. Such routers i) apply
      different SRv6 Policies to the traffic received from connected hosts,
      considering the IPv4 or IPv6 protocols; ii) use the NEXT-C-SID
      compression mechanism for encoding several SRv6 segments within a single
      128-bit SID address, referred to as a Compressed SID (C-SID) container.
      
      The NEXT-C-SID is provided as a "flavor" of the SRv6 End behavior,
      enabling it to properly process the C-SID containers. The correct
      execution of the enabled NEXT-C-SID SRv6 End behavior is verified
      through reachability tests carried out between hosts belonging to the
      same VPN.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Acked-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      19d6356a
    • Andrea Mayer's avatar
      seg6: add NEXT-C-SID support for SRv6 End behavior · 848f3c0d
      Andrea Mayer authored
      The NEXT-C-SID mechanism described in [1] offers the possibility of
      encoding several SRv6 segments within a single 128 bit SID address. Such
      a SID address is called a Compressed SID (C-SID) container. In this way,
      the length of the SID List can be drastically reduced.
      
      A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
      logically structured in three main blocks: i) Locator-Block; ii)
      Locator-Node Function; iii) Argument.
      
                              C-SID container
      +------------------------------------------------------------------+
      |     Locator-Block      |Loc-Node|            Argument            |
      |                        |Function|                                |
      +------------------------------------------------------------------+
      <--------- B -----------> <- NF -> <------------- A --------------->
      
         (i) The Locator-Block can be any IPv6 prefix available to the provider;
      
        (ii) The Locator-Node Function represents the node and the function to
             be triggered when a packet is received on the node;
      
       (iii) The Argument carries the remaining C-SIDs in the current C-SID
             container.
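
      The three blocks listed above can be illustrated with a short sketch.
      It is not the kernel implementation: the function names are
      hypothetical, the 32-bit and 16-bit lengths are only example values, and
      the "advance" step (shift the Argument up into the Locator-Node Function
      position and zero-fill the tail) reflects one reading of the
      srh-compression draft rather than anything stated in this commit message.

```python
import ipaddress


def split_csid_container(addr, block_bits=32, func_bits=16):
    """Return (locator_block, locator_node_function, argument) as integers."""
    value = int(ipaddress.IPv6Address(addr))
    arg_bits = 128 - block_bits - func_bits
    block = value >> (128 - block_bits)
    func = (value >> arg_bits) & ((1 << func_bits) - 1)
    arg = value & ((1 << arg_bits) - 1)
    return block, func, arg


def advance_csid_container(addr, block_bits=32, func_bits=16):
    """Hypothetical NEXT-C-SID step: consume the current C-SID.

    The Locator-Block is kept, the Argument moves up by func_bits so the
    next C-SID becomes the Locator-Node Function, and the vacated low
    bits are zero.
    """
    block, _func, arg = split_csid_container(addr, block_bits, func_bits)
    value = (block << (128 - block_bits)) | (arg << func_bits)
    return str(ipaddress.IPv6Address(value))
```

      For example, with a made-up container fcbb:bb00:100:200:300::, the
      Locator-Block is fcbb:bb00, the current Locator-Node Function is 0100,
      and advancing once yields fcbb:bb00:200:300::.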
      
      The NEXT-C-SID mechanism relies on the "flavors" framework defined in
      [2]. The flavors represent additional operations that can modify or
      extend a subset of the existing behaviors.
      
      This patch introduces support for flavors in the SRv6 End behavior
      and implements the NEXT-C-SID flavor. An SRv6 End behavior with the
      NEXT-C-SID flavor works as an End behavior, but it is also capable of
      processing the compressed SID List encoded in C-SID containers.
      
      An SRv6 End behavior with NEXT-C-SID flavor can be configured to support
      user-provided Locator-Block and Locator-Node Function lengths. In this
      implementation, such lengths must be evenly divisible by 8 (i.e. must be
      byte-aligned), otherwise the kernel informs the user about invalid
      values with a meaningful error code and message through netlink_ext_ack.
      
      If Locator-Block and/or Locator-Node Function lengths are not provided
      by the user during configuration of an SRv6 End behavior instance with
      NEXT-C-SID flavor, the kernel will choose their default values i.e.,
      32-bit Locator-Block and 16-bit Locator-Node Function.
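
      The length rules described above can be sketched as follows. This is a
      simplified user-space model, not the kernel's parser: the function name
      is hypothetical, and the "positive" and "sum below 128 bits" checks are
      extra assumptions of this sketch. Only the byte-alignment rule and the
      32/16-bit defaults come from the commit message.

```python
DEFAULT_BLOCK_BITS = 32  # default Locator-Block length
DEFAULT_FUNC_BITS = 16   # default Locator-Node Function length


def check_csid_lengths(block_bits=None, func_bits=None):
    """Apply defaults and validate the two lengths (hypothetical helper)."""
    block = DEFAULT_BLOCK_BITS if block_bits is None else block_bits
    func = DEFAULT_FUNC_BITS if func_bits is None else func_bits
    for name, value in (("Locator-Block", block),
                        ("Locator-Node Function", func)):
        # From the commit message: lengths must be evenly divisible by 8.
        if value <= 0 or value % 8 != 0:
            raise ValueError(
                f"{name} length must be a positive multiple of 8, got {value}")
    # Assumption of this sketch: leave room for a non-empty Argument.
    if block + func >= 128:
        raise ValueError("Locator-Block plus Locator-Node Function too long")
    return block, func
```

      Calling check_csid_lengths() with no arguments returns the defaults,
      while a value such as 36 is rejected with a descriptive message, in the
      spirit of the netlink_ext_ack reporting mentioned above.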
      
      [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
      [2] - https://datatracker.ietf.org/doc/html/rfc8986

      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      848f3c0d
    • Andrea Mayer's avatar
      seg6: add netlink_ext_ack support in parsing SRv6 behavior attributes · e2a8ecc4
      Andrea Mayer authored
      An SRv6 behavior instance can be set up using mandatory and/or optional
      attributes.
      In the setup phase, each supplied attribute is parsed and processed. If
      the parsing operation fails, the creation of the behavior instance stops
      and an error number/code is reported to the user.  In many cases, it is
      challenging for the user to figure out exactly what happened by relying
      only on the error code.
      
      For this reason, we add the support for netlink_ext_ack in parsing SRv6
      behavior attributes. In this way, when an SRv6 behavior attribute is
      parsed and an error occurs, the kernel can send a message to the
      userspace describing the error through a meaningful text message in
      addition to the classic error code.
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      e2a8ecc4