1. 08 Feb, 2023 40 commits
    • net/sched: taprio: split segmentation logic from qdisc_enqueue() · 2d5e8071
      Vladimir Oltean authored
      Most of the taprio_enqueue() function body is spent doing TCP
      segmentation, which doesn't look right to me. Compilers shouldn't have a
      problem inlining code no matter how we write it, so move the
      segmentation logic to a separate function.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations · fed87cc6
      Vladimir Oltean authored
      taprio today has a huge problem with small TC gate durations, because it
      might accept packets in taprio_enqueue() which will never be sent by
      taprio_dequeue().
      
      Since not much infrastructure was available, a kludge was added in
      commit 497cc002 ("taprio: Handle short intervals and large
      packets"), which segmented large TCP segments, but the fact of the
      matter is that the issue isn't specific to large TCP segments (and even
      worse, the performance penalty in segmenting those is absolutely huge).
      
      In commit a54fc09e ("net/sched: taprio: allow user input of per-tc
      max SDU"), taprio gained support for queueMaxSDU, which is precisely the
      mechanism through which packets should be dropped at qdisc_enqueue() if
      they cannot be sent.
      
      After that patch, it was necessary for the user to manually limit the
      maximum MTU per TC. This change adds the necessary logic for taprio to
      further limit the values specified (or not specified) by the user to
      some minimum values which never allow oversized packets to be sent.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
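
As a rough illustration of the limiting logic this commit message describes, the Python sketch below derives a per-TC max SDU from an open-gate duration and the link speed, then clamps the user-provided value to it. The function names, the treatment of 0 as "no user limit", and the subtraction of the Ethernet header and L1 overhead are assumptions made for the example. It is not the kernel implementation.

```python
ETH_HLEN = 14  # Ethernet header bytes: part of the frame, not of the SDU


def max_frame_len(gate_duration_ns: int, link_speed_mbps: int) -> int:
    """Bytes that fit on the wire within gate_duration_ns (hypothetical helper)."""
    # bits = ns * Mbit/s / 1000, bytes = bits / 8
    return gate_duration_ns * link_speed_mbps // 8000


def effective_max_sdu(user_max_sdu, gate_duration_ns, link_speed_mbps, l1_overhead=0):
    """Smaller of the user limit and the limit implied by the gate duration."""
    auto = max(max_frame_len(gate_duration_ns, link_speed_mbps) - l1_overhead - ETH_HLEN, 0)
    if user_max_sdu == 0:  # assumption: 0 means the user set no limit
        return auto
    return min(user_max_sdu, auto)


# A 100 us gate at 10 Mbit/s carries 125 bytes, so a user limit of 200 is tightened.
print(effective_max_sdu(200, 100_000, 10, l1_overhead=24))    # 87
# The same gate at 1 Gbit/s carries 12500 bytes, so the user limit of 200 wins.
print(effective_max_sdu(200, 100_000, 1000, l1_overhead=24))  # 200
```

The two calls show why the result depends on link speed: the same schedule and the same user request yield different operational limits.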
    • net/sched: keep the max_frm_len information inside struct sched_gate_list · a878fd46
      Vladimir Oltean authored
      I have one practical reason for doing this and one concerning correctness.
      
      The practical reason has to do with a follow-up patch, which aims to mix
      2 sources of max_sdu (one coming from the user and the other automatically
      calculated based on TC gate durations at the current link speed). Among those
      2 sources of input, we must always select the smaller max_sdu value, but
      this can change at various link speeds. So the max_sdu coming from the
      user must be kept separated from the value that is operationally used
      (the minimum of the 2), because otherwise we overwrite it and forget
      what the user asked us to do.
      
      To solve that, this patch proposes that struct sched_gate_list contains
      the operationally active max_frm_len, and q->max_sdu contains just what
      was requested by the user.
      
      The reason having to do with correctness is based on the following
      observation: the admin sched_gate_list becomes operational at a given
      base_time in the future. Until then, it is inactive and applies no
      shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
      apply either (this is a mechanism to ensure that packets which cannot
      be transmitted within the largest gate duration for that TC don't hang
      the port; clearly it makes little sense if the gates are always open).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: warn about missing size table · a3d91b2c
      Vladimir Oltean authored
      Vinicius intended taprio to take the L1 overhead into account when
      estimating packet transmission time through user input, specifically
      through the qdisc size table (man tc-stab).
      
      Something like this:
      
      tc qdisc replace dev $eth root stab overhead 24 taprio \
      	num_tc 8 \
      	map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
      	base-time 0 \
      	sched-entry S 0x7e 9000000 \
      	sched-entry S 0x82 1000000 \
      	max-sdu 0 0 0 0 0 0 0 200 \
      	flags 0x0 clockid CLOCK_TAI
      
      Without the overhead being specified, transmission times will be
      underestimated and will cause late transmissions. For an offloading
      driver, it might even cause TX hangs if there is no open gate large
      enough to send the maximum sized packets for that TC (including L1
      overhead). Properly knowing the L1 overhead will ensure that we are able
      to auto-calculate the queueMaxSDU per traffic class just right, and
      avoid these hangs due to head-of-line blocking.
      
      We can't make the stab mandatory due to existing setups, but we can warn
      the user that it's important with a warning netlink extack.
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
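
To make the underestimation concrete, here is a small Python sketch of the wire time for one frame with and without the 24 bytes of overhead used in the `stab overhead 24` example. The helper name and the 1514-byte frame are assumptions for illustration. On Ethernet, 24 bytes would plausibly cover preamble, start-of-frame delimiter, FCS and inter-packet gap, but that breakdown is not stated in the commit message.

```python
def tx_time_ns(frame_len: int, link_speed_mbps: int, l1_overhead: int = 0) -> int:
    """Time on the wire for one frame, in nanoseconds (hypothetical helper)."""
    bits = (frame_len + l1_overhead) * 8
    return bits * 1000 // link_speed_mbps


frame = 1514  # 1500-byte payload plus 14-byte Ethernet header
without_stab = tx_time_ns(frame, 1000)
with_stab = tx_time_ns(frame, 1000, l1_overhead=24)
print(without_stab, with_stab, with_stab - without_stab)  # 12112 12304 192
```

At 1 Gbit/s the estimate is short by 192 ns per frame, and the error scales up at lower link speeds.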
    • net/sched: make stab available before ops->init() call · 1f62879e
      Vladimir Oltean authored
      Some qdiscs like taprio turn out to be actually pretty reliant on a well
      configured stab, to not underestimate the skb transmission time (by
      properly accounting for L1 overhead).
      
      In a future change, taprio will need the stab, if configured by the
      user, to be available at ops->init() time. It will become even more
      important in upcoming work, when the overhead will be used for the
      queueMaxSDU calculation that is passed to an offloading driver.
      
      However, rcu_assign_pointer(sch->stab, stab) is called right after
      ops->init(), making it unavailable, and I don't really see a good reason
      for that.
      
      Move it earlier, which nicely seems to simplify the error handling path
      as well.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: calculate guard band against actual TC gate close time · a1e6ad30
      Vladimir Oltean authored
      taprio_dequeue_from_txq() looks at the entry->end_time to determine
      whether the skb will overrun its traffic class gate, as if at the end of
      the schedule entry there surely is a "gate close" event for it. Hint:
      maybe there isn't.
      
      For each schedule entry, introduce an array of kernel times which
      tracks when in the future there will be an *actual* gate close event
      for each traffic class, and use that in the guard band overrun
      calculation.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
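
The difference between the two deadlines can be sketched in Python as follows. The function and variable names, and the 100 us entries, are hypothetical and only illustrate the comparison. The kernel code works on ktime values and per-entry arrays.

```python
def overruns(now_ns: int, tx_ns: int, deadline_ns: int) -> bool:
    """True if a frame started at now_ns would still be on the wire after deadline_ns."""
    return now_ns + tx_ns > deadline_ns


# Two consecutive 100 us entries that both keep the gate for this TC open.
entry_end_time = 100_000      # end of the first schedule entry
tc_gate_close_time = 200_000  # the gate really closes only after the second entry

now, tx = 95_000, 12_304
print(overruns(now, tx, entry_end_time))      # True: frame needlessly held back
print(overruns(now, tx, tc_gate_close_time))  # False: frame may be sent
```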
    • net/sched: taprio: calculate budgets per traffic class · d2ad689d
      Vladimir Oltean authored
      Currently taprio assumes that the budget for a traffic class expires at
      the end of the current interval as if the next interval contains a "gate
      close" event for this traffic class.
      
      This is, however, an unfounded assumption. Allow schedule entry
      intervals to be fused together for a particular traffic class by
      calculating the budget until the gate *actually* closes.
      
      This means we need to keep budgets per traffic class, and we also need
      to update the budget consumption procedure.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
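
One way to picture per-TC budgets is the Python sketch below. It assumes that a budget is the number of bytes transmittable while the gate stays open, that a gate open for the whole cycle gets an unlimited budget, and that a sent frame is charged to every finite budget because all traffic classes share one wire. These are illustrative assumptions with hypothetical names, not a description of the kernel code.

```python
NO_LIMIT = None  # hypothetical marker: the gate never closes, so the budget is infinite


def initial_budgets(gate_durations_ns, cycle_time_ns, link_speed_mbps):
    """Per-TC byte budgets derived from how long each gate stays open."""
    budgets = []
    for duration in gate_durations_ns:
        if duration == cycle_time_ns:
            budgets.append(NO_LIMIT)
        else:
            budgets.append(duration * link_speed_mbps // 8000)
    return budgets


def consume(budgets, frame_len, tc_sent):
    """Charge a sent frame to every finite budget; return what tc_sent has left."""
    for tc, budget in enumerate(budgets):
        if budget is not NO_LIMIT:
            budgets[tc] = budget - frame_len
    return budgets[tc_sent]


budgets = initial_budgets([0, 10_000_000, 9_000_000], 10_000_000, 1000)
print(budgets)                    # [0, None, 1125000]
print(consume(budgets, 1500, 2))  # 1123500
```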
    • net/sched: taprio: rename close_time to end_time · e5517551
      Vladimir Oltean authored
      There is a confusion of terms in taprio: what is called "close_time"
      is actually used for 2 things:
      
      1. determining when an entry "closes" such that transmitted skbs are
         never allowed to overrun that time (?!)
      2. an aid for determining when to advance and/or restart the schedule
         using the hrtimer
      
      It makes more sense to call this so-called "close_time" "end_time",
      because it's not clear at all to me what "closes". Future patches will
      hopefully make better use of the term "to close".
      
      This is an absolutely mechanical change.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: calculate tc gate durations · a306a90c
      Vladimir Oltean authored
      Current taprio code operates on a very simplistic (and incorrect)
      assumption: that egress scheduling for a traffic class can only take
      place for the duration of the current interval, or i.o.w., it assumes
      that at the end of each schedule entry, there is a "gate close" event
      for all traffic classes.
      
      As an example, traffic sent with the schedule below will be jumpy, even
      though all 8 TC gates are open, so there is absolutely no "gate close"
      event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
      consecutive schedule entries):
      
      tc qdisc replace dev veth0 parent root taprio \
      	num_tc 2 \
      	map 0 1 \
      	queues 1@0 1@1 \
      	base-time 0 \
      	sched-entry S 0xff 4000000000 \
      	clockid CLOCK_TAI \
      	flags 0x0
      
      This qdisc simply does not have what it takes in terms of logic to
      *actually* compute the durations of traffic classes. Also, it does not
      recognize the need to use this information on a per-traffic-class basis:
      it always looks at entry->interval and entry->close_time.
      
      This change proposes that each schedule entry has an array called
      tc_gate_duration[tc]. This holds the information: "for how long will
      this traffic class gate remain open, starting from *this* schedule
      entry". If the traffic class gate is always open, that value is equal to
      the cycle time of the schedule.
      
      We'll also need to keep track, for the purpose of queueMaxSDU[tc]
      calculation, of the maximum duration for which a traffic class gate
      stays open. This directly gives us the maximum sized packet that this
      traffic class will have to accept. Anything larger has to be dropped
      with qdisc_drop() in qdisc_enqueue().
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
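
The idea of tc_gate_duration[tc] can be sketched in Python as below: for every schedule entry and traffic class, walk forward through the cyclic schedule while the gate bit stays set and add up the intervals. The function name and data layout are assumptions for the example, and the kernel computes this differently in detail. Walking at most one full cycle makes an always-open gate come out equal to the cycle time, as the commit message states.

```python
def tc_gate_durations(entries, num_tc):
    """entries: list of (gate_mask, interval_ns). Returns (durations, max_open).

    durations[i][tc] is how long the gate of tc stays open starting at entry i;
    max_open[tc] is the largest such duration over all entries.
    """
    n = len(entries)
    durations = [[0] * num_tc for _ in range(n)]
    max_open = [0] * num_tc
    for tc in range(num_tc):
        bit = 1 << tc
        for i in range(n):
            total = 0
            # Walk forward, wrapping around the cycle, while the gate stays open.
            for step in range(n):
                mask, interval = entries[(i + step) % n]
                if not mask & bit:
                    break
                total += interval
            durations[i][tc] = total
            max_open[tc] = max(max_open[tc], total)
    return durations, max_open


# Two-entry schedule: gates 0x7e for 9 ms, then gates 0x82 for 1 ms.
durations, max_open = tc_gate_durations([(0x7E, 9_000_000), (0x82, 1_000_000)], 8)
print(durations[0][1], durations[0][2], durations[1][2], durations[1][7])
# 10000000 9000000 0 1000000
```

TC 1 is open in both entries and therefore gets the full 10 ms cycle, while TC 7 is open only during the 1 ms entry.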
    • net/sched: taprio: give higher priority to higher TCs in software dequeue mode · 2f530df7
      Vladimir Oltean authored
      Current taprio software implementation is haunted by the shadow of the
      igb/igc hardware model. It iterates over child qdiscs in increasing
      order of TXQ index, therefore giving higher xmit priority to TXQ 0 and
      lower to TXQ N. According to discussions with Vinicius, that is the
      default (perhaps even unchangeable) prioritization scheme used for the
      NICs that taprio was first written for (igb, igc), and we have a case of
      two bugs canceling out, resulting in a functional setup on igb/igc, but
      a less sane one on other NICs.
      
      To the best of my understanding, taprio should prioritize based on the
      traffic class, so it should really dequeue starting with the highest
      traffic class and going down from there. We get to the TXQ using the
      tc_to_txq[] netdev property.
      
      TXQs within the same TC have the same (strict) priority, so we should
      pick from them as fairly as we can. We can achieve that by implementing
      something very similar to q->curband from multiq_dequeue().
      
      Since igb/igc really do have TXQ 0 of higher hardware priority than
      TXQ 1 etc, we need to preserve the behavior for them as well. We really
      have no choice, because in txtime-assist mode, taprio is essentially a
      software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
      selection really does matter (not all igb TXQs support ETF/SO_TXTIME,
      says Kurt Kanzenbach).
      
      To preserve the behavior, we need a capability bit so that taprio can
      determine if it's running on igb/igc, or on something else. Because igb
      doesn't offload taprio at all, we can't piggyback on the
      qdisc_offload_query_caps() call from taprio_enable_offload(), but
      instead we need a separate call which is also made for software
      scheduling.
      
      Introduce two static keys to minimize the performance penalty on systems
      which only have igb/igc NICs, and on systems which only have other NICs.
      For mixed systems, taprio will have to dynamically check whether to
      dequeue using one prioritization algorithm or using the other.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
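
A minimal Python sketch of the traffic-class-based order described above: visit TCs from highest to lowest and rotate fairly among the TXQs of one TC. Gate and budget checks are left out, tc_to_txq is modeled as (offset, count) pairs, and all names are assumptions for illustration rather than the kernel's data structures.

```python
from collections import deque


def dequeue(queues, tc_to_txq, cur_txq):
    """Pop one packet: highest TC first, round-robin among the TXQs of a TC.

    queues: one deque per TXQ; tc_to_txq: (offset, count) per TC;
    cur_txq: per-TC TXQ to try first, updated in place.
    """
    for tc in range(len(tc_to_txq) - 1, -1, -1):
        offset, count = tc_to_txq[tc]
        first = cur_txq[tc]
        for step in range(count):
            txq = offset + (first - offset + step) % count
            if queues[txq]:
                # Start from the following TXQ next time, for fairness.
                cur_txq[tc] = offset + (txq - offset + 1) % count
                return queues[txq].popleft()
    return None


queues = [deque(["a"]), deque(["b1", "b2"]), deque(["c1"])]
tc_to_txq = [(0, 1), (1, 2)]  # TC 0 -> TXQ 0; TC 1 -> TXQs 1 and 2
cur_txq = [0, 1]
print([dequeue(queues, tc_to_txq, cur_txq) for _ in range(5)])
# ['b1', 'c1', 'b2', 'a', None]
```

TXQ 0 is served last here even though it has the lowest index, because it belongs to the lowest traffic class.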
    • net/sched: taprio: avoid calling child->ops->dequeue(child) twice · 4c229427
      Vladimir Oltean authored
      Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
      earlier than the previous skb_found label. This is possible because
      we've unified the treatment of the child->ops->dequeue(child) return
      value: we always try other TXQs now, instead of abandoning the root
      dequeue completely if we failed in the peek() case.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: refactor one skb dequeue from TXQ to separate function · 92f96667
      Vladimir Oltean authored
      Future changes will refactor the TXQ selection procedure, and a lot of
      stuff will become messy, the indentation of the bulk of the dequeue
      procedure would increase, etc.
      
      Break out the bulk of the function into a new one, which knows the TXQ
      (child qdisc) we should perform a dequeue from.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: continue with other TXQs if one dequeue() failed · 1638bbbe
      Vladimir Oltean authored
      This changes the handling of an unlikely condition to not stop dequeuing
      if taprio failed to dequeue the peeked skb in taprio_dequeue().
      
      I've no idea when this can happen, but the only side effect seems to be
      that the atomic_sub_return() call right above will have consumed some
      budget. This isn't a big deal, since either that made us remain without
      any budget (and therefore, we'd exit on the next peeked skb anyway), or
      we could send some packets from other TXQs.
      
      I'm making this change because in a future patch I'll be refactoring the
      dequeue procedure to simplify it, and this corner case will have to go
      away.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: delete peek() implementation · ecc0cc98
      Vladimir Oltean authored
      There isn't any code in the network stack which calls taprio_peek().
      We only see qdisc->ops->peek() being called on child qdiscs of other
      classful qdiscs, never from the generic qdisc code, whereas taprio is
      never a child qdisc; it is always root.
      
      This snippet of a comment from qdisc_peek_dequeued() seems to confirm:
      
      	/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
      
      Since I've been known to be wrong many times though, I'm not completely
      removing it, but leaving a stub function in place which emits a warning.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'micrel-lan8841-support' · 6da13bf9
      David S. Miller authored
      Horatiu Vultur says:
      
      ====================
      net: micrel: Add support for lan8841 PHY
      
      Add support for lan8841 PHY.
      
      The first patch adds support for the lan8841 PHY, which can run at
      10/100/1000Mbit. It also has support for other features, but they are not
      added in this series.
      
      The second patch updates the documentation for the dt-bindings which is
      similar to the ksz9131.
      
      v3->v4:
      - add space between defines and function names
      - inside lan8841_config_init use only ret variable
      
      v2->v3:
      - reuse ksz9131_config_init
      - allow only open-drain configuration
      - change from single patch to a patch series
      
      v1->v2:
      - Remove hardcoded values
      - Fix typo in commit message
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: micrel-ksz90x1.txt: Update for lan8841 · 33e581d7
      Horatiu Vultur authored
      The lan8841 has the same bindings as ksz9131, so just reuse the entire
      section of ksz9131.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: micrel: Add support for lan8841 PHY · a8f1a19d
      Horatiu Vultur authored
      The LAN8841 is a completely integrated triple-speed (10BASE-T/100BASE-TX/
      1000BASE-T) Ethernet physical layer transceiver for transmission and
      reception of data on standard CAT-5, as well as CAT-5e and CAT-6,
      unshielded twisted pair (UTP) cables.
      The LAN8841 offers the industry-standard GMII/MII as well as the RGMII.
      Some of the features of the PHY are:
      - Wake on LAN
      - Auto-MDIX
      - IEEE 1588-2008 (V2)
      - LinkMD Capable diagnosis
      
      Currently the patch offers support only for link configuration.
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: lan966x: Add support for TC flower filter statistics · 9ed138ff
      Horatiu Vultur authored
      Add flower filter packet statistics. This just reads the TCAM
      counter of the rule, which reports how many packets hit this
      rule.
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 1fe8a3b6
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      ice: various virtualization cleanups
      
      Jacob Keller says:
      
      This series contains a variety of refactors and cleanups in the VF code for
      the ice driver. Its primary focus is cleanup and simplification of the VF
      operations and addition of a few new operations that will be required by
      Scalable IOV, as well as some other refactors needed for the handling of VF
      subfunctions.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        ice: remove unnecessary virtchnl_ether_addr struct use
        ice: introduce .irq_close VF operation
        ice: introduce clear_reset_state operation
        ice: convert vf_ops .vsi_rebuild to .create_vsi
        ice: introduce ice_vf_init_host_cfg function
        ice: add a function to initialize vf entry
        ice: Pull common tasks into ice_vf_post_vsi_rebuild
        ice: move ice_vf_vsi_release into ice_vf_lib.c
        ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg
        ice: refactor VSI setup to use parameter structure
        ice: drop unnecessary VF parameter from several VSI functions
        ice: fix function comment referring to ice_vsi_alloc
        ice: Add more usage of existing function ice_get_vf_vsi(vf)
      ====================
      
      Link: https://lore.kernel.org/r/20230206214813.20107-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Fix memleak in health diagnose callback · cb6b2e11
      Moshe Shemesh authored
      The callback devlink_nl_cmd_health_reporter_diagnose_doit() is missing
      a devlink_fmsg_free() call, which leads to a memory leak.
      
      Fix it by adding devlink_fmsg_free().
      
      Fixes: e994a75f ("devlink: remove reporter reference counting")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/1675698976-45993-1-git-send-email-moshe@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: flower: add check for flower VF netdevs for get/set_eeprom · f8175547
      James Hershaw authored
      Move the nfp_net_get_port_mac_by_hwinfo() check earlier in the
      get/set_eeprom() functions in order to check for a VF netdev, which
      this function does not support.
      
      It is debatable if this is a fix or an enhancement, and we have chosen
      to go for the latter. It does address a problem introduced by
      commit 74b4f173 ("nfp: flower: change get/set_eeprom logic and enable for flower reps").
      However, the ethtool->len == 0 check avoids the problem manifesting as a
      run-time bug (NULL pointer dereference of app).
      Signed-off-by: James Hershaw <james.hershaw@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230206154836.2803995-1-simon.horman@corigine.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'mlxsw-misc-devlink-changes' · b24e9de3
      Jakub Kicinski authored
      Petr Machata says:
      
      ====================
      mlxsw: Misc devlink changes
      
      This patchset adjusts mlxsw to recent devlink changes in net-next.
      
      Patch #1 removes a devl_param_driverinit_value_set() call that was
      unnecessary, but now additionally triggers a WARN_ON.
      
      Patches #2-#4 are non-functional preparations for the following patches.
      
      Patch #5 fixes a use-after-free that is triggered while changing network
      namespaces.
      
      Patch #6 makes mlxsw consistent with netdevsim by having mlxsw register
      its devlink instance before its sub-objects. It helps us avoid a warning
      described in the commit message.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1675692666.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: core: Register devlink instance before sub-objects · 9d9a90cd
      Ido Schimmel authored
      Recent changes made it possible to register the devlink instance before
      its sub-objects and under the instance lock. Among other things, it
      allows us to avoid warnings such as this one [1]. The warning is
      generated because a buggy firmware is generating a health event during
      driver initialization, before the devlink instance is registered.
      
      Move the registration of the devlink instance to the beginning of the
      initialization flow to avoid such problems.
      
      A similar change was implemented in netdevsim in commit 82a3aef2
      ("netdevsim: move devlink registration under the instance lock").
      
      [1]
      WARNING: CPU: 3 PID: 49 at net/devlink/leftover.c:7509 devlink_recover_notify.constprop.0+0xaf/0xc0
      [...]
      Call Trace:
       <TASK>
       devlink_health_report+0x45/0x1d0
       mlxsw_core_health_event_work+0x24/0x30 [mlxsw_core]
       process_one_work+0x1db/0x390
       worker_thread+0x49/0x3b0
       kthread+0xe5/0x110
       ret_from_fork+0x1f/0x30
       </TASK>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_acl_tcam: Move devlink param to TCAM code · 74cbc3c0
      Ido Schimmel authored
      Cited commit added 'DEVLINK_CMD_PARAM_DEL' notifications whenever the
      network namespace of the devlink instance is changed. Specifically, the
      notifications are generated after calling reload_down(), but before
      calling reload_up(). At this stage, the data structures accessed while
      reading the value of the "acl_region_rehash_interval" devlink parameter
      are uninitialized, resulting in a use-after-free [1].
      
      Fix by moving the registration and unregistration of the devlink
      parameter to the TCAM code where it is actually used. This means that
      the parameter is unregistered during reload_down() and then
      re-registered during reload_up(), avoiding the use-after-free between
      these two operations.
      
      Reproducer:
      
       # ip netns add test123
       # devlink dev reload pci/0000:06:00.0 netns test123
      
      [1]
      BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
      Read of size 4 at addr ffff888162fd37d8 by task devlink/1323
      [...]
      Call Trace:
       <TASK>
       dump_stack_lvl+0x95/0xbd
       print_report+0x181/0x4a1
       kasan_report+0xdb/0x200
       mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
       mlxsw_sp_params_acl_region_rehash_intrvl_get+0x32/0x80
       devlink_nl_param_fill.constprop.0+0x29a/0x11e0
       devlink_param_notify.constprop.0+0xb9/0x250
       devlink_notify_unregister+0xbc/0x470
       devlink_reload+0x1aa/0x440
       devlink_nl_cmd_reload+0x559/0x11b0
       genl_family_rcv_msg_doit.isra.0+0x1f8/0x2e0
       genl_rcv_msg+0x558/0x7f0
       netlink_rcv_skb+0x170/0x440
       genl_rcv+0x2d/0x40
       netlink_unicast+0x53f/0x810
       netlink_sendmsg+0x961/0xe80
       __sys_sendto+0x2a4/0x420
       __x64_sys_sendto+0xe5/0x1c0
       do_syscall_64+0x38/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 7d7e9169 ("devlink: move devlink reload notifications back in between _down() and _up() calls")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_acl_tcam: Reorder functions to avoid forward declarations · 194ab947
      Ido Schimmel authored
      Move the initialization and de-initialization code further below in
      order to avoid forward declarations in the next patch. No functional
      changes.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_acl_tcam: Make fini symmetric to init · 61fe3b91
      Ido Schimmel authored
      Move mutex_destroy() to the end to make the function symmetric with
      mlxsw_sp_acl_tcam_init(). No functional changes.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_acl_tcam: Add missing mutex_destroy() · 65823e07
      Ido Schimmel authored
      Pair mutex_init() with a mutex_destroy() in the error path. Found during
      code review. No functional changes.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum: Remove pointless call to devlink_param_driverinit_value_set() · 8b50ac29
      Danielle Ratson authored
      The "acl_region_rehash_interval" devlink parameter is a "runtime"
      parameter, making the call to devl_param_driverinit_value_set()
      pointless. Before cited commit the function simply returned an error
      (that was not checked), but now it emits a WARNING [1].
      
      Fix by removing the function call.
      
      [1]
      WARNING: CPU: 0 PID: 7 at net/devlink/leftover.c:10974
      devl_param_driverinit_value_set+0x8c/0x90
      [...]
      Call Trace:
       <TASK>
       mlxsw_sp2_params_register+0x83/0xb0 [mlxsw_spectrum]
       __mlxsw_core_bus_device_register+0x5e5/0x990 [mlxsw_core]
       mlxsw_core_bus_device_register+0x42/0x60 [mlxsw_core]
       mlxsw_pci_probe+0x1f0/0x230 [mlxsw_pci]
       local_pci_probe+0x1a/0x40
       work_for_cpu_fn+0xf/0x20
       process_one_work+0x1db/0x390
       worker_thread+0x1d5/0x3b0
       kthread+0xe5/0x110
       ret_from_fork+0x1f/0x30
       </TASK>
      
      Fixes: 85fe0b32 ("devlink: make devlink_param_driverinit_value_set() return void")
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8b50ac29
    • Vladimir Oltean's avatar
      net: enetc: add support for MAC Merge statistics counters · cf52bd23
      Vladimir Oltean authored
      Add PF driver support for the following:
      
      - Viewing the standardized MAC Merge layer counters.
      
      - Viewing the standardized Ethernet MAC and RMON counters associated
        with the pMAC.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230206094531.444988-2-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cf52bd23
    • Vladimir Oltean's avatar
      net: enetc: add support for MAC Merge layer · c7b9e808
      Vladimir Oltean authored
      Add PF driver support for viewing and changing the MAC Merge sublayer
      parameters, and seeing the verification state machine's current state.
      The verification handshake with the link partner is driven by hardware.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230206094531.444988-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c7b9e808
    • Jakub Kicinski's avatar
      Merge branch 'sched-cpumask-improve-on-cpumask_local_spread-locality' · cc74ca30
      Jakub Kicinski authored
      Yury Norov says:
      
      ====================
      sched: cpumask: improve on cpumask_local_spread() locality
      
      cpumask_local_spread() currently checks the local node for the i'th CPU
      and, if it finds nothing, falls back to a flat search among all non-local
      CPUs. We can do better by checking CPUs per NUMA hop.
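
As a rough illustration of hop-ordered selection, the sketch below picks the i'th CPU when CPUs are ordered by increasing distance from a given node. It is a userspace model, not the kernel implementation: the 4-node topology, the cpu_to_node table and the name nth_cpu_by_hops() are assumptions made up for this example.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NR_CPUS  8

/* Hypothetical topology: 2 CPUs per node, distances in numactl -H style. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 11, 32, 32 },
	{ 11, 10, 32, 32 },
	{ 32, 32, 10, 11 },
	{ 32, 32, 11, 10 },
};
static const int cpu_to_node[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };

/* Return the i'th CPU when CPUs are ordered by distance from 'node'. */
static int nth_cpu_by_hops(unsigned int i, int node)
{
	int level = -1;

	assert(node >= 0 && node < NR_NODES);
	i %= NR_CPUS;			/* wrap: always return some CPU */

	for (;;) {
		int next = INT_MAX;

		/* Find the smallest distance greater than the current level. */
		for (int n = 0; n < NR_NODES; n++) {
			int d = node_distance[node][n];

			if (d > level && d < next)
				next = d;
		}
		if (next == INT_MAX)
			return -1;	/* not reached, because i was wrapped */
		level = next;

		/* Walk the CPUs that sit exactly at this distance. */
		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			if (node_distance[node][cpu_to_node[cpu]] != level)
				continue;
			if (i-- == 0)
				return cpu;
		}
	}
}
```

With node 3 as the local node, indices 0 and 1 map to the local CPUs 6 and 7, and index 2 maps to CPU 4 on node 2 (distance 11). A flat search over non-local CPUs in numerical order would have returned CPU 0 (distance 32) for that index.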
      
      This has significant performance implications on NUMA machines, for example
      when using NUMA-aware allocated memory together with NUMA-aware IRQ
      affinity hints.
      
      Performance tests from patch 8 of this series for the Mellanox network
      driver show:
      
        TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
        Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
      
        +-------------------------+-----------+------------------+------------------+
        |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
        +-------------------------+-----------+------------------+------------------+
        | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
        +-------------------------+-----------+------------------+------------------+
        | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
        +-------------------------+-----------+------------------+------------------+
        | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
        +-------------------------+-----------+------------------+------------------+
        | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
        +-------------------------+-----------+------------------+------------------+
      
        Bottleneck in RX side is released, reached linerate (~1.8x speedup).
        ~30% less cpu util on TX.
      ====================
      
      Link: https://lore.kernel.org/r/20230121042436.2661843-1-yury.norov@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cc74ca30
    • Yury Norov's avatar
      lib/cpumask: update comment for cpumask_local_spread() · 2ac4980c
      Yury Norov authored
      Now that we have an iterator-based alternative for the very common case
      of calling cpumask_local_spread() for all CPUs in a row, it's worth
      mentioning that in the comment for cpumask_local_spread().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2ac4980c
    • Tariq Toukan's avatar
      net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints · 2acda577
      Tariq Toukan authored
      In the IRQ affinity hints, replace the binary NUMA preference (local /
      remote) with the improved for_each_numa_hop_cpu() API that minds the
      actual distances, so that remote NUMAs with short distance are preferred
      over farther ones.
      
      This has significant performance implications when using NUMA-aware
      allocated memory (follow [1] and derivatives for example).
      
      [1]
      drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
         int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
      
      Performance tests:
      
      TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
      Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
      
      +-------------------------+-----------+------------------+------------------+
      |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
      +-------------------------+-----------+------------------+------------------+
      | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
      +-------------------------+-----------+------------------+------------------+
      | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
      +-------------------------+-----------+------------------+------------------+
      | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
      +-------------------------+-----------+------------------+------------------+
      | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
      +-------------------------+-----------+------------------+------------------+
      
      Bottleneck in RX side is released, reached linerate (~1.8x speedup).
      ~30% less cpu util on TX.
      
      * CPU util on active cores only.
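
The summary figures can be checked against the table with two small helpers. This is only a sketch of the arithmetic. The commit message does not say which rows the "~30%" figure compares, so the pairing used in the note below is an assumption; the function names are made up for this example.

```c
#include <assert.h>

/* Ratio of the new value to the old one, e.g. a throughput speedup. */
static double speedup(double before, double after)
{
	assert(before > 0.0);
	return after / before;
}

/* Relative reduction, as a fraction of the starting value. */
static double reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before;
}
```

speedup(52.3, 95.1) is about 1.82, matching the "~1.8x" bandwidth claim (Baseline against Applied on both sides). For TX-side CPU utilization, reduction(11.9, 8.4) is about 0.29 (RX side only against both sides), while reduction(6.4, 5.2) is about 0.19 (Baseline against TX side only), so the "~30%" figure appears to match the first pairing.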
      
      Setup details (similar for both sides):
      
      NIC: ConnectX6-DX dual port, 100 Gbps each.
      Single port used in the tests.
      
      $ lscpu
      Architecture:        x86_64
      CPU op-mode(s):      32-bit, 64-bit
      Byte Order:          Little Endian
      CPU(s):              256
      On-line CPU(s) list: 0-255
      Thread(s) per core:  2
      Core(s) per socket:  64
      Socket(s):           2
      NUMA node(s):        16
      Vendor ID:           AuthenticAMD
      CPU family:          25
      Model:               1
      Model name:          AMD EPYC 7763 64-Core Processor
      Stepping:            1
      CPU MHz:             2594.804
      BogoMIPS:            4890.73
      Virtualization:      AMD-V
      L1d cache:           32K
      L1i cache:           32K
      L2 cache:            512K
      L3 cache:            32768K
      NUMA node0 CPU(s):   0-7,128-135
      NUMA node1 CPU(s):   8-15,136-143
      NUMA node2 CPU(s):   16-23,144-151
      NUMA node3 CPU(s):   24-31,152-159
      NUMA node4 CPU(s):   32-39,160-167
      NUMA node5 CPU(s):   40-47,168-175
      NUMA node6 CPU(s):   48-55,176-183
      NUMA node7 CPU(s):   56-63,184-191
      NUMA node8 CPU(s):   64-71,192-199
      NUMA node9 CPU(s):   72-79,200-207
      NUMA node10 CPU(s):  80-87,208-215
      NUMA node11 CPU(s):  88-95,216-223
      NUMA node12 CPU(s):  96-103,224-231
      NUMA node13 CPU(s):  104-111,232-239
      NUMA node14 CPU(s):  112-119,240-247
      NUMA node15 CPU(s):  120-127,248-255
      ..
      
      $ numactl -H
      ..
      node distances:
      node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
        0:  10  11  11  11  12  12  12  12  32  32  32  32  32  32  32  32
        1:  11  10  11  11  12  12  12  12  32  32  32  32  32  32  32  32
        2:  11  11  10  11  12  12  12  12  32  32  32  32  32  32  32  32
        3:  11  11  11  10  12  12  12  12  32  32  32  32  32  32  32  32
        4:  12  12  12  12  10  11  11  11  32  32  32  32  32  32  32  32
        5:  12  12  12  12  11  10  11  11  32  32  32  32  32  32  32  32
        6:  12  12  12  12  11  11  10  11  32  32  32  32  32  32  32  32
        7:  12  12  12  12  11  11  11  10  32  32  32  32  32  32  32  32
        8:  32  32  32  32  32  32  32  32  10  11  11  11  12  12  12  12
        9:  32  32  32  32  32  32  32  32  11  10  11  11  12  12  12  12
       10:  32  32  32  32  32  32  32  32  11  11  10  11  12  12  12  12
       11:  32  32  32  32  32  32  32  32  11  11  11  10  12  12  12  12
       12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  11  11
       13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  11  11
       14:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  10  11
       15:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  11  10
      
      $ cat /sys/class/net/ens5f0/device/numa_node
      14
      
      Affinity hints (127 IRQs):
      Before:
      331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
      332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
      333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
      334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
      335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
      336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
      337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
      338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
      339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
      348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
      349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004
      350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008
      351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010
      352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020
      353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040
      354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
      355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
      356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200
      357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400
      358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800
      359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000
      360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
      361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000
      362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
      363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
      364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000
      365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000
      366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000
      367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000
      368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000
      369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000
      370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000
      371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000
      372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000
      373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000
      374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000
      375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000
      376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000
      377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000
      378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000
      379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000
      380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000
      381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000
      382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000
      383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000
      384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000
      385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000
      386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000
      387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000
      388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000
      389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000
      390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000
      391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000
      392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000
      393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000
      394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000
      395: 00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000
      396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000
      397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000
      398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000
      399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000
      400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000
      401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000
      402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000
      403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000
      404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000
      405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000
      406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000
      407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000
      408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000
      409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000
      410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000
      411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
      412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
      413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
      414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
      415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
      416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
      417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
      418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
      419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
      420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
      421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
      422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
      423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
      424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
      425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
      426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
      427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
      428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
      429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
      430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
      431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
      432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
      433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
      434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
      435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
      436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
      437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
      438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
      439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
      440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
      441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
      442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
      443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
      444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
      445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
      446: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
      447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
      448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
      449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
      450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
      451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
      452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
      453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
      454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
      455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
      456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
      457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
      
      After:
      331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
      332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
      333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
      334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
      335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
      336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
      337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
      338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
      339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
      348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
      349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
      350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
      351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
      352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
      353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
      354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
      355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
      356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
      357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
      358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
      359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
      360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
      361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
      362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000
      363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000
      364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000
      365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000
      366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000
      367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000
      368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000
      369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000
      370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000
      371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
      395: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
      396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
      397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
      398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
      399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
      400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
      401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
      402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
      403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
      404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
      405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
      406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
      407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
      408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
      409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
      410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
      411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
      412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
      413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
      414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
      415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
      416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
      417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
      418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
      419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
      420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
      421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
      422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
      423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
      424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
      425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
      426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
      427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000
      428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000
      429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000
      430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000
      431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000
      432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000
      433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000
      434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000
      435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000
      436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000
      437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000
      438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000
      439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000
      440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000
      441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000
      442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000
      443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000
      444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000
      445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000
      446: 00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
      447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
      448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
      449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
      450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
      451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
      452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
      453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
      454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
      455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
      456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
      457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000
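
To read the masks above, each comma-separated group is a 32-bit hex word, most significant word first, and bit n of the whole mask stands for CPU n. The sketch below decodes the lowest CPU in such a mask and maps it to a NUMA node. The helper names are hypothetical, and epyc_cpu_to_node() hard-codes the CPU layout from the lscpu listing above, so it is valid for that test machine only.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return the lowest CPU number set in a comma-separated hex mask such as
 * "00000000,00010000" (most significant 32-bit word first), or -1 if no
 * bit is set.
 */
static int mask_first_cpu(const char *mask)
{
	int nwords = 1;
	int first = -1;
	const char *p;

	assert(mask != NULL);

	for (p = mask; *p; p++)
		if (*p == ',')
			nwords++;

	/* Words are visited from high to low, so the last hit is the lowest. */
	p = mask;
	for (int w = nwords - 1; w >= 0; w--) {
		char *end;
		unsigned long word = strtoul(p, &end, 16);

		for (int bit = 0; bit < 32; bit++) {
			if (word & (1UL << bit)) {
				first = w * 32 + bit;
				break;
			}
		}
		p = (*end == ',') ? end + 1 : end;
	}

	return first;
}

/* Node of a CPU on the test machine: node n holds CPUs 8n..8n+7 and 128+8n..128+8n+7. */
static int epyc_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < 256);
	return (cpu % 128) / 8;
}
```

Applied to the listings, IRQ 331 decodes to CPU 112 on node 14 (the NIC's node) both before and after. IRQ 347 decodes to CPU 0 on node 0 (distance 32 from node 14) before the change, and to CPU 96 on node 12 (distance 11) after it.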
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      [Tweaked API use]
      Suggested-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2acda577
    • Valentin Schneider's avatar
      sched/topology: Introduce for_each_numa_hop_mask() · 06ac0172
      Valentin Schneider authored
      The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
      reachable within a given distance budget. Wrap the logic for iterating over
      all (distance, mask) values inside an iterator macro.
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
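The iteration pattern described in this commit can be sketched in Python as a generator. This is only an illustration, not the kernel macro: `numa_hop_masks()`, its arguments, and the two-node topology are hypothetical, and the hop levels are taken from the starting node's own row of the distance table, which is a simplification.

```python
def numa_hop_masks(node, distances, node_cpus):
    """Yield (distance, cpus) pairs for `node`, nearest hop first.

    Each yielded set is cumulative: it holds every CPU whose node lies
    within the given distance budget of `node`.
    """
    for budget in sorted(set(distances[node])):
        cpus = frozenset(
            cpu
            for other, dist in enumerate(distances[node])
            if dist <= budget
            for cpu in node_cpus[other]
        )
        yield budget, cpus


# Hypothetical two-node machine, used only for illustration.
distances = [[10, 20], [20, 10]]
node_cpus = [[0, 1], [2, 3]]

prev = frozenset()
for dist, cpus in numa_hop_masks(0, distances, node_cpus):
    # CPUs that became reachable at this hop: current mask minus previous one.
    print(dist, sorted(cpus - prev))
    prev = cpus
```

Because each mask is cumulative, a caller that wants only the CPUs added at a given hop subtracts the previous mask, as the loop above does.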
    • sched/topology: Introduce sched_numa_hop_mask() · 9feae658
      Valentin Schneider authored
      Tariq has pointed out that drivers allocating IRQ vectors would benefit
      from smarter NUMA awareness: cpumask_local_spread() only knows about the
      local node, and everything outside it falls into the same bucket.
      
      sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
      of CPUs reachable within a given distance budget), so introduce
      sched_numa_hop_mask() to export those cpumasks.
      
      Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
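As a rough model of "a cpumask of CPUs reachable within a given distance budget", the sketch below computes such a set in Python. The distance table and CPU lists are copied from the numactl -H output quoted in the cpumask_local_spread() locality commit in this log. The function name `numa_hop_mask()`, the use of sorted distinct distances as hop levels, and returning `None` for an out-of-range hop are assumptions made for illustration, not the kernel implementation.

```python
# Topology taken from the numactl -H output quoted in this log.
DISTANCES = [
    [10, 50, 30, 70],
    [50, 10, 70, 30],
    [30, 70, 10, 50],
    [70, 30, 50, 10],
]
NODE_CPUS = [[0, 1, 2, 3], [4, 5], [6, 7], list(range(8, 16))]

# Assumption: hop level k corresponds to the k-th smallest distinct distance.
LEVELS = sorted({d for row in DISTANCES for d in row})


def numa_hop_mask(node, hops):
    """Return the set of CPUs at most `hops` hops away from `node`.

    Returns None when `hops` exceeds the number of known levels
    (a stand-in for an error return).
    """
    if hops < 0 or hops >= len(LEVELS):
        return None
    budget = LEVELS[hops]
    return {
        cpu
        for other, cpus in enumerate(NODE_CPUS)
        if DISTANCES[node][other] <= budget
        for cpu in cpus
    }


print(sorted(numa_hop_mask(0, 1)))  # local node plus the nearest remote node
```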
    • lib/cpumask: reorganize cpumask_local_spread() logic · b1beed72
      Yury Norov authored
      Now that all NUMA logic has moved into sched_numa_find_nth_cpu(), the
      else-branch of cpumask_local_spread() is just a function call, and
      we can simplify the logic by using the ternary operator.
      
      While here, replace BUG() with WARN_ON().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cpumask: improve on cpumask_local_spread() locality · 406d394a
      Yury Norov authored
      Switch cpumask_local_spread() to use the newly added
      sched_numa_find_nth_cpu(), which takes into account the distance to each
      node in the system.
      
      For the following NUMA configuration:
      
      root@debian:~# numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3
      node 0 size: 3869 MB
      node 0 free: 3740 MB
      node 1 cpus: 4 5
      node 1 size: 1969 MB
      node 1 free: 1937 MB
      node 2 cpus: 6 7
      node 2 size: 1967 MB
      node 2 free: 1873 MB
      node 3 cpus: 8 9 10 11 12 13 14 15
      node 3 size: 7842 MB
      node 3 free: 7723 MB
      node distances:
      node   0   1   2   3
        0:  10  50  30  70
        1:  50  10  70  30
        2:  30  70  10  50
        3:  70  30  50  10
      
      The new cpumask_local_spread() traverses cpus for each node like this:
      
      node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
      node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
      node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
      node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
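The traversal order listed in this commit message can be reproduced with a short Python sketch. It assumes the simplest reading of the message: nodes are visited in order of increasing distance from the starting node, and CPUs within a node in ascending order. The table names and the `spread_order()` helper are illustrative only and are not kernel API.

```python
# Distance table and CPU lists copied from the numactl -H output above.
DISTANCES = [
    [10, 50, 30, 70],
    [50, 10, 70, 30],
    [30, 70, 10, 50],
    [70, 30, 50, 10],
]
NODE_CPUS = [[0, 1, 2, 3], [4, 5], [6, 7], list(range(8, 16))]


def spread_order(node):
    """Return all CPUs ordered by NUMA distance from `node`, nearest first."""
    nodes_by_distance = sorted(
        range(len(NODE_CPUS)), key=lambda other: DISTANCES[node][other]
    )
    return [cpu for other in nodes_by_distance for cpu in NODE_CPUS[other]]


for node in range(len(NODE_CPUS)):
    print(f"node {node}:", *spread_order(node))
```

In this table every row has four distinct distances, so the order is unambiguous. How equidistant nodes would be ordered is not stated in the commit message; the sketch simply falls back to node number.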
    • sched: add sched_numa_find_nth_cpu() · cd7f5535
      Yury Norov authored
      The function finds the Nth set CPU in a given cpumask, starting from a
      given node.
      
      Because each hop in sched_domains_numa_masks includes the same or a
      greater number of CPUs than the previous one, we can use a binary search
      on hops instead of a linear walk, which makes the overall complexity
      O(log n) in terms of the number of cpumask_weight() calls.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
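The binary search over hops can be illustrated with a small Python sketch. It is a simplified model, not the kernel function: masks are plain integers, the extra cpumask argument that the commit message mentions is left out, and the names `weight()`, `nth_set_bit()` and `find_nth_cpu()` are made up for this example. The only property relied upon is the one stated above: each hop mask contains the previous one.

```python
def weight(mask):
    """Number of set bits in `mask` (stand-in for cpumask_weight())."""
    return bin(mask).count("1")


def nth_set_bit(mask, n):
    """Index of the n-th (0-based) set bit in `mask`, or -1 if there is none."""
    bit = 0
    while mask:
        if mask & 1:
            if n == 0:
                return bit
            n -= 1
        mask >>= 1
        bit += 1
    return -1


def find_nth_cpu(hop_masks, n):
    """Return the n-th (0-based) CPU, counting nearest hops first.

    `hop_masks` is a list of cumulative masks, one per hop, each a
    superset of the previous one. Returns -1 if fewer than n + 1 CPUs exist.
    """
    lo, hi = 0, len(hop_masks) - 1
    if not hop_masks or n >= weight(hop_masks[hi]):
        return -1
    # Binary search for the first hop whose mask holds more than n CPUs.
    while lo < hi:
        mid = (lo + hi) // 2
        if weight(hop_masks[mid]) > n:
            hi = mid
        else:
            lo = mid + 1
    prev = hop_masks[lo - 1] if lo else 0
    # Skip the CPUs already covered by nearer hops.
    return nth_set_bit(hop_masks[lo] & ~prev, n - weight(prev))


# Cumulative hop masks for node 0 of the 16-CPU, four-node machine whose
# numactl -H output is quoted elsewhere in this log.
NODE0_HOPS = [0x000F, 0x00CF, 0x00FF, 0xFFFF]
print([find_nth_cpu(NODE0_HOPS, i) for i in range(16)])
```

Each loop iteration calls `weight()` once, so the number of weight computations grows with the logarithm of the number of hops rather than linearly.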
    • cpumask: introduce cpumask_nth_and_andnot · 62f4386e
      Yury Norov authored
      Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
      It is used in the following patch to traverse cpumasks without storing
      an intermediate result in a temporary cpumask.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • lib/find: introduce find_nth_and_andnot_bit · 43245117
      Yury Norov authored
      In the following patches, the function is used to implement in-place
      bitmap traversal without storing intermediate results in temporary bitmaps.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
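The idea of traversing `a & b & ~c` without a temporary bitmap can be sketched in Python by combining the three inputs one word at a time. This is a hedged model only: the word size, the list-of-words representation, and returning `size` when no such bit exists are assumptions made for the example, and the code does not mirror the kernel implementation.

```python
BITS_PER_WORD = 64
WORD_MASK = (1 << BITS_PER_WORD) - 1


def find_nth_and_andnot_bit(a, b, c, size, n):
    """Index of the n-th (0-based) bit set in a & b & ~c, or `size` if none.

    `a`, `b` and `c` are lists of BITS_PER_WORD-bit words covering `size`
    bits. Only one combined word exists at a time, so no temporary bitmap
    of the full result is built.
    """
    nwords = (size + BITS_PER_WORD - 1) // BITS_PER_WORD
    for idx in range(nwords):
        word = a[idx] & b[idx] & ~c[idx] & WORD_MASK
        # Ignore bits past `size` in the last, partially used word.
        remaining = size - idx * BITS_PER_WORD
        if remaining < BITS_PER_WORD:
            word &= (1 << remaining) - 1
        word_weight = bin(word).count("1")
        if n < word_weight:
            # The wanted bit is in this word: scan it bit by bit.
            for bit in range(BITS_PER_WORD):
                if (word >> bit) & 1:
                    if n == 0:
                        return idx * BITS_PER_WORD + bit
                    n -= 1
        # Otherwise skip the whole word and keep counting.
        n -= word_weight
    return size


# Bits set in a & b & ~c here are 1, 4, 5 and 6.
a, b, c = [0b11110110], [0b01111110], [0b00000100]
print([find_nth_and_andnot_bit(a, b, c, 8, n) for n in range(5)])
```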