1. 15 Feb, 2021 30 commits
    • net: mvpp2: improve mvpp2_get_sram return · 9ad78d81
      Stefan Chulski authored
      Use PTR_ERR_OR_ZERO instead of IS_ERR and PTR_ERR.
      Non functional change.
      Signed-off-by: Stefan Chulski <stefanc@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
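The PTR_ERR_OR_ZERO conversion can be illustrated outside the kernel. This is a minimal userspace sketch, assuming simplified re-implementations of the helpers from the kernel's err.h; the two `sram_status_*` functions are hypothetical and are not the mvpp2 code.

```c
#include <stdint.h>

/* Userspace approximations of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers live in the last MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

/* Hypothetical "before" form: explicit IS_ERR check followed by PTR_ERR. */
static int sram_status_old(void *base)
{
	if (IS_ERR(base))
		return (int)PTR_ERR(base);
	return 0;
}

/* Hypothetical "after" form: same result in a single expression. */
static int sram_status_new(void *base)
{
	return PTR_ERR_OR_ZERO(base);
}
```

Both forms return 0 for a valid pointer and the negative errno otherwise, which is why the commit can call itself a non-functional change.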
    • net: mvpp2: improve Packet Processor version check · f704177e
      Stefan Chulski authored
      Use >= MVPP22 instead of != MVPP21.
      Non functional change.
      Signed-off-by: Stefan Chulski <stefanc@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: simplify PPv2 version ID read · 8b986866
      Stefan Chulski authored
      PPv2.1 contains 0 in the Version ID register, so the priv->hw_version
      check can be removed.
      Signed-off-by: Stefan Chulski <stefanc@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Propagate-extack-for-switchdev-LANs-from-DSA' · 7f6334f7
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Propagate extack for switchdev VLANs from DSA
      
      This series moves the restriction messages printed by the DSA core, and
      by some individual device drivers, into the netlink extended ack
      structure, to be communicated to user space where possible, or still
      printed to the kernel log from the bridge layer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: propagate extack to .port_vlan_filtering · 89153ed6
      Vladimir Oltean authored
      Some drivers can't dynamically change the VLAN filtering option, or
      impose some restrictions. It would be nice to propagate this info
      through netlink instead of printing it to a kernel log that might never
      be read. Also netlink extack includes the module that emitted the
      message, which means that it's easier to figure out which ones are
      driver-generated errors as opposed to command misuse.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
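How an extack message ends up carrying the name of the module that emitted it can be sketched in userspace C. This assumes a simplified stand-in for struct netlink_ext_ack and a macro modeled on the kernel's NL_SET_ERR_MSG_MOD, which prefixes the message with KBUILD_MODNAME. The module name, message text and callback below are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct netlink_ext_ack. */
struct ext_ack {
	const char *msg;
};

/* Hypothetical module name; the kernel uses KBUILD_MODNAME here. */
#define MODNAME "example_switch"

/*
 * Modeled on NL_SET_ERR_MSG_MOD: the message must be a string literal,
 * so values such as a VLAN ID cannot be formatted into it.
 */
#define SET_ERR_MSG_MOD(extack, text)				\
	do {							\
		if (extack)					\
			(extack)->msg = MODNAME ": " text;	\
	} while (0)

/* Hypothetical driver callback that refuses to turn VLAN filtering off. */
static int port_vlan_filtering(int enable, struct ext_ack *extack)
{
	if (!enable) {
		SET_ERR_MSG_MOD(extack, "Cannot disable VLAN filtering");
		return -95; /* -EOPNOTSUPP on Linux */
	}
	return 0;
}
```

A caller that receives "example_switch: Cannot disable VLAN filtering" can tell at a glance that the refusal came from the driver rather than from command misuse.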
    • net: dsa: propagate extack to .port_vlan_add · 31046a5f
      Vladimir Oltean authored
      Allow drivers to communicate their restrictions to user space directly,
      instead of printing to the kernel log. Where the conversion would have
      been lossy and things like VLAN ID could no longer be conveyed (due to
      the lack of support for printf format specifier in netlink extack), I
      chose to keep the messages in full form to the kernel log only, and
      leave it up to individual driver maintainers to move more messages to
      extack.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: propagate extack through switchdev_port_attr_set · dcbdf135
      Vladimir Oltean authored
      The benefit is the ability to propagate errors from switchdev drivers
      for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING and
      SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL attributes.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: propagate extack through store_bridge_parm · 9e781401
      Vladimir Oltean authored
      The bridge sysfs interface stores parameters for the STP, VLAN,
      multicast etc subsystems using a predefined function prototype.
      Sometimes the underlying function being called supports a netlink
      extended ack message, and we ignore it.
      
      Let's expand the store_bridge_parm function prototype to include the
      extack, and just print it to console, but at least propagate it where
      applicable. Where not applicable, create a shim function in the
      br_sysfs_br.c file that discards the extra function argument.
      
      This patch allows us to propagate the extack argument to
      br_vlan_set_default_pvid, br_vlan_set_proto and br_vlan_filter_toggle,
      and from there, further up in br_changelink from br_netlink.c.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
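The shim approach described in this commit (widen a common setter prototype with an extack argument, and wrap setters that have no use for it) might look roughly like the sketch below. All struct, function and message names here are hypothetical simplifications, not the actual br_sysfs_br.c code.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct ext_ack {
	const char *msg;
};

struct bridge {
	int vlan_filtering;
	unsigned long ageing_time;
};

/* The widened prototype shared by every setter. */
typedef int (*store_fn)(struct bridge *br, unsigned long val,
			struct ext_ack *extack);

/* An extack-aware setter reports why it rejected the value. */
static int vlan_filter_toggle(struct bridge *br, unsigned long val,
			      struct ext_ack *extack)
{
	if (val > 1) {
		extack->msg = "vlan_filtering must be 0 or 1";
		return -22; /* -EINVAL on Linux */
	}
	br->vlan_filtering = (int)val;
	return 0;
}

/* A legacy setter that knows nothing about extack. */
static int set_ageing_time(struct bridge *br, unsigned long val)
{
	br->ageing_time = val;
	return 0;
}

/* Shim: matches the widened prototype and discards the extra argument. */
static int set_ageing_time_shim(struct bridge *br, unsigned long val,
				struct ext_ack *extack)
{
	(void)extack;
	return set_ageing_time(br, val);
}

/*
 * Common store path. Real code would print the message to the console;
 * this sketch hands it back to the caller instead.
 */
static int store_parm(struct bridge *br, unsigned long val, store_fn set,
		      const char **logged)
{
	struct ext_ack extack = { NULL };
	int err = set(br, val, &extack);

	*logged = extack.msg;
	return err;
}
```

The shim keeps the legacy setter untouched while letting a single call site serve both kinds of function.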
    • net: bridge: remove __br_vlan_filter_toggle · 7a572964
      Vladimir Oltean authored
      This function is identical with br_vlan_filter_toggle.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'PTP-for-DSA-tag_ocelot_8021q' · c48f8607
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      PTP for DSA tag_ocelot_8021q
      
      Changes in v2:
      Add stub definition for ocelot_port_inject_frame when switch driver is
      not compiled in.
      
      This is part two of the errata workaround begun here:
      https://patchwork.kernel.org/project/netdevbpf/cover/20210129010009.3959398-1-olteanv@gmail.com/
      
      Now that we have basic traffic support when we operate the Ocelot DSA
      switches without an NPI port, it would be nice to regain some of the
      features lost due to the lack of the NPI port functionality. An
      important one is PTP timestamping, which is intimately tied to the DSA
      frame header added by the NPI port: on TX, we put a "timestamp request
      ID" in the Injection Frame Header, while on RX, the Extraction Frame
      Header contains a partial 32-bit PTP timestamp. Get rid of the NPI port
      and replace it with a VLAN-based tagger, and you lose PTP, right?
      
      Well, not quite, this is what this patch series is about. The NPI port
      is basically a regular Ethernet port configured to service the packets
      in and out of the switch's CPU port module (which has other non-DSA I/O
      mechanisms too, such as register-based MMIO and DMA). If we disable the
      NPI port, we can in theory still access the packets delivered to the CPU
      port module by doing exactly what the ocelot switchdev driver does:
      extracting Ethernet packets through registers (yes, it is as icky as it
      sounds).
      
      However, there's a catch. The Felix switch was integrated into NXP
      LS1028A with the idea in mind that it will operate as DSA, i.e. using
      the CPU port module connected to the NPI port, not having I/O over
      register-based MMIO which is painfully slow and CPU intensive. So
      register-based packet I/O is not supposed to work - those registers aren't
      even documented in the hardware reference manual for Felix. However
      they kinda do, with the exception of the fact that an RX interrupt was
      really not wired to the CPU cores - so we don't know when the CPU port
      module receives a new packet. But we can hack even around that, by
      replicating every packet that goes to the CPU port module and making it
      also go to a plain internal Ethernet port. Then drop the Ethernet packet
      and read the other copy of it from the CPU port module, this time
      annotated with the much-wanted RX timestamp.
      
      This is all fine and it works, but it does raise some questions about
      what DSA even is anymore, if we start having switches that inject some
      of their packets over Ethernet and some through registers, where do we
      draw the line. In principle I believe these concerns are founded, but at
      the same time, the way that the Felix driver uses register MMIO based
      packet I/O is fundamentally the same as how any other DSA driver capable of
      PTP makes use of a side-channel for timestamps like a FIFO (just that
      this one is a lot more complicated, and comes with the entire actual
      packet, not just the timestamp).
      
      Nonetheless, I tried to keep the extra pressure added by this ERR
      workaround upon the DSA subsystem as small as possible, so some of the
      patches are just a revisit of some of Andrew's complaints w.r.t. the
      fact that tag_ocelot already violates any driver <-> tagger boundary,
      and as a consequence, is not able to be used on testbeds such as
      dsa_loop (which it now can). So now, the tag_ocelot and tag_ocelot_8021q
      drivers should be dsa_loop-clean, and have the ERR workarounds as
      self-contained as possible, using all the designated features for PTP
      timestamping and nothing more.
      
      Comments appreciated.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_ocelot_8021q: add support for PTP timestamping · 0a6f17c6
      Vladimir Oltean authored
      For TX timestamping, we use the felix_txtstamp method which is common
      with the regular (non-8021q) ocelot tagger. This method says that skb
      deferral is needed, prepares a timestamp request ID, and puts a clone of
      the skb in a queue waiting for the timestamp IRQ.
      
      felix_txtstamp is called by dsa_skb_tx_timestamp() just before the
      tagger's xmit method. In the tagger xmit, we divert the packets
      classified by dsa_skb_tx_timestamp() as PTP towards the MMIO-based
      injection registers, and we declare them as dead towards dsa_slave_xmit.
      If not PTP, we proceed with normal tag_8021q stuff.
      
      Then the timestamp IRQ fires, the clone queued up from felix_txtstamp is
      matched to the TX timestamp retrieved from the switch's FIFO based on
      the timestamp request ID, and the clone is delivered to the stack.
      
      On RX, thanks to the VCAP IS2 rule that redirects the frames with an
      EtherType for 1588 towards two destinations:
      - the CPU port module (for MMIO based extraction) and
      - if the "no XTR IRQ" workaround is in place, the dsa_8021q CPU port
      the relevant data path processing starts in the ptp_classify_raw BPF
      classifier installed by DSA in the RX data path (post tagger, which is
      completely unaware that it saw a PTP packet).
      
      This time we can't reuse the same implementation of .port_rxtstamp that
      also works with the default ocelot tagger. That is because felix_rxtstamp
      is given an skb with a freshly stripped DSA header, and it says "I don't
      need deferral for its RX timestamp, it's right in it, let me show you";
      and it just points to the header right behind skb->data, from where it
      unpacks the timestamp and annotates the skb with it.
      
      The same thing cannot happen with tag_ocelot_8021q, because for one
      thing, the skb did not have an extraction frame header in the first
      place, but a VLAN tag with no timestamp information. So the code paths
      in felix_rxtstamp for the regular and 8021q tagger are completely
      independent. With tag_8021q, the timestamp must come from the packet's
      duplicate delivered to the CPU port module, but there is potentially
      complex logic to be handled [ and prone to reordering ] if we were to
      just start reading packets from the CPU port module, and try to match
      them to the one we received over Ethernet and which needs an RX
      timestamp. So we do something simple: we tell DSA "give me some time to
      think" (we request skb deferral by returning false from .port_rxtstamp)
      and we just drop the frame we got over Ethernet with no attempt to match
      it to anything - we just treat it as a notification that there's data to
      be processed from the CPU port module's queues. Then we proceed to read
      the packets from those, one by one, which we deliver up the stack,
      timestamped, using netif_rx - the same function that any driver would
      use anyway if it needed RX timestamp deferral. So the assumption is that
      we'll come across the PTP packet that triggered the CPU extraction
      notification eventually, but we don't know when exactly. Thanks to the
      VCAP IS2 trap/redirect rule and the exclusion of the CPU port module
      from the flooding replicators, only PTP frames should be present in the
      CPU port module's RX queues anyway.
      
      There is just one conflict between the VCAP IS2 trapping rule and the
      semantics of the BPF classifier. Namely, ptp_classify_raw() deems
      general messages as non-timestampable, but still, those are trapped to
      the CPU port module since they have an EtherType of ETH_P_1588. So, if
      the "no XTR IRQ" workaround is in place, we need to run another BPF
      classifier on the frames extracted over MMIO, to avoid duplicates being
      sent to the stack (once over Ethernet, once over MMIO). It doesn't look
      like it's possible to install VCAP IS2 rules based on keys extracted
      from the 1588 frame headers.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: setup MMIO filtering rules for PTP when using tag_8021q · c8c0ba4f
      Vladimir Oltean authored
      Since the tag_8021q tagger is software-defined, it has no means by
      itself for retrieving hardware timestamps of PTP event messages.
      
      Because we do want to support PTP on ocelot even with tag_8021q, we need
      to use the CPU port module for that. The RX timestamp is present in the
      Extraction Frame Header. And because we can't use NPI mode which redirects
      the CPU queues to an "external CPU" (meaning the ARM CPU running Linux),
      then we need to poll the CPU port module through the MMIO registers to
      retrieve TX and RX timestamps.
      
      Sadly, on NXP LS1028A, the Felix switch was integrated into the SoC
      without wiring the extraction IRQ line to the ARM GIC. So, if we want to
      be notified of any PTP packets received on the CPU port module, we have
      a problem.
      
      There is a possible workaround, which is to use the Ethernet CPU port as
      a notification channel that packets are available on the CPU port module
      as well. When a PTP packet is received by the DSA tagger (without timestamp,
      of course), we go to the CPU extraction queues, poll for it there, then
      we drop the original Ethernet packet and masquerade the packet retrieved
      over MMIO (plus the timestamp) as the original when we inject it up the
      stack.
      
      Create a quirk in struct felix that is selected by the Felix driver (but not
      by Seville, since that doesn't support PTP at all). We want to do this
      such that the workaround is minimally invasive for future switches that
      don't require this workaround.
      
      The only traffic for which we need timestamps is PTP traffic, so add a
      redirection rule to the CPU port module for this. Currently we only have
      the need for PTP over L2, so redirection rules for UDP ports 319 and 320
      are TBD for now.
      
      Note that for the workaround of matching of PTP-over-Ethernet-port with
      PTP-over-MMIO queues to work properly, both channels need to be
      absolutely lossless. There are two parts to achieving that:
      - We keep flow control enabled on the tag_8021q CPU port
      - We put the DSA master interface in promiscuous mode, so it will never
        drop a PTP frame (for the profiles we are interested in, these are
        sent to the multicast MAC addresses of 01-80-c2-00-00-0e and
        01-1b-19-00-00-00).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: refactor ocelot_xtr_irq_handler into ocelot_xtr_poll · 924ee317
      Vladimir Oltean authored
      Since the felix DSA driver will need to poll the CPU port module for
      extracted frames as well, let's create some common functions that read
      an Extraction Frame Header, and then an skb, from a CPU extraction
      group.
      
      We abuse the struct ocelot_ops :: port_to_netdev function a little bit,
      in order to retrieve the DSA port net_device or the ocelot switchdev
      net_device based on the source port information from the Extraction
      Frame Header, but it's all in the benefit of code simplification -
      netdev_alloc_skb needs it. Originally, the port_to_netdev method was
      intended for parsing act->dev from tc flower offload code.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_ocelot: create separate tagger for Seville · 7c4bb540
      Vladimir Oltean authored
      The ocelot tagger is a hot mess currently, it relies on memory
      initialized by the attached driver for basic frame transmission.
      This is against all that DSA tagging protocols stand for, which is that
      the transmission and reception of a DSA-tagged frame, the data path,
      should be independent from the switch control path, because the tag
      protocol is in principle hot-pluggable and reusable across switches
      (even if in practice it wasn't until very recently). But if another
      driver like dsa_loop wants to make use of tag_ocelot, it couldn't.
      
      This was done to have common code between Felix and Ocelot, which have
      one bit difference in the frame header format. Quoting from commit
      67c24049 ("net: dsa: felix: create a template for the DSA tags on
      xmit"):
      
          Other alternatives have been analyzed, such as:
          - Create a separate tag_seville.c: too much code duplication for just 1
            bit field difference.
          - Create a separate DSA_TAG_PROTO_SEVILLE under tag_ocelot.c, just like
            tag_brcm.c, which would have a separate .xmit function. Again, too
            much code duplication for just 1 bit field difference.
          - Allocate the template from the init function of the tag_ocelot.c
            module, instead of from the driver: couldn't figure out a method of
            accessing the correct port template corresponding to the correct
            tagger in the .xmit function.
      
      The really interesting part is that Seville should have had its own
      tagging protocol defined - it is not compatible on the wire with Ocelot,
      even for that single bit. In principle, a packet generated by
      DSA_TAG_PROTO_OCELOT when booted on NXP LS1028A would look in a certain
      way, but when booted on NXP T1040 it would look differently. The reverse
      is also true: a packet generated by a Seville switch would be
      interpreted incorrectly by Wireshark if it was told it was generated by
      an Ocelot switch.
      
      Actually things are a bit more nuanced. If we concentrate only on the
      DSA tag, what I said above is true, but Ocelot/Seville also support an
      optional DSA tag prefix, which can be short or long, and it is possible
      to distinguish the two taggers based on an integer constant put in that
      prefix. Nonetheless, creating a separate tagger is still justified,
      since the tag prefix is optional, and without it, there is again no way
      to distinguish.
      
      Claiming backwards binary compatibility is a bit more tough, since I've
      already changed the format of tag_ocelot once, in commit 5124197c
      ("net: dsa: tag_ocelot: use a short prefix on both ingress and egress").
      Therefore I am not very concerned with treating this as a bugfix and
      backporting it to stable kernels (which would be another mess due to the
      fact that there would be lots of conflicts with the other DSA_TAG_PROTO*
      definitions). It's just simpler to say that the string values of the
      taggers have ABI value starting with kernel 5.12, which will be when the
      changing of tag protocol via /sys/class/net/<dsa-master>/dsa/tagging
      goes live.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_ocelot: single out PTP-related transmit tag processing · 62bf5fde
      Vladimir Oltean authored
      There is one place where we cannot avoid accessing driver data, and that
      is 2-step PTP TX timestamping, since the switch wants us to provide a
      timestamp request ID through the injection header, which naturally must
      come from a sequence number kept by the driver (it is generated by the
      .port_txtstamp method prior to the tagger's xmit).
      
      However, since other drivers like dsa_loop do not claim PTP support
      anyway, the DSA_SKB_CB(skb)->clone will always be NULL anyway, so if we
      move all PTP-related dereferences of struct ocelot and struct ocelot_port
      into a separate function, we can effectively ensure that this is dead
      code when the ocelot tagger is attached to non-ocelot switches, and the
      stateful portion of the tagger is more self-contained.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: use common tag parsing code with DSA · 40d3f295
      Vladimir Oltean authored
      The Injection Frame Header and Extraction Frame Header that the switch
      prepends to frames over the NPI port are also prepended to frames
      delivered over the CPU port module's queues.
      
      Let's unify the handling of the frame headers by making the ocelot
      driver call some helpers exported by the DSA tagger. Among other things,
      this allows us to get rid of the strange cpu_to_be32 when transmitting
      the Injection Frame Header on ocelot, since the packing API uses
      network byte order natively (when "quirks" is 0).
      
      The comments above ocelot_gen_ifh talk about setting pop_cnt to 3, and
      the cpu extraction queue mask to something, but the code doesn't do it,
      so we don't do it either.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_ocelot: avoid accessing ds->priv in ocelot_rcv · 8a678bb2
      Vladimir Oltean authored
      Taggers should be written to do something valid irrespective of the
      switch driver that they are attached to. This is even more true now,
      because since the introduction of the .change_tag_protocol method, a
      certain tagger is not necessarily strictly associated with a driver any
      longer, and I would like to be able to test all taggers with dsa_loop in
      the future.
      
      In the case of ocelot, it needs to move the classified VLAN from the DSA
      tag into the skb if the port is VLAN-aware. We can allow it to do that
      by looking at the dp->vlan_filtering property, no need to invoke
      structures which are specific to ocelot.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: refactor ocelot_port_inject_frame out of ocelot_port_xmit · 137ffbc4
      Vladimir Oltean authored
      The felix DSA driver will inject some frames through register MMIO, same
      as ocelot switchdev currently does. So we need to be able to reuse the
      common code.
      
      Also create some shim definitions, since the DSA tagger can be compiled
      without support for the switch driver.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: use DIV_ROUND_UP helper in ocelot_port_inject_frame · 5f016f42
      Vladimir Oltean authored
      This looks a bit nicer than the open-coded "(x + 3) / 4" idiom.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
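For reference, rounding a byte count up to whole 32-bit words can be written either way. The sketch below defines DIV_ROUND_UP as the kernel does for positive operands; the two helper functions are hypothetical and only show that both forms compute the same word count.

```c
#include <stddef.h>

/* Same definition the kernel uses for positive operands. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Open-coded round-up of a byte length to a count of 4-byte words. */
static size_t words_open_coded(size_t len)
{
	return (len + 3) / 4;
}

/* The same computation with the intent spelled out by the helper. */
static size_t words_with_helper(size_t len)
{
	return DIV_ROUND_UP(len, 4);
}
```

A 5-byte frame needs 2 words and a 64-byte frame needs 16 under both forms; the helper simply makes the rounding intent explicit.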
    • net: mscc: ocelot: better error handling in ocelot_xtr_irq_handler · a94306ce
      Vladimir Oltean authored
      The ocelot_rx_frame_word() function can return a negative error code,
      however this isn't being checked for consistently. Errors being ignored
      have not been seen in practice though.
      
      Also, some constructs can be simplified by using "goto" instead of
      repeated "break" statements.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: only drain extraction queue on error · d7795f8f
      Vladimir Oltean authored
      It appears that the intention of this snippet of code is to not exit
      ocelot_xtr_irq_handler() while in the middle of extracting a frame.
      The problem in extracting it word by word is that future extraction
      attempts are really easy to get desynchronized, since the IRQ handler
      assumes that the first 16 bytes are the IFH, which give further
      information about the frame, such as frame length.
      
      But during normal operation, "err" will not be 0, but 4, set from here:
      
      		for (i = 0; i < OCELOT_TAG_LEN / 4; i++) {
      			err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]);
      			if (err != 4)
      				break;
      		}
      
      		if (err != 4)
      			break;
      
      In that case, draining the extraction queue is a no-op. So explicitly
      make this code execute only on negative err.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
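The distinction between "err is non-zero" and "err is negative" can be shown with a small simulation. This is a sketch under stated assumptions: the queue structure, the helper names and the -5 error value are hypothetical stand-ins, and only the convention that a successful word read returns 4 is taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_LEN 16 /* hypothetical stand-in for OCELOT_TAG_LEN, in bytes */

/* Simulated extraction queue; fail_at < 0 means "never fail". */
struct fake_queue {
	const uint32_t *words;
	size_t len;
	size_t pos;
	int fail_at;
};

/* Returns 4 (bytes read) on success, a negative value on error. */
static int rx_frame_word(struct fake_queue *q, uint32_t *out)
{
	if (q->pos >= q->len)
		return -5;
	if (q->fail_at >= 0 && q->pos == (size_t)q->fail_at)
		return -5;
	*out = q->words[q->pos++];
	return 4;
}

/* Discards whatever is left in the queue and reports how many words. */
static size_t drain_queue(struct fake_queue *q)
{
	size_t dropped = q->len - q->pos;

	q->pos = q->len;
	return dropped;
}

/*
 * Reads the frame header word by word. Draining happens only when the
 * last read returned a negative error, never on the normal value of 4.
 * Returns the number of words drained.
 */
static size_t read_header(struct fake_queue *q, uint32_t ifh[TAG_LEN / 4],
			  int *err_out)
{
	int err = 0;
	size_t i;

	for (i = 0; i < TAG_LEN / 4; i++) {
		err = rx_frame_word(q, &ifh[i]);
		if (err != 4)
			break;
	}

	*err_out = err;
	if (err < 0)
		return drain_queue(q);
	return 0;
}
```

With a plain `if (err)` check the drain path would also run after every successful header read, because err is 4 in that case.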
    • net: mscc: ocelot: stop returning IRQ_NONE in ocelot_xtr_irq_handler · f833ca29
      Vladimir Oltean authored
      Since the xtr (extraction) IRQ of the ocelot switch is not shared, then
      if it fired, it means that some data must be present in the queues of
      the CPU port module. So simplify the code.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bnxt_en-next' · 14026192
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Error recovery optimizations.
      
      This series implements some optimizations to error recovery.  One
      patch adds an echo/reply mechanism with firmware to enhance error
      detection.  The other patches speed up the recovery process by
      polling config space earlier and by selectively initializing
      context memory during re-initialization.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Improve logging of error recovery settings information. · f4d95c3c
      Michael Chan authored
      We currently only log the error recovery settings if recovery is enabled.
      In some cases, firmware disables error recovery after it was
      initially enabled.  Without logging anything, the user will not be
      aware of this change in setting.
      
      Log it when error recovery is disabled.  Also, change the reset count
      value from hexadecimal to decimal.
      Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Reply to firmware's echo request async message. · df97b34d
      Michael Chan authored
      This is a new async message that the firmware can send to check if it
      can communicate with the driver.  This is an added error detection
      scheme that firmware can use if it suspects errors in the PCIe
      interface.  When the driver receives this async message, it will reply
      back echoing some data in the async message.  If the firmware is not
      getting the reply with the proper data after some retries, error
      recovery will kick in.
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
      Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Initialize "context kind" field for context memory blocks. · 41435c39
      Michael Chan authored
      If firmware provides the offset to the "context kind" field of the
      relevant context memory blocks, we'll initialize just that field for
      each block instead of initializing all of context memory.
      
      Populate the bnxt_mem_init structure with the proper offset returned
      by firmware.  If it is older firmware and the information is not
      available, we set the offset to an invalid value and fall back to
      the old behavior of initializing every byte.  Otherwise, we initialize
      only the "context kind" byte at the offset.
      Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41435c39
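The fallback logic described in the commit above can be sketched in userspace C. This is a minimal illustration, not the driver's code: the structure layout, the field names, the `MEM_INVALID_OFFSET` sentinel, and the function name `init_ctx_mem` are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel meaning "firmware did not report an offset". */
#define MEM_INVALID_OFFSET 0xffff

/* Illustrative stand-in for the bnxt_mem_init structure named in the
 * commit message; the field names are assumptions. */
struct mem_init {
	uint8_t  init_val; /* value to write; 0 means nothing to initialize */
	uint16_t offset;   /* offset of the "context kind" byte in a block */
	uint16_t size;     /* size of one context block in bytes */
};

static void init_ctx_mem(const struct mem_init *mi, void *buf, size_t len)
{
	uint8_t *p = buf;
	size_t i;

	if (!mi->init_val)
		return;

	/* Older firmware (or an unusable block size): initialize every byte. */
	if (mi->offset == MEM_INVALID_OFFSET || mi->size == 0) {
		memset(p, mi->init_val, len);
		return;
	}

	/* Newer firmware: touch only the "context kind" byte of each block. */
	for (i = 0; i + mi->offset < len; i += mi->size)
		p[i + mi->offset] = mi->init_val;
}
```

With a 32-byte buffer, a block size of 8 and an offset of 3, only bytes 3, 11, 19 and 27 are written, so the cost scales with the number of blocks rather than the number of bytes.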
    • Michael Chan's avatar
      bnxt_en: Add context memory initialization infrastructure. · e9696ff3
      Michael Chan authored
      Currently, the driver calls memset() to set all relevant context memory
      used by the chip to the initial value.  This can take many milliseconds
      with the potentially large number of context pages allocated for the
      chip.
      
      To make this faster, we only need to initialize the "context kind" field
      of each block of context memory.  This patch sets up the infrastructure
      to do that with the bnxt_mem_init structure.  In the next patch, we'll
      add the logic to obtain the offset of the "context kind" from the
      firmware.  This patch is not changing the current behavior of calling
      memset() to initialize all relevant context memory.
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9696ff3
    • Michael Chan's avatar
      bnxt_en: Implement faster recovery for firmware fatal error. · dab62e7c
      Michael Chan authored
      During some fatal firmware error conditions, the PCI config space
      register 0x2e which normally contains the subsystem ID will become
      0xffff.  This register will revert back to the normal value after
      the chip has completed core reset.  If we detect this condition,
      we can poll this config register immediately for the value to revert.
      Because we use config read cycles to poll this register, there is no
      possibility of Master Abort if we happen to read it during core reset.
      This speeds up recovery significantly as we don't have to wait for the
      conservative min_time before polling MMIO to see if the firmware has
      come out of reset.  As soon as this register changes value we can
      proceed to re-initialize the device.
      Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
      Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dab62e7c
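The polling scheme in the commit above can be sketched as a small userspace simulation. Everything except the register offset 0x2e and the 0xffff marker is an assumption for illustration: the callback type, the `fake_dev` simulated device, the result enum, and the poll limit are hypothetical, and a real driver would sleep between polls.

```c
#include <stdint.h>

#define SUBSYS_ID_REG     0x2e   /* PCI config offset of the subsystem ID */
#define SUBSYS_ID_INVALID 0xffff /* value seen during the fatal condition */

enum reset_result { RESET_NOT_DETECTED, RESET_DONE, RESET_TIMEOUT };

/* Hypothetical config-space accessor; stands in for a PCI config read. */
typedef uint16_t (*cfg_read16_fn)(void *ctx, int reg);

static enum reset_result poll_core_reset(cfg_read16_fn read16, void *ctx,
					 int max_polls)
{
	int i;

	/* Fast path applies only if the register currently reads 0xffff. */
	if (read16(ctx, SUBSYS_ID_REG) != SUBSYS_ID_INVALID)
		return RESET_NOT_DETECTED;

	for (i = 0; i < max_polls; i++) {
		/* The value reverts once the chip has completed core reset. */
		if (read16(ctx, SUBSYS_ID_REG) != SUBSYS_ID_INVALID)
			return RESET_DONE;
	}
	return RESET_TIMEOUT;
}

/* Simulated device: returns 0xffff for a fixed number of reads. */
struct fake_dev {
	int reads_until_ready;
	uint16_t subsys_id;
};

static uint16_t fake_read16(void *ctx, int reg)
{
	struct fake_dev *d = ctx;

	(void)reg;
	if (d->reads_until_ready > 0) {
		d->reads_until_ready--;
		return SUBSYS_ID_INVALID;
	}
	return d->subsys_id;
}
```

A caller that gets `RESET_NOT_DETECTED` would presumably fall back to the slower path of waiting for the conservative min_time before polling MMIO.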
    • Edwin Peer's avatar
      bnxt_en: selectively allocate context memories · be6d755f
      Edwin Peer authored
      Newer devices may have local context memory instead of relying on the
      host for backing store. In these cases, HWRM_FUNC_BACKING_STORE_QCAPS
      will return a zero entry size to indicate contexts for which the host
      should not allocate backing store.
      
      Selectively allocate context memory based on device capabilities and
      only enable backing store for the appropriate contexts.
      Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be6d755f
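The selective allocation described in the commit above can be sketched as follows. This is a userspace illustration under stated assumptions: `struct ctx_type`, its fields, and both function names are hypothetical, and `calloc()` stands in for whatever allocator the driver uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-context-type record filled in from the capabilities
 * reported by firmware. */
struct ctx_type {
	uint16_t entry_size;  /* 0 means the device keeps this context locally */
	uint32_t max_entries;
	void *backing;        /* host backing store, or NULL if not needed */
};

static void free_backing_store(struct ctx_type *types, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		free(types[i].backing);
		types[i].backing = NULL;
	}
}

/* Returns the number of context types given host backing store,
 * or -1 if an allocation failed. */
static int alloc_backing_store(struct ctx_type *types, size_t n)
{
	int enabled = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		types[i].backing = NULL;

		/* Zero entry size: the host must not allocate for this type.
		 * This sketch also skips types with no entries at all. */
		if (!types[i].entry_size || !types[i].max_entries)
			continue;

		types[i].backing = calloc(types[i].max_entries,
					  types[i].entry_size);
		if (!types[i].backing) {
			free_backing_store(types, i);
			return -1;
		}
		enabled++;
	}
	return enabled;
}
```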
    • Michael Chan's avatar
      bnxt_en: Update firmware interface spec to 1.10.2.16. · 31f67c2e
      Michael Chan authored
      The main changes are the echo request/response from firmware for error
      detection and the NO_FCS feature to transmit frames without FCS.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31f67c2e
  2. 13 Feb, 2021 10 commits
    • David S. Miller's avatar
      Merge branch 'skbuff-introduce-skbuff_heads-bulking-and-reusing' · c4762993
      David S. Miller authored
      Alexander Lobakin says:
      
      ====================
      skbuff: introduce skbuff_heads bulking and reusing
      
      Currently, all kinds of skb allocation always allocate
      skbuff_heads one by one via kmem_cache_alloc().
      On the other hand, we have the percpu napi_alloc_cache to store
      skbuff_heads queued up for freeing and flush them in bulk.
      
      We can use this cache not only for bulk-wiping, but also to obtain
      heads for new skbs and avoid unconditional allocations, as well as
      for bulk-allocating (like XDP's cpumap code and veth driver already
      do).
      
      As this might affect latencies, cache pressure and many hardware-
      and driver-dependent factors, this new feature is mostly optional and
      can be enabled via:
       - a new napi_build_skb() function (as a replacement for build_skb());
       - existing {,__}napi_alloc_skb() and napi_get_frags() functions;
       - __alloc_skb() with passing SKB_ALLOC_NAPI in flags.
      
      iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing
      VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be bigger
      on more powerful hosts and NICs with tens of Mpps.
      
      Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
       - kmalloc()/kmem_cache_alloc() by default allows allocating
         memory from remote nodes to defragment their slabs. This is
         controlled by sysctl, but it means that an skbuff_head from a
         remote node is an acceptable case;
       - The easiest way to check if the slab of skbuff_head is remote or
         pfmemalloc'ed is:
      
      	if (!dev_page_is_reusable(virt_to_head_page(skb)))
      		/* drop it */;
      
         ...*but*, given that most slabs are built of compound pages,
         virt_to_head_page() will hit the unlikely branch on every call.
         This check cost at least 20 Mbps in test scenarios, so it seems
         better _not_ to do this.
      
      Since v5 [4]:
       - revert flags-to-bool conversion and simplify flags testing in
         __alloc_skb() (Alexander Duyck).
      
      Since v4 [3]:
       - rebase on top of net-next and address kernel build robot issue;
       - reorder checks a bit in __alloc_skb() to make new condition even
         more harmless.
      
      Since v3 [2]:
       - make the feature mostly optional, so driver developers could
         decide whether to use it or not (Paolo Abeni).
         This reuses the old flag for __alloc_skb() and introduces
         a new napi_build_skb();
       - reduce bulk-allocation size from 32 to 16 elements (also Paolo).
         This equals the value of XDP's devmap and veth batch processing
         (which were tested a lot) and should be sane enough;
       - don't waste cycles on explicit in_serving_softirq() check.
      
      Since v2 [1]:
       - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy
         after the changes that pass tiny skbs requests to kmalloc layer);
       - cover the cache with KASAN instrumentation (suggested by Eric
         Dumazet, help of Dmitry Vyukov);
       - completely drop redundant __kfree_skb_flush() (also Eric);
       - lots of code cleanups;
       - expand the commit message with NUMA and pfmemalloc points (Jakub).
      
      Since v1 [0]:
       - use one unified cache instead of two separate ones to greatly
         simplify the logic and reduce hotpath overhead (Edward Cree);
       - new: also recycle GRO_MERGED_FREE skbs instead of freeing them
         immediately;
       - correct performance numbers after optimizations and performing
         lots of tests for different use cases.
      
      [0] https://lore.kernel.org/netdev/20210111182655.12159-1-alobakin@pm.me
      [1] https://lore.kernel.org/netdev/20210113133523.39205-1-alobakin@pm.me
      [2] https://lore.kernel.org/netdev/20210209204533.327360-1-alobakin@pm.me
      [3] https://lore.kernel.org/netdev/20210210162732.80467-1-alobakin@pm.me
      [4] https://lore.kernel.org/netdev/20210211185220.9753-1-alobakin@pm.me
      ====================
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4762993
    • Alexander Lobakin's avatar
      skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing · 9243adfc
      Alexander Lobakin authored
      napi_frags_finish() and napi_skb_finish() can only be called inside
      NAPI Rx context, so we can feed the NAPI cache with skbuff_heads that
      got the NAPI_MERGED_FREE verdict instead of freeing them immediately.
      Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
      and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
      to the NAPI cache.
      As many drivers call napi_alloc_skb()/napi_get_frags() on their
      receive path, this becomes especially useful.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9243adfc
    • Alexander Lobakin's avatar
      skbuff: allow to use NAPI cache from __napi_alloc_skb() · cfb8ec65
      Alexander Lobakin authored
      {,__}napi_alloc_skb() is mostly used for optional non-linear
      receive methods (usually controlled via Ethtool private flags and off
      by default) and/or for Rx copybreaks.
      Use __napi_build_skb() here to obtain skbuff_heads from the NAPI cache
      instead of allocating them in place. This covers both the kmalloc and
      page frag paths.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfb8ec65
    • Alexander Lobakin's avatar
      skbuff: allow to optionally use NAPI cache from __alloc_skb() · d13612b5
      Alexander Lobakin authored
      Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
      an skbuff_head from the NAPI cache instead of allocating it in place
      inside __alloc_skb().
      This implies that the function is called from softirq or BH-off
      context, and not for allocating a clone or from a distant node.
      
      Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d13612b5
    • Alexander Lobakin's avatar
      skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads · f450d539
      Alexander Lobakin authored
      Instead of just bulk-flushing skbuff_heads queued up through
      napi_consume_skb() or __kfree_skb_defer(), try to reuse them
      on the allocation path.
      If the cache is empty on allocation, bulk-allocate the first
      16 elements, which is more efficient than per-skb allocation.
      If the cache is full on freeing, bulk-wipe the second half of
      the cache (32 elements).
      This also includes custom KASAN poisoning/unpoisoning to be
      doubly sure there are no use-after-free cases.
      
      To keep the current behaviour unchanged, introduce a new function,
      napi_build_skb(), so that drivers can opt in to the new approach
      later.

      Note on selected bulk size, 16:
       - this equals XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE
         and especially VETH_XDP_BATCH, which is also used to
         bulk-allocate skbuff_heads and was tested on powerful
         setups;
       - this also showed the best performance in the actual
         test series (from the array of {8, 16, 32}).
      
      Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves
      Suggested-by: Eric Dumazet <edumazet@google.com>   # KASAN poisoning
      Cc: Dmitry Vyukov <dvyukov@google.com>             # Help with KASAN
      Cc: Paolo Abeni <pabeni@redhat.com>                # Reduced batch size
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f450d539
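The refill-on-empty and flush-half-on-full behaviour described in the commit above can be modelled in userspace C. This is a sketch under assumptions, not the kernel implementation: the names are hypothetical, `malloc()`/`free()` stand in for the slab bulk operations, the 64-slot capacity is inferred from "second half of the cache (32 elements)", and KASAN poisoning and per-CPU handling are left out.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SIZE 64               /* inferred: twice the 32-element half */
#define CACHE_BULK 16               /* refill size named in the commit */
#define CACHE_HALF (CACHE_SIZE / 2)

struct head { char pad[64]; };      /* stand-in for an skbuff_head */

struct head_cache {
	unsigned int count;
	void *slots[CACHE_SIZE];
};

/* Take a head from the cache, bulk-refilling it first if it is empty. */
static void *cache_get(struct head_cache *c)
{
	if (!c->count) {
		while (c->count < CACHE_BULK) {
			void *h = malloc(sizeof(struct head));

			if (!h)
				break;
			c->slots[c->count++] = h;
		}
		if (!c->count)
			return NULL;
	}
	return c->slots[--c->count];
}

/* Return a head to the cache; when it becomes full, free the second half. */
static void cache_put(struct head_cache *c, void *h)
{
	unsigned int i;

	c->slots[c->count++] = h;
	if (c->count == CACHE_SIZE) {
		for (i = CACHE_HALF; i < CACHE_SIZE; i++)
			free(c->slots[i]);
		c->count = CACHE_HALF;
	}
}
```

Keeping half of the entries after a flush means a burst of frees is followed by up to 32 allocations that need no trip to the allocator.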
    • Alexander Lobakin's avatar
      skbuff: move NAPI cache declarations upper in the file · 50fad4b5
      Alexander Lobakin authored
      NAPI cache structures will be used for allocating skbuff_heads,
      so move their declarations higher up in the file.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50fad4b5
    • Alexander Lobakin's avatar
      skbuff: remove __kfree_skb_flush() · fec6e49b
      Alexander Lobakin authored
      This function isn't really needed, as the NAPI skb queue gets
      bulk-freed anyway when there's no more room, and it may even reduce
      the efficiency of bulk operations.
      It will be even less needed once the skb cache is reused on the
      allocation path, so remove it and lighten network softirqs a bit.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fec6e49b
    • Alexander Lobakin's avatar
      skbuff: use __build_skb_around() in __alloc_skb() · f9d6725b
      Alexander Lobakin authored
      Just call __build_skb_around() instead of open-coding it.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9d6725b
    • Alexander Lobakin's avatar
      skbuff: simplify __alloc_skb() a bit · df1ae022
      Alexander Lobakin authored
      Use unlikely() annotations for skbuff_head and data, as in the
      two other allocation functions, and remove a totally redundant goto.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df1ae022
    • Alexander Lobakin's avatar
      skbuff: make __build_skb_around() return void · 483126b3
      Alexander Lobakin authored
      __build_skb_around() can never fail and always returns the passed skb.
      Make it return void to simplify and optimize the code.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      483126b3