1. 07 Feb, 2023 4 commits
  2. 06 Feb, 2023 36 commits
    • David S. Miller's avatar
      Merge branch 'tuntap-socket-uid' · c21adf25
      David S. Miller authored
      Pietro Borrello says:
      
      ====================
      tuntap: correctly initialize socket uid
      
      sock_init_data() assumes that the `struct socket` passed in input is
      contained in a `struct socket_alloc` allocated with sock_alloc().
      However, tap_open() and tun_chr_open() pass a `struct socket` embedded
      in a `struct tap_queue` and `struct tun_file` respectively, both
      allocated with sk_alloc().
      This causes a type confusion when issuing a container_of() with
      SOCK_INODE() in sock_init_data() which results in assigning a wrong
      sk_uid to the `struct sock` in input.
      
      Due to the type confusion, both sockets happen to have their uid set
      to 0, i.e. root.
      While it will be often correct, as tuntap devices require
      CAP_NET_ADMIN, it may not always be the case.
      Not sure how widespread is the impact of this, it seems the socket uid
      may be used for network filtering and routing, thus tuntap sockets may
      be incorrectly managed.
      Additionally, it seems a socket with an incorrect uid may be returned
      to the vhost driver when issuing a get_socket() on a tuntap device in
      vhost_net_set_backend().
      
      Fix the bugs by adding and using sock_init_data_uid(), which
      explicitly takes a uid as argument.
      Signed-off-by: default avatarPietro Borrello <borrello@diag.uniroma1.it>
      ---
      Changes in v3:
      - Fix the bug by defining and using sock_init_data_uid()
      - Link to v2: https://lore.kernel.org/r/20230131-tuntap-sk-uid-v2-0-29ec15592813@diag.uniroma1.it
      
      Changes in v2:
      - Shorten and format comments
      - Link to v1: https://lore.kernel.org/r/20230131-tuntap-sk-uid-v1-0-af4f9f40979d@diag.uniroma1.it
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c21adf25
    • Pietro Borrello's avatar
      tap: tap_open(): correctly initialize socket uid · 66b2c338
      Pietro Borrello authored
      sock_init_data() assumes that the `struct socket` passed in input is
      contained in a `struct socket_alloc` allocated with sock_alloc().
      However, tap_open() passes a `struct socket` embedded in a `struct
      tap_queue` allocated with sk_alloc().
      This causes a type confusion when issuing a container_of() with
      SOCK_INODE() in sock_init_data() which results in assigning a wrong
      sk_uid to the `struct sock` in input.
      On default configuration, the type confused field overlaps with
      padding bytes between `int vnet_hdr_sz` and `struct tap_dev __rcu
      *tap` in `struct tap_queue`, which makes the uid of all tap sockets 0,
      i.e., the root one.
      Fix the assignment by using sock_init_data_uid().
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: default avatarPietro Borrello <borrello@diag.uniroma1.it>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66b2c338
    • Pietro Borrello's avatar
      tun: tun_chr_open(): correctly initialize socket uid · a096ccca
      Pietro Borrello authored
      sock_init_data() assumes that the `struct socket` passed in input is
      contained in a `struct socket_alloc` allocated with sock_alloc().
      However, tun_chr_open() passes a `struct socket` embedded in a `struct
      tun_file` allocated with sk_alloc().
      This causes a type confusion when issuing a container_of() with
      SOCK_INODE() in sock_init_data() which results in assigning a wrong
      sk_uid to the `struct sock` in input.
      On default configuration, the type confused field overlaps with the
      high 4 bytes of `struct tun_struct __rcu *tun` of `struct tun_file`,
      NULL at the time of call, which makes the uid of all tun sockets 0,
      i.e., the root one.
      Fix the assignment by using sock_init_data_uid().
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: default avatarPietro Borrello <borrello@diag.uniroma1.it>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a096ccca
    • Pietro Borrello's avatar
      net: add sock_init_data_uid() · 584f3742
      Pietro Borrello authored
      Add sock_init_data_uid() to explicitly initialize the socket uid.
      To initialise the socket uid, sock_init_data() assumes a the struct
      socket* sock is always embedded in a struct socket_alloc, used to
      access the corresponding inode uid. This may not be true.
      Examples are sockets created in tun_chr_open() and tap_open().
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: default avatarPietro Borrello <borrello@diag.uniroma1.it>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      584f3742
    • David S. Miller's avatar
      Merge branch 'ENETC-mqprio-taprio-cleanup' · b601135e
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      net: ENETC mqprio/taprio cleanup
      
      Please excuse the increased patch set size compared to v4's 15 patches,
      but Claudiu stirred up the pot :) when he pointed out that the mqprio
      TXQ validation procedure is still incorrect, so I had to fix that, and
      then do some consolidation work so that taprio doesn't duplicate
      mqprio's bugs. Compared to v4, 3 patches are new and 1 was dropped for now
      ("net/sched: taprio: mask off bits in gate mask that exceed number of TCs"),
      since there's not really much to gain from it. Since the previous patch
      set has largely been reviewed, I hope that a delta overview will help
      and make up for the large size.
      
      v4->v5:
      - new patches:
        "[08/17] net/sched: mqprio: allow reverse TC:TXQ mappings"
        "[11/17] net/sched: taprio: centralize mqprio qopt validation"
        "[12/17] net/sched: refactor mqprio qopt reconstruction to a library function"
      - changed patches worth revisiting:
        "[09/17] net/sched: mqprio: allow offloading drivers to request queue
        count validation"
      v4 at:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230130173145.475943-1-vladimir.oltean@nxp.com/
      
      v3->v4:
      - adjusted patch 07/15 to not remove "#include <net/pkt_sched.h>" from
        ti cpsw
      https://patchwork.kernel.org/project/netdevbpf/cover/20230127001516.592984-1-vladimir.oltean@nxp.com/
      
      v2->v3:
      - move min_num_stack_tx_queues definition so it doesn't conflict with
        the ethtool mm patches I haven't submitted yet for enetc (and also to
        make use of a 4 byte hole)
      - warn and mask off excess TCs in gate mask instead of failing
      - finally CC qdisc maintainers
      v2 at:
      https://patchwork.kernel.org/project/netdevbpf/patch/20230126125308.1199404-16-vladimir.oltean@nxp.com/
      
      v1->v2:
      - patches 1->4 are new
      - update some header inclusions in drivers
      - fix typo (said "taprio" instead of "mqprio")
      - better enetc mqprio error handling
      - dynamically reconstruct mqprio configuration in taprio offload
      - also let stmmac and tsnep use per-TXQ gate_mask
      v1 (RFC) at:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230120141537.1350744-1-vladimir.oltean@nxp.com/
      
      The main goal of this patch set is to make taprio pass the mqprio queue
      configuration structure down to ndo_setup_tc() - patch 13/17. But mqprio
      itself is not in the best shape currently, so there are some
      consolidation patches on that as well.
      
      Next, there are some consolidation patches in the enetc driver's
      handling of TX queues and their traffic class assignment. Then, there is
      a consolidation between the TX queue configuration for mqprio and
      taprio.
      
      Finally, there is a change in the meaning of the gate_mask passed by
      taprio through ndo_setup_tc(). We introduce a capability through which
      drivers can request the gate mask to be per TXQ. The default is changed
      so that it is per TC.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b601135e
    • Vladimir Oltean's avatar
      net: enetc: act upon mqprio queue config in taprio offload · 06b1c911
      Vladimir Oltean authored
      We assume that the mqprio queue configuration from taprio has a simple
      1:1 mapping between prio and traffic class, and one TX queue per TC.
      That might not be the case. Actually parse and act upon the mqprio
      config.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      06b1c911
    • Vladimir Oltean's avatar
      net: enetc: act upon the requested mqprio queue configuration · 1a353111
      Vladimir Oltean authored
      Regardless of the requested queue count per traffic class, the enetc
      driver allocates a number of TX rings equal to the number of TCs, and
      hardcodes a queue configuration of "1@0 1@1 ... 1@max-tc". Other
      configurations are silently ignored and treated the same.
      
      Improve that by allowing what the user requests to be actually
      fulfilled. This allows more than one TX ring per traffic class.
      For example:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 4 \
      	map 0 0 1 1 2 2 3 3 queues 2@0 2@2 2@4 2@6
      [  146.267648] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
      [  146.273451] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
      [  146.283280] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 1
      [  146.293987] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 1
      [  146.300467] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 2
      [  146.306866] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 2
      [  146.313261] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 3
      [  146.319622] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 3
      $ tc qdisc del dev eno0 root
      [  178.238418] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
      [  178.244369] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
      [  178.251486] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 0
      [  178.258006] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 0
      [  178.265038] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 0
      [  178.271557] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 0
      [  178.277910] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 0
      [  178.284281] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 0
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
      [  186.113162] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
      [  186.118764] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 1
      [  186.124374] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 2
      [  186.130765] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 3
      [  186.136404] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 4
      [  186.142049] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 5
      [  186.147674] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 6
      [  186.153305] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 7
      
      The driver used to set TC_MQPRIO_HW_OFFLOAD_TCS, near which there is
      this comment in the UAPI header:
      
              TC_MQPRIO_HW_OFFLOAD_TCS,       /* offload TCs, no queue counts */
      
      which is what enetc was doing up until now (and no longer is; we offload
      queue counts too), remove that assignment.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1a353111
    • Vladimir Oltean's avatar
      net: enetc: request mqprio to validate the queue counts · 735ef62c
      Vladimir Oltean authored
      The enetc driver does not validate the mqprio queue configuration, so it
      currently allows things like this:
      
      $ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
      
      But also things like this, completely omitting the queue configuration:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 hw 1
      
      By requesting validation via the mqprio capability structure, this is no
      longer allowed, and we bring what is accepted by hardware in line with
      what is accepted by software.
      
      The check that num_tc <= real_num_tx_queues also becomes superfluous and
      can be dropped, because mqprio_validate_queue_counts() validates that no
      TXQ range exceeds real_num_tx_queues. That is a stronger check, because
      there is at least 1 TXQ per TC, so there are at least as many TXQs as TCs.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      735ef62c
    • Vladimir Oltean's avatar
      net/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpsw · 522d15ea
      Vladimir Oltean authored
      There are 2 classes of in-tree drivers currently:
      
      - those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
        holds a bit mask of TXQs
      
      - those who act upon the gate_mask as if it holds a bit mask of TCs
      
      When it comes to the standard, IEEE 802.1Q-2018 does say this in the
      second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:
      
      | A gate control list associated with each Port contains an ordered list
      | of gate operations. Each gate operation changes the transmission gate
      | state for the gate associated with each of the Port's traffic class
      | queues and allows associated control operations to be scheduled.
      
      In typically obtuse language, it refers to a "traffic class queue"
      rather than a "traffic class" or a "queue". But careful reading of
      802.1Q clarifies that "traffic class" and "queue" are in fact
      synonymous (see 8.6.6 Queuing frames):
      
      | A queue in this context is not necessarily a single FIFO data structure.
      | A queue is a record of all frames of a given traffic class awaiting
      | transmission on a given Bridge Port. The structure of this record is not
      | specified.
      
      i.o.w. their definition of "queue" isn't the Linux TX queue.
      
      The gate_mask really is input into taprio via its UAPI as a mask of
      traffic classes, but taprio_sched_to_offload() converts it into a TXQ
      mask.
      
      The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:
      
      - hellcreek, felix, sja1105: these are DSA switches, it's not even very
        clear what TXQs correspond to, other than purely software constructs.
        Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
        So it's fine to convert these to a gate mask per TC.
      
      - enetc: I have the hardware and can confirm that the gate mask is per
        TC, and affects all TXQs (BD rings) configured for that priority.
      
      - igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
        to be per-TXQ.
      
      - tsnep: Gerhard Engleder clarifies that even though this hardware
        supports at most 1 TXQ per TC, the TXQ indices may be different from
        the TC values themselves, and it is the TXQ indices that matter to
        this hardware. So keep it per-TXQ as well.
      
      - stmmac: I have a GMAC datasheet, and in the EST section it does
        specify that the gate events are per TXQ rather than per TC.
      
      - lan966x: again, this is a switch, and while not a DSA one, the way in
        which it implements lan966x_mqprio_add() - by only allowing num_tc ==
        NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
        software construct here as well. They seem to map 1:1 with TCs.
      
      - am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
        impression that the fetch_allow variable is treated like a prio_mask.
        This definitely sounds closer to a per-TC gate mask rather than a
        per-TXQ one, and TI documentation does seem to recomment an identity
        mapping between TCs and TXQs. However, Roger Quadros would like to do
        some testing before making changes, so I'm leaving this driver to
        operate as it did before, for now. Link with more details at the end.
      
      Based on this breakdown, we have 5 drivers with a gate mask per TC and
      4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
      opt-in and the gate mask per TC the default.
      
      Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and
      query the device driver before calling the proper ndo_setup_tc(), and
      figure out if it expects one or the other format.
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
      Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
      Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
      Cc: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: default avatarGerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      522d15ea
    • Vladimir Oltean's avatar
      net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc() · 09c794c0
      Vladimir Oltean authored
      The taprio qdisc does not currently pass the mqprio queue configuration
      down to the offloading device driver. So the driver cannot act upon the
      TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
      assumed that the driver only wants to offload num_tc (see
      TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
      but there's clearly more to the mqprio configuration than that.
      
      I've considered 2 mechanisms to remedy that. First is to pass a struct
      tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
      is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
      TC_SETUP_QDISC_TAPRIO.
      
      The difference is that in the first case, existing drivers (offloading
      or not) all ignore taprio's mqprio portion currently, whereas in the
      second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
      based on a new capability. The question is which approach would be
      better.
      
      I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
      on a taprio capability bit) would risk introducing regressions. For
      example, taprio doesn't populate (or validate) qopt->hw, as well as
      mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.
      
      In comparison, adding a capability is functionally equivalent to just
      passing the mqprio in a way that drivers can ignore it, except it's
      slightly more complicated to use it (need to set the capability).
      
      Ultimately, what made me go for the "mqprio in taprio" variant was that
      it's easier for offloading drivers to interpret the mqprio qopt slightly
      differently when it comes from taprio vs when it comes from mqprio,
      should that ever become necessary.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      09c794c0
    • Vladimir Oltean's avatar
      net/sched: refactor mqprio qopt reconstruction to a library function · 9dd6ad67
      Vladimir Oltean authored
      The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
      netdev settings once more in a future patch, but this code was already
      written twice, once in taprio and once in mqprio.
      
      Refactor the code to a helper in the common mqprio library.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9dd6ad67
    • Vladimir Oltean's avatar
      net/sched: taprio: centralize mqprio qopt validation · 1dfe086d
      Vladimir Oltean authored
      There is a lot of code in taprio which is "borrowed" from mqprio.
      It makes sense to put a stop to the "borrowing" and start actually
      reusing code.
      
      Because taprio and mqprio are built as part of different kernel modules,
      code reuse can only take place either by writing it as static inline
      (limiting), putting it in sch_generic.o (not generic enough), or
      creating a third auto-selectable kernel module which only holds library
      code. I opted for the third variant.
      
      In a previous change, mqprio gained support for reverse TC:TXQ mappings,
      something which taprio still denies. Make taprio use the same validation
      logic so that it supports this configuration as well.
      
      The taprio code didn't enforce TXQ overlaps in txtime-assist mode and
      that looks intentional, even if I've no idea why that might be. Preserve
      that, but add a comment.
      
      There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
      update there.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Reviewed-by: default avatarGerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1dfe086d
    • Vladimir Oltean's avatar
      net/sched: mqprio: add extack messages for queue count validation · d404959f
      Vladimir Oltean authored
      To make mqprio more user-friendly, create netlink extended ack messages
      which say exactly what is wrong about the queue counts. This uses the
      new support for printf-formatted extack messages.
      
      Example:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d404959f
    • Vladimir Oltean's avatar
      net/sched: mqprio: allow offloading drivers to request queue count validation · 19278d76
      Vladimir Oltean authored
      mqprio_parse_opt() proudly has a comment:
      
      	/* If hardware offload is requested we will leave it to the device
      	 * to either populate the queue counts itself or to validate the
      	 * provided queue counts.
      	 */
      
      Unfortunately some device drivers did not get this memo, and don't
      validate the queue counts, or populate them.
      
      In case drivers don't want to populate the queue counts themselves, just
      act upon the requested configuration, it makes sense to introduce a tc
      capability, and make mqprio query it, so they don't have to do the
      validation themselves.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      19278d76
    • Vladimir Oltean's avatar
      net/sched: mqprio: allow reverse TC:TXQ mappings · d7045f52
      Vladimir Oltean authored
      By imposing that the last TXQ of TC i is smaller than the first TXQ of
      any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
      the TXQ indices (they must increase as TCs increase).
      
      Claudiu points out that the complexity of the TXQ count validation is
      too high for this logic, i.e. instead of iterating over j, it is
      sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
      will eventually ensure global ordering.
      
      This is true, however it doesn't appear to me that is what the code
      really intended to do. Instead, based on the comments, it just wanted to
      check for overlaps (and this isn't how one does that).
      
      So the following mqprio configuration, which I had recommended to
      Vinicius more than once for igb/igc (to account for the fact that on
      this hardware, lower numbered TXQs have higher dequeue priority than
      higher ones):
      
      num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0
      
      is in fact denied today by mqprio.
      
      The full story is that in fact, it's only denied with "hw 0"; if
      hardware offloading is requested, mqprio defers TXQ range overlap
      validation to the device driver (a strange decision in itself).
      
      This is most certainly a bug, but it's not one that has any merit for
      being fixed on "stable" as far as I can tell. This is because mqprio
      always rejected a configuration which was in fact valid, and this has
      shaped the way in which mqprio configuration scripts got built for
      various hardware (see igb/igc in the link below). Therefore, one could
      consider it to be merely an improvement for mqprio to allow reverse
      TC:TXQ mappings.
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Reviewed-by: default avatarGerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7045f52
    • Vladimir Oltean's avatar
      net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h · 9adafe2b
      Vladimir Oltean authored
      Since mqprio is a scheduler and not a classifier, move its offload
      structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies.
      
      Also update some header inclusions in drivers that access this
      structure, to the best of my abilities.
      
      Cc: Igor Russkikh <irusskikh@marvell.com>
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
      Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
      Cc: Saeed Mahameed <saeedm@nvidia.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: Daniel Machon <daniel.machon@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9adafe2b
    • Vladimir Oltean's avatar
      net/sched: mqprio: refactor offloading and unoffloading to dedicated functions · 5cfb45e2
      Vladimir Oltean authored
      Some more logic will be added to mqprio offloading, so split that code
      up from mqprio_init(), which is already large, and create a new
      function, mqprio_enable_offload(), similar to taprio_enable_offload().
      Also create the opposite function mqprio_disable_offload().
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5cfb45e2
    • Vladimir Oltean's avatar
      net/sched: mqprio: refactor nlattr parsing to a separate function · feb2cf3d
      Vladimir Oltean authored
      mqprio_init() is quite large and unwieldy to add more code to.
      Split the netlink attribute parsing to a dedicated function.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      feb2cf3d
    • Praveen Kaligineedi's avatar
      gve: Fix gve interrupt names · 84371145
      Praveen Kaligineedi authored
      IRQs are currently requested before the netdevice is registered
      and a proper name is assigned to the device. Changing interrupt
      name to avoid using the format string in the name.
      
      Interrupt name before change: eth%d-ntfy-block.<blk_id>
      Interrupt name after change: gve-ntfy-blk<blk_id>@pci:<pci_name>
      Signed-off-by: default avatarPraveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: default avatarJeroen de Borst <jeroendb@google.com>
      Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      84371145
    • David S. Miller's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · d78f8d83
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      net: implement devlink reload in ice
      
      Michal Swiatkowski says:
      
      This is a part of changes done in patchset [0]. Resource management is
      kind of controversial part, so I split it into two patchsets.
      
      It is the first one, covering refactor and implement reload API call.
      The refactor will unblock some of the patches needed by SIOV or
      subfunction.
      
      Most of this patchset is about implementing driver reload mechanism.
      Part of code from probe and rebuild is used to not duplicate code.
      To allow this reuse probe and rebuild path are split into smaller
      functions.
      
      Patch "ice: split ice_vsi_setup into smaller functions" changes
      boolean variable in function call to integer and adds define
      for it. Instead of having the function called with true/false now it
      can be called with readable defines ICE_VSI_FLAG_INIT or
      ICE_VSI_FLAG_NO_INIT. It was suggested by Jacob Keller and probably this
      mechanism will be implemented across ice driver in follow up patchset.
      
      Previously the code was reviewed here [0].
      
      [0] https://lore.kernel.org/netdev/Y3ckRWtAtZU1BdXm@unreal/T/#m3bb8feba0a62f9b4cd54cd94917b7e2143fc2ecd
      
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d78f8d83
    • Jesper Dangaard Brouer's avatar
      net: introduce skb_poison_list and use in kfree_skb_list · 9dde0cd3
      Jesper Dangaard Brouer authored
      First user of skb_poison_list is in kfree_skb_list_reason, to catch bugs
      earlier like introduced in commit eedade12 ("net: kfree_skb_list use
      kmem_cache_free_bulk"). For completeness mentioned bug have been fixed in
      commit f72ff8b8 ("net: fix kfree_skb_list use of skb_mark_not_on_list").
      
      In case of a bug like mentioned commit we would have seen OOPS with:
       general protection fault, probably for non-canonical address 0xdead000000000870
      And content of one the registers e.g. R13: dead000000000800
      
      In this case skb->len is at offset 112 bytes (0x70) why fault happens at
       0x800+0x70 = 0x870
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9dde0cd3
    • David S. Miller's avatar
      Merge branch 'wangxun-interrupts' · 149e8fb0
      David S. Miller authored
      Jiawen Wu says:
      
      ====================
      Wangxun interrupt and RxTx support
      
      Configure interrupt, setup RxTx ring, support to receive and transmit
      packets.
      
      change log:
      v3:
      - Use upper_32_bits() to avoid compile warning.
      - Remove useless codes.
      v2:
      - Andrew Lunn: https://lore.kernel.org/netdev/Y86kDphvyHj21IxK@lunn.ch/
      - Add a judgment when allocate dma for descriptor.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      149e8fb0
    • Mengyuan Lou's avatar
      net: ngbe: Support Rx and Tx process path · b97f955e
      Mengyuan Lou authored
      Add enable and disable operation process for ngbe open/close.
      Clean Rx and Tx ring interrupts, process packets in the data path.
      Signed-off-by: default avatarMengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b97f955e
    • Jiawen Wu's avatar
      net: txgbe: Support Rx and Tx process path · 0d22be52
      Jiawen Wu authored
      Clean Rx and Tx ring interrupts, process packets in the data path.
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0d22be52
    • Mengyuan Lou's avatar
      net: libwx: Add tx path to process packets · 09a50880
      Mengyuan Lou authored
      Support to transmit packets without hardware features.
      Signed-off-by: default avatarMengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      09a50880
    • Jiawen Wu's avatar
      net: libwx: Support to receive packets in NAPI · 3c47e8ae
      Jiawen Wu authored
      Clean all queues associated with a q_vector, to simple receive packets
      without hardware features.
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3c47e8ae
    • Jiawen Wu's avatar
      net: txgbe: Setup Rx and Tx ring · 0ef7e159
      Jiawen Wu authored
      Improve the configuration of Rx and Tx ring, set Rx flags and implement
      ndo_set_rx_mode ops.
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0ef7e159
    • Jiawen Wu's avatar
      net: libwx: Allocate Rx and Tx resources · 850b9711
      Jiawen Wu authored
      Setup Rx and Tx descriptors for specefic rings.
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      850b9711
    • Jiawen Wu's avatar
      net: libwx: Configure Rx and Tx unit on hardware · 18b5b8a9
      Jiawen Wu authored
      Configure hardware for preparing to process packets. Including configure
      receive and transmit unit of the MAC layer, and setup the specific rings.
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      18b5b8a9
    • Jiawen Wu's avatar
      net: txgbe: Add interrupt support · 5d3ac705
      Jiawen Wu authored
      Determine proper interrupt scheme to enable and handle interrupt.
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5d3ac705
    • Mengyuan Lou's avatar
      net: ngbe: Add irqs request flow · e7956139
      Mengyuan Lou authored
      Add request_irq for tx/rx rings and misc other events.
      If the application is successful, config vertors for interrupts.
      Enable some base interrupts mask in ngbe_irq_enable.
      Signed-off-by: default avatarMengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7956139
    • Mengyuan Lou's avatar
      net: libwx: Add irq flow functions · 3f703186
      Mengyuan Lou authored
      Add irq flow functions for ngbe and txgbe.
      Alloc pcie msix irqs for drivers, otherwise fall back to msi/legacy.
      Signed-off-by: default avatarMengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: default avatarJiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3f703186
    • Qingfang DENG's avatar
      net: page_pool: use in_softirq() instead · 542bcea4
      Qingfang DENG authored
      We use BH context only for synchronization, so we don't care if it's
      actually serving softirq or not.
      
      As a side node, in case of threaded NAPI, in_serving_softirq() will
      return false because it's in process context with BH off, making
      page_pool_recycle_in_cache() unreachable.
      Signed-off-by: default avatarQingfang DENG <qingfang.deng@siflower.com.cn>
      Tested-by: default avatarFelix Fietkau <nbd@nbd.name>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      542bcea4
    • David S. Miller's avatar
      Merge tag 'mlx5-updates-2023-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 637bc8f0
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2023-02-04
      
      This series provides misc updates to mlx5 driver:
      
      1) Trivial LAG code cleanup patches from Roi
      
      2) Rahul improves mlx5's documentation structure
      Separates the documentation into multiple pages related to different
      components in the device driver. Adds Kconfig parameters, devlink
      parameters, and tracepoints that were previously introduced but not added
      to the documentation. Introduces a new page on ethtool statistics counters
      with information about counters previously implemented in the mlx5_core
      driver but not documented in the kernel tree.
      
      3) From Raed, policy/state selector support for IPSec.
      
      4) From Fragos, add support for XDR speed in IPoIB mlx5 netdev
      
      5) Few more misc cleanups and trivial changes
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      637bc8f0
    • Parav Pandit's avatar
      virtio-net: Maintain reverse cleanup order · 27369c9c
      Parav Pandit authored
      To easily audit the code, better to keep the device stop()
      sequence to be mirror of the device open() sequence.
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarParav Pandit <parav@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      27369c9c
    • David S. Miller's avatar
      Merge branch 'bridge-mdb-limit' · cb3086ce
      David S. Miller authored
      Petr Machata says:
      
      ====================
      bridge: Limit number of MDB entries per port, port-vlan
      
      The MDB maintained by the bridge is limited. When the bridge is configured
      for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
      capacity. In SW datapath, the capacity is configurable through the
      IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
      similar limit exists in the HW datapath for purposes of offloading.
      
      In order to prevent the issue of unilateral exhaustion of MDB resources,
      introduce two parameters in each of two contexts:
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN number of MDB entries that the port is member in.
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN maximum permitted number of MDB entries, or 0 for
        no limit.
      
      Per-port number of entries keeps track of the total number of MDB entries
      configured on a given port. The per-port-VLAN value then keeps track of the
      subset of MDB entries configured specifically for the given VLAN, on that
      port. The number is adjusted as port_groups are created and deleted, and
      therefore under multicast lock.
      
      A maximum value, if non-zero, then places a limit on the number of entries
      that can be configured in a given context. Attempts to add entries above
      the maximum are rejected.
      
      Rejection reason of netlink-based requests to add MDB entries is
      communicated through extack. This channel is unavailable for rejections
      triggered from the control path. To address this lack of visibility, the
      patchset adds a tracepoint, bridge:br_mdb_full:
      
      	# perf record -e bridge:br_mdb_full &
      	# [...]
      	# perf script | cut -d: -f4-
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
      	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
      	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
      	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
      
      Another option to consume the tracepoint is e.g. through the bpftrace tool:
      
      	# bpftrace -e ' tracepoint:bridge:br_mdb_full /args->af != 0/ {
      			    printf("dev %s src %s grp %s vid %u\n",
      				   str(args->dev), ntop(args->src),
      				   ntop(args->grp), args->vid);
      			}
      			tracepoint:bridge:br_mdb_full /args->af == 0/ {
      			    printf("dev %s grp %s vid %u\n",
      				   str(args->dev),
      				   macaddr(args->grpmac), args->vid);
      			}'
      
      This tracepoint is triggered for mcast_hash_max exhaustions as well.
      
      The following is an example of how the feature is used. A more extensive
      example is available in patch #8:
      
      	# bridge vlan set dev v1 vid 1 mcast_max_groups 1
      	# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
      	# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
      	Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
      
      The patchset progresses as follows:
      
      - In patch #1, set strict_start_type at two bridge-related policies. The
        reason is we are adding a new attribute to one of these, and want the new
        attribute to be parsed strictly. The other was adjusted for completeness'
        sake.
      
      - In patches #2 to #5, br_mdb and br_multicast code is adjusted to make the
        following additions smoother.
      
      - In patch #6, add the tracepoint.
      
      - In patch #7, the code to maintain number of MDB entries is added as
        struct net_bridge_mcast_port::mdb_n_entries. The maximum is added, too,
        as struct net_bridge_mcast_port::mdb_max_entries, however at this point
        there is no way to set the value yet, and since 0 is treated as "no
        limit", the functionality doesn't change at this point. Note however,
        that mcast_hash_max violations already do trigger at this point.
      
      - In patch #8, netlink plumbing is added: reading of number of entries, and
        reading and writing of maximum.
      
        The per-port values are passed through RTM_NEWLINK / RTM_GETLINK messages
        in IFLA_BRPORT_MCAST_N_GROUPS and _MAX_GROUPS, inside IFLA_PROTINFO nest.
      
        The per-port-vlan values are passed through RTM_GETVLAN / RTM_NEWVLAN
        messages in BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS, _MAX_GROUPS, inside
        BRIDGE_VLANDB_ENTRY.
      
      The following patches deal with the selftest:
      
      - Patches #9 and #10 clean up and move around some selftest code.
      
      - Patches #11 to #14 add helpers and generalize the existing IGMP / MLD
        support to allow generating packets with configurable group addresses and
        varying source lists for (S,G) memberships.
      
      - Patch #15 adds code to generate IGMP leave and MLD done packets.
      
      - Patch #16 finally adds the selftest itself.
      
      v3:
      - Patch #7:
          - Access mdb_max_/_n_entries through READ_/WRITE_ONCE
          - Move extack setting to br_multicast_port_ngroups_inc_one().
            Since we use NL_SET_ERR_MSG_FMT_MOD, the correct context
            (port / port-vlan) can be passed through an argument.
            This also removes the need for more READ/WRITE_ONCE's
            at the extack-setting site.
      - Patch #8:
          - Move the br_multicast_port_ctx_vlan_disabled() check
            out to the _vlan_ helpers callers. Thus these helpers
            cannot fail, which makes them very similar to the
            _port_ helpers. Have them take the MC context directly
            and unify them.
      
      v2:
      - Cover letter:
          - Add an example of a bpftrace-based probe script
      - Patch #6:
          - Report IPv4 as an IPv6-mapped address through the IPv6 buffer
            as well, to save ring buffer space.
      - Patch #7:
          - In br_multicast_port_ngroups_inc_one(), bounce
            if n>=max, not if n==max
          - Adjust extack messages to mention ngroups, now
            that the bounces appear when n>=max, not n==max
          - In __br_multicast_enable_port_ctx(), do not reset
            max to 0. Also do not count number of entries by
            going through _inc, as that would end up incorrectly
            bouncing the entries.
      - Patch #8:
          - Drop locks around accesses in
            br_multicast_{port,vlan}_ngroups_{get,set_max}(),
          - Drop bounces due to max<n in
            br_multicast_{port,vlan}_ngroups_set_max().
      - Patch #12:
          - In the comment at payload_template_calc_checksum(),
            s/%#02x/%02x/, that's the mausezahn payload format.
      - Patch #16:
          - Adjust the tests that check setting max below n and
            reset of max on VLAN snooping enablement
          - Make test naming uniform
          - Enable testing of control path (IGMP/MLD) in
            mcast_vlan_snooping bridge
          - Reorganize the code so that test instances (per bridge
            type and configuration type) always come right after
            the test, in order of {d,q,qvs}{4,6}{cfg,ctl}.
            Then groups of selftests are at the end of the file.
            Similarly adjust invocation order of the tests.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cb3086ce