1. 01 May, 2019 40 commits
    • ethtool: Add SFF-8436 and SFF-8636 max EEPROM length definitions · 0e1a2a3e
      Erez Alfasi authored
      Added max EEPROM length defines for ethtool usage:
       #define ETH_MODULE_SFF_8636_MAX_LEN     640
       #define ETH_MODULE_SFF_8436_MAX_LEN     640
      
      These definitions are used to determine the EEPROM
      data length when reading high eeprom pages.
      
      For example, SFF-8636 EEPROM data from page 03h
      needs to be stored at data[512] - data[639].
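
      The layout implied by this example can be sketched in C. This is only an
      illustration, assuming a flat dump made of a 128-byte lower page followed
      by 128-byte upper pages 00h to 03h; the helper name
      sff8636_upper_page_offset() and the SFF_PAGE_LEN constant are hypothetical
      and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_MODULE_SFF_8636_MAX_LEN 640 /* value from the commit message */
#define SFF_PAGE_LEN 128                /* assumed size of one EEPROM page */

/*
 * Offset of upper page `page` inside a flat EEPROM dump, assuming the
 * lower page occupies bytes 0..127 and the upper pages follow in order.
 */
static size_t sff8636_upper_page_offset(unsigned int page)
{
	size_t off = SFF_PAGE_LEN + (size_t)page * SFF_PAGE_LEN;

	/* The last byte of the page must still fit in the dump. */
	assert(off + SFF_PAGE_LEN <= ETH_MODULE_SFF_8636_MAX_LEN);
	return off;
}
```

      Under these assumptions page 03h starts at offset 512 and ends at 639,
      which is why the maximum length is 640 bytes.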
      Signed-off-by: Erez Alfasi <ereza@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Return error when trying to insert existing flower filter · 0e1c1a2f
      Vlad Buslov authored
      With unlocked TC it is possible to have spurious deletes and inserts of the
      same filter. The TC layer needs drivers to always return an error when flow
      insertion fails in order to correctly calculate "in_hw_count" for each
      filter. Fix mlx5e_configure_flower() to return -EEXIST when TC tries to
      insert a filter that is already provisioned to the driver.
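
      The contract described here can be sketched outside the driver. The
      following is a minimal illustration, not the mlx5e code: the flow_table
      structure, its fixed size and flow_table_insert() are hypothetical, and
      only the -EEXIST return for a duplicate reflects the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_FLOWS 8 /* arbitrary size for the sketch */

struct flow_table {
	unsigned long cookies[MAX_FLOWS];
	size_t count;
};

/* Returns 0 on success, -EEXIST for a duplicate, -ENOSPC when full. */
static int flow_table_insert(struct flow_table *t, unsigned long cookie)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->count; i++)
		if (t->cookies[i] == cookie)
			return -EEXIST; /* report the failure to the caller */

	if (t->count == MAX_FLOWS)
		return -ENOSPC;

	t->cookies[t->count++] = cookie;
	return 0;
}
```

      Because the duplicate insert reports an error, a caller that counts
      successful insertions does not count the same filter twice.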
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Replace TC VLAN pop with VLAN 0 rewrite in prio tag mode · 0bac1194
      Eli Britstein authored
      Current ConnectX HW is unable to perform VLAN pop in TX path and VLAN
      push on RX path. To work around that limitation, untagged packets are
      tagged with VLAN ID 0x000 (priority tag) and pop/push actions are
      replaced by VLAN re-write actions (which are supported by the HW).
      Replace TC VLAN pop action with a VLAN priority tag header rewrite.
      Signed-off-by: Eli Britstein <elibr@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: ACLs for priority tag mode · 18486737
      Eli Britstein authored
      Current ConnectX HW is unable to perform VLAN pop in TX path and VLAN
      push on RX path. As a workaround, untagged packets are tagged with
      VID 0x000 allowing pop/push actions to be exchanged with VLAN rewrite
      actions.
      Use the ingress ACL table, preceding the FDB, to push VLAN 0x000 ID tag
      for untagged packets and the egress ACL table, succeeding the FDB, to
      pop VLAN 0x000 ID tag.
      Signed-off-by: Eli Britstein <elibr@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Turn on HW tunnel offload in all TIRs · 69dad68d
      Tariq Toukan authored
      Hardware requires that all TIRs that steer traffic to the same RQ
      should share identical tunneled_offload_en value.
      For that, the tunneled_offload_en bit should be set/unset (according to
      the HW capability) for all TIRs, not only the ones dedicated for
      tunneled (inner) traffic.
      
      Fixes: 1b223dd3 ("net/mlx5e: Fix checksum handling for non-stripped vlan packets")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Take common TIR context settings into a function · 7306c274
      Tariq Toukan authored
      Many TIR context settings are common to different TIR types;
      move them into a common function.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Aya Levin <ayal@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · c515e70d
      Saeed Mahameed authored
      This merge commit includes some misc shared code updates from mlx5-next branch needed
      for net-next.
      
      1) From Aya: Enable general events on all physical link types and
         restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma
         driver to ethernet links only as it was intended.
      
      2) From Eli: Introduce low level bits for prio tag mode
      
      3) From Maor: Low level steering updates to support RDMA RX flow
         steering and enable RoCE loopback traffic when switchdev is enabled.
      
      4) From Vu and Parav: Two small mlx5 core cleanups
      
      5) From Yevgeny: Add HW definitions of geneve offloads
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Fix broken hca cap offset · 91a40a48
      Saeed Mahameed authored
      The cited commit broke the offsets of the hca cap struct; fix them.
      While at it, clean up a stray white space introduced by the same commit.
      
      Fixes: b169e64a ("net/mlx5: Geneve, Add flow table capabilities for Geneve decap with TLV options")
      Reported-by: Qian Cai <cai@lca.pw>
      Cc: Yevgeny Kliteynik <kliteyn@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • Merge branch 'net-ll_temac-x86_64-support' · 2a369ae0
      David S. Miller authored
      Esben Haabendal says:
      
      ====================
      net: ll_temac: x86_64 support
      
      This patch series adds support for use of ll_temac driver with
      platform_data configuration and fixes endianness and 64-bit problems so
      that it can be used on x86_64 platform.
      
      A few bugfixes are also included.
      
      Changes since v2:
        - Fixed lp->indirect_mutex initialization regression for OF
          platforms introduced in v2
      
      Changes since v1:
        - Make indirect_mutex specification mandatory when using platform_data
        - Move header to include/linux/platform_data
        - Enable COMPILE_TEST for XILINX_LL_TEMAC
        - Rebased to v5.1-rc7
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Enable DMA when ready, not before · 73f7375d
      Esben Haabendal authored
      As soon as TAILDESCR_PTR is written, DMA transfers might start.
      Let's ensure we are ready to receive DMA IRQs before doing that.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Allow configuration of IRQ coalescing · 7e97a194
      Esben Haabendal authored
      This allows custom setup of IRQ coalescing for platforms using legacy
      platform_device. The irq timeout and count parameters can be used for
      tuning cpu load vs. latency.
      
      I have maintained the 0x00000400 bit in TX_CHNL_CTRL.  It is specified as
      unused in the documentation I have available.  It does not make any
      difference in the hardware I have available, so it is left in to not risk
      breaking other platforms where it might be used.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Replace bad usage of msleep() with usleep_range() · 901d14ab
      Esben Haabendal authored
      Use usleep_range() to avoid problems with msleep() actually sleeping
      much longer than expected.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix bug causing buffer descriptor overrun · 2c9938e7
      Esben Haabendal authored
      As we are actually using a BD for both the skb and each frag contained in
      it, the oldest TX BD would be overwritten when there was exactly one BD
      less than needed.
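
      The accounting behind this fix can be sketched as follows. This is a
      simplified illustration, not the driver code: the tx_ring structure,
      TX_BD_NUM and both helpers are hypothetical, and the only point taken
      from the commit message is that a packet needs one BD for the skb itself
      plus one per fragment.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_BD_NUM 64 /* hypothetical ring size */

struct tx_ring {
	unsigned int in_flight; /* descriptors not yet completed */
};

/* One descriptor for the linear part of the skb plus one per fragment. */
static unsigned int tx_bds_needed(unsigned int nr_frags)
{
	return 1 + nr_frags;
}

/* True only if the whole packet fits without touching in-flight BDs. */
static bool tx_ring_has_room(const struct tx_ring *ring, unsigned int nr_frags)
{
	assert(ring->in_flight <= TX_BD_NUM);
	return TX_BD_NUM - ring->in_flight >= tx_bds_needed(nr_frags);
}
```

      With three free descriptors, a packet with two fragments fits, while a
      packet with three fragments is exactly one BD short and must be refused.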
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix iommu/swiotlb leak · a8c9bd3b
      Esben Haabendal authored
      Unmap the actual buffer length, not the amount of data received, avoiding
      resource exhaustion of swiotlb (seen on x86_64 platform).
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Support indirect_mutex share within TEMAC IP · f14f5c11
      Esben Haabendal authored
      Indirect register access goes through a DCR bus bridge, which
      allows only one outstanding transaction.  And to make matters
      worse, each TEMAC IP block contains two Ethernet interfaces, and
      although they seem to have separate registers for indirect access,
      they actually share the registers.  Or to be more specific, MSW, LSW
      and CTL registers are physically shared between Ethernet interfaces
      in same TEMAC IP, with RDY register being (almost) specific to
      the Ethernet interface.  The 0x10000 bit in RDY reflects combined
      bus ready state though.
      
      So we need to take care to synchronize not only within a single
      device, but also between devices in same TEMAC IP.
      
      This commit makes that possible for legacy platform devices.
      
      For OF devices, the xlnx,compound parent of the temac node should be
      used to find siblings, and setup a shared indirect_mutex between them.
      I will leave this work to somebody else, as I don't have hardware to
      test that.  No regression is introduced by that, as before this commit
      using two Ethernet interfaces in same TEMAC block is simply broken.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Allow use on x86 platforms · 2c02c37e
      Esben Haabendal authored
      With little-endian and 64-bit support in place, the ll_temac driver can
      now be used on x86 and x86_64 platforms.
      
      And while at it, enable COMPILE_TEST also.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix support for little-endian platforms · fdd7454e
      Esben Haabendal authored
      Both TEMAC and SDMA are big-endian, so make sure that all values in SDMA
      buffer descriptors (cdmac_bd) are handled as big-endian, independent of the
      host endianness. With all currently supported platforms being big-endian,
      this does not change behavior for any of them.
      
      Note, when using app3 and app4 for piggybacking skb pointers there is no
      need to care about endianness, as neither TEMAC nor SDMA access app3 and
      app4 in TX buffer descriptors.
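
      What "handled as big-endian, independent of the host" means can be
      sketched in portable C. This is an illustration only: put_be32() and
      get_be32() are hypothetical helpers written for this example, whereas
      kernel code would normally rely on conversion helpers such as
      cpu_to_be32() and be32_to_cpu().

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian byte order, whatever the host is. */
static void put_be32(uint8_t *dst, uint32_t val)
{
	assert(dst);
	dst[0] = (uint8_t)(val >> 24);
	dst[1] = (uint8_t)(val >> 16);
	dst[2] = (uint8_t)(val >> 8);
	dst[3] = (uint8_t)val;
}

/* Read a big-endian 32-bit value back into host byte order. */
static uint32_t get_be32(const uint8_t *src)
{
	assert(src);
	return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
	       ((uint32_t)src[2] << 8) | (uint32_t)src[3];
}
```

      Fields that the hardware never reads, such as app3 and app4 in this
      driver, do not need this treatment, as the note above explains.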
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Add support for non-native register endianness · a3246dc4
      Esben Haabendal authored
      Replace the powerpc specific MMIO register access functions with the
      generic big-endian mmio access functions, and add support for
      little-endian access depending on configuration.
      
      Big-endian access is maintained as the default, but little-endian can
      be configured in device-tree binding or in platform data.
      
      The temac_ior()/temac_iow() functions are replaced with macro wrappers
      to avoid modifying existing code more than necessary.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix support for 64-bit platforms · d84aec42
      Esben Haabendal authored
      The use of buffer descriptor APP4 field (32-bit) for storing skb pointer
      obviously does not work on 64-bit platforms.
      As APP3 is also unused, we can use that to store the other half of 64-bit
      pointer values.
      
      Contrary to what is hinted at in commit message of commit 15bfe05c
      ("net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit")
      there are no other pointers stored in cdmac_bd.
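
      Splitting a pointer across two 32-bit fields can be sketched as below.
      This is a minimal illustration under assumptions: the bd_words structure
      and both helpers are hypothetical, and which field holds the upper half
      is a choice made for this example, not something the commit message
      states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two spare 32-bit descriptor words. */
struct bd_words {
	uint32_t app3; /* assumed here to hold the upper 32 bits */
	uint32_t app4; /* assumed here to hold the lower 32 bits */
};

/* Split a pointer into two 32-bit halves. */
static void ptr_to_bd(void *p, struct bd_words *bd)
{
	uint64_t v = (uint64_t)(uintptr_t)p;

	assert(bd);
	bd->app3 = (uint32_t)(v >> 32);
	bd->app4 = (uint32_t)(v & 0xffffffffu);
}

/* Reassemble the pointer from the two halves. */
static void *ptr_from_bd(const struct bd_words *bd)
{
	uint64_t v;

	assert(bd);
	v = ((uint64_t)bd->app3 << 32) | bd->app4;
	return (void *)(uintptr_t)v;
}
```

      On a 32-bit host the upper half is simply zero, so the same round trip
      works on both kinds of platform.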
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Extend support to non-device-tree platforms · 8425c41d
      Esben Haabendal authored
      Support initialization with platdata, so the driver can be used on
      non-device-tree platforms.
      
      For currently supported device-tree platforms, the driver should behave
      as before.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ll_temac: Fix and simplify error handling by using devres functions · a63625d2
      Esben Haabendal authored
      As a side effect, a few error cases are fixed.
      
      If of_iomap() of sdma_regs failed, no error code was returned.  Fixed to
      return -ENOMEM similar to of_iomap() fail of regs.
      
      If sysfs_create_group() or register_netdev() failed, lp->phy_node was not
      released.
      
      Finally, the order in remove function is corrected to be reverse order
      of what is done in probe, i.e. calling temac_mdio_teardown() last, so we
      unregister the netdev that most likely is using the mdio_bus first.
      Signed-off-by: Esben Haabendal <esben@geanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: ti: cpsw: Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe() · ac97a359
      YueHaibing authored
      Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe().
      The proper pointer to use is clk instead of mode.
      
      This issue was detected with the help of Coccinelle.
      
      Fixes: 83a8471b ("net: ethernet: ti: cpsw: refactor probe to group common hw initialization")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-sched-taprio-change-schedules' · 5b27aafa
      David S. Miller authored
      Vinicius Costa Gomes says:
      
      ====================
      net/sched: taprio change schedules
      
      Changes from RFC:
       - Removed the patches for taprio offloading, because of the lack of
         in-tree users;
       - Updated the links to point to the PATCH version of this series;
      
      Original cover letter:
      
      Overview
      --------
      
      This RFC has two objectives, it adds support for changing the running
      schedules during "runtime", explained in more detail later, and
      proposes an interface between taprio and the drivers for hardware
      offloading.
      
      These two different features are presented together so it's clear what
      the "final state" would look like. But after the RFC stage, they can
      be proposed (and reviewed) separately.
      
      Changing the schedules without disrupting traffic is important for
      handling dynamic use cases, for example, when streams are
      added/removed and when the network configuration changes.
      
      Hardware offloading support allows schedules to be more precise and
      have lower resource usage.
      
      Changing schedules
      ------------------
      
      The same as the other interfaces we proposed, we try to use the same
      concepts as the IEEE 802.1Q-2018 specification. So, for changing
      schedules, there are an "oper" (operational) and an "admin" schedule.
      The "admin" schedule is mutable and not in use, the "oper" schedule is
      immutable and is in use.
      
      That is, when the user first adds a schedule it is in the "admin"
      state, and it becomes "oper" when its base-time (basically when it
      starts) is reached.
      
      What this means is that now it's possible to create taprio with a schedule:
      
      $ tc qdisc add dev IFACE parent root handle 100 taprio \
            num_tc 3 \
            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
            queues 1@0 1@1 2@2 \
            base-time 10000000 \
            sched-entry S 03 300000 \
            sched-entry S 02 300000 \
            sched-entry S 06 400000 \
            clockid CLOCK_TAI
      
      And then, later, after the previous schedule is "promoted" to "oper",
      add a new ("admin") schedule to be used some time later:
      
      $ tc qdisc change dev IFACE parent root handle 100 taprio \
            base-time 1553121866000000000 \
            sched-entry S 02 500000 \
            sched-entry S 0f 400000 \
            clockid CLOCK_TAI
      
      When enabling the ability to change schedules, it makes sense to add
      two more defined knobs to schedules: "cycle-time" allows truncating a
      cycle to some value, so it repeats after a well-defined value;
      "cycle-time-extension" controls how much an entry can be extended if
      it's the last one before the change of schedules, the reason is to
      avoid a very small cycle when transitioning from a schedule to
      another.
      
      With these, taprio in the software mode should provide a fairly
      complete implementation of what's defined in the Enhancements for
      Scheduled Traffic parts of the specification.
      
      Hardware offload
      ----------------
      
      Some workloads require better guarantees from their schedules than
      what's provided by the software implementation. This series proposes
      an interface for configuring schedules into compatible network
      controllers.
      
      This part is proposed together with the support for changing
      schedules, because it raises questions like, should the "qdisc" side
      be responsible for providing visibility into the schedules or should it
      be the driver?
      
      In this proposal, the driver is called passing the new schedule as
      soon as it is validated, and the "core" qdisc takes care of displaying
      (".dump()") the correct schedules at all times. It means that some
      logic would need to be duplicated in the driver, if the hardware
      doesn't have support for multiple schedules. But as taprio doesn't
      have enough information about the underlying controller to know how
      much in advance a schedule needs to be informed to the hardware, it
      feels like a fair compromise.
      
      The hardware offloading part of this proposal also tries to define an
      interface for frame-preemption and how it interacts with the
      scheduling of traffic, see Section 8.6.8.4 of IEEE 802.1Q-2018 for
      more information.
      
      One important difference between the qdisc interface and the
      qdisc-driver interface, is that the "gate mask" on the qdisc side
      references traffic classes, that is bit 0 of the gate mask means
      Traffic Class 0, and in the driver interface, it specifies the queues,
      that is bit 0 means queue 0. That is to say that taprio converts the
      references to traffic classes to references to queues before sending
      the offloading request to the driver.
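
      The conversion described in this paragraph can be sketched as follows.
      This is an illustration under assumptions, not code from the series: the
      tc_queues structure and tc_mask_to_queue_mask() are hypothetical names,
      and the mapping follows the "queues count@offset" notation used in the tc
      examples above.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TC 16

/* Queue range owned by one traffic class, as in "queues count@offset". */
struct tc_queues {
	unsigned int count;
	unsigned int offset;
};

/* Convert a gate mask indexed by traffic class into one indexed by queue. */
static uint32_t tc_mask_to_queue_mask(uint32_t tc_mask,
				      const struct tc_queues *map,
				      unsigned int num_tc)
{
	uint32_t queue_mask = 0;

	assert(num_tc <= MAX_TC);
	for (unsigned int tc = 0; tc < num_tc; tc++) {
		if (!(tc_mask & (1u << tc)))
			continue; /* gate closed for this traffic class */
		for (unsigned int q = 0; q < map[tc].count; q++) {
			assert(map[tc].offset + q < 32);
			queue_mask |= 1u << (map[tc].offset + q);
		}
	}
	return queue_mask;
}
```

      With "queues 1@0 1@1 2@2", a traffic-class mask of 0x06 (classes 1 and 2)
      becomes a queue mask of 0x0e (queues 1, 2 and 3).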
      
      Request for help
      ----------------
      
      I would appreciate it if interested driver maintainers could take a look at
      the proposed interface and see if it's going to be too awkward for any
      particular device. Also, pointers to available documentation would be
      appreciated. The idea here is to start a discussion so we can have an
      interface that would work for multiple vendors.
      
      Links
      -----
      
      kernel patches:
      https://github.com/vcgomes/net-next/tree/taprio-add-support-for-change-v3
      
      iproute2 patches:
      https://github.com/vcgomes/iproute2/tree/taprio-add-support-for-change-v3
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support for cycle-time-extension · c25031e9
      Vinicius Costa Gomes authored
      IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
      last entry of a schedule before the start of a new schedule can be
      extended, so "too-short" entries can be avoided.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support for setting the cycle-time manually · 6ca6a665
      Vinicius Costa Gomes authored
      IEEE 802.1Q-2018 defines that the cycle-time of a schedule may be
      overridden, so the schedule is truncated to a determined "width".
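
      The effect of truncating a schedule to a given cycle-time can be sketched
      as below. This is a simplified model, not the taprio implementation:
      active_entry() is a hypothetical helper, times are plain nanosecond
      offsets from base-time, and the sketch is only meaningful when the
      cycle-time does not exceed the sum of the entry intervals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Index of the schedule entry that is open `now` ns after base-time.
 * When `cycle_time` is shorter than the sum of the intervals, the tail
 * of the schedule is cut off and the cycle restarts from entry 0.
 */
static size_t active_entry(const uint64_t *intervals, size_t n,
			   uint64_t cycle_time, uint64_t now)
{
	uint64_t pos, end = 0;

	assert(n > 0 && cycle_time > 0);
	pos = now % cycle_time; /* offset inside the current cycle */

	for (size_t i = 0; i < n; i++) {
		end += intervals[i];
		if (pos < end)
			return i;
	}
	/* Only reached if cycle_time exceeds the sum of the intervals. */
	return n - 1;
}
```

      For entries of 300000, 300000 and 400000 ns, a cycle-time of 700000 ns
      lets the third entry run for only 100000 ns before the first entry
      starts again.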
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support adding an admin schedule · a3d43c0d
      Vinicius Costa Gomes authored
      The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
      operational?) and "Admin" ones. Up until now, 'taprio' only had
      support for the "Oper" one, added when the qdisc is created. This adds
      support for the "Admin" one, which allows the .change() operation to
      be supported.
      
      Just for clarification, some quick (and dirty) definitions, the "Oper"
      schedule is the currently (as in this instant) running one, and it's
      read-only. The "Admin" one is the one that the system configurator has
      installed, it can be changed, and it will be "promoted" to "Oper" when
      its 'base-time' is reached.
      
      The idea behind this patch is that calling something like the below,
      (after taprio is already configured with an initial schedule):
      
      $ tc qdisc change taprio dev IFACE parent root \
            base-time X \
            sched-entry <CMD> <GATES> <INTERVAL> \
            ...
      
      Will cause a new admin schedule to be created and programmed to be
      "promoted" to "Oper" at instant X. If an "Admin" schedule already
      exists, it will be overwritten with the new parameters.
      
      Up until now, there was some code that was added to ease the support
      of changing a single entry of a schedule, but was ultimately unused.
      Now that we have support for "change" with better thought-out
      semantics, updating a single entry seems to be less useful.
      
      So we remove what is in practice dead code, and return a "not
      supported" error if the user tries to use it. If changing a single
      entry would make the user's life easier we may resurrect this idea,
      but at this point, removing it simplifies the code.
      
      For now, only the schedule specific bits are allowed to be added for a
      new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
      cannot be modified.
      
      Example:
      
      $ tc qdisc change dev IFACE parent root handle 100 taprio \
            base-time $BASE_TIME \
            sched-entry S 00 500000 \
            sched-entry S 0f 500000 \
            clockid CLOCK_TAI
      
      The only change in the netlink API introduced by this change is the
      introduction of an "admin" type in the response to a dump request,
      that type allows userspace to separate the "oper" schedule from the
      "admin" schedule. If userspace doesn't support the "admin" type, it
      will only display the "oper" schedule.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Fix potencial use of invalid memory during dequeue() · 8c79f0ea
      Vinicius Costa Gomes authored
      Right now, this isn't a problem, but the next commit allows schedules
      to be added during runtime. When a new schedule transitions from the
      inactive to the active state ("admin" -> "oper") the previous one can
      be freed, if it's freed just after the RCU read lock is released, we
      may access an invalid entry.
      
      So, we should take care to protect the dequeue() flow, so all the
      places that access the entries are protected by the RCU read lock.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-undo-congestion' · cd86972a
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      undo congestion window on spurious SYN or SYNACK timeout
      
      Linux TCP currently uses the initial congestion window of 1 packet
      after multiple SYN or SYNACK timeouts, per RFC6298. However such
      timeouts are often spurious on wireless or cellular networks that
      experience high delay variances (e.g. ramping up dormant radios or
      local link retransmission). Another case is when the underlying
      path is longer than the default SYN timeout (e.g. 1 second). In
      these cases starting the transfer with a minimal congestion window
      is detrimental to the performance for short flows.
      
      One naive approach is to simply ignore SYN or SYNACK timeouts and
      always use a larger or default initial window. This approach however
      risks pouring gas on the fire when the network is already highly
      congested. This is particularly true in data centers, where applications
      could start thousands to millions of connections over a single or
      multiple hosts resulting in high SYN drops (e.g. incast).
      
      This patch-set detects spurious SYN and SYNACK timeouts upon
      completing the handshake via the widely-supported TCP timestamp
      options. Upon such events the sender reverts to the default
      initial window to start the data transfer so it gets best of both
      worlds. This patch-set supports this feature for both active and
      passive as well as Fast Open or regular connections.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refactor setting the initial congestion window · 98fa6271
      Yuchung Cheng authored
      Relocate the congestion window initialization from tcp_init_metrics()
      to tcp_init_transfer() to improve code readability.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refactor to consolidate TFO passive open code · 6b94b1c8
      Yuchung Cheng authored
      Use a helper to consolidate two identical code blocks for passive TFO.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: undo cwnd on Fast Open spurious SYNACK retransmit · 794200d6
      Yuchung Cheng authored
      This patch makes passive Fast Open revert the cwnd to the default
      initial cwnd (10 packets) if the SYNACK timeout is spurious.
      
      Passive Fast Open uses a full socket during handshake so it can
      use the existing undo logic to detect spurious retransmission
      by recording the first SYNACK timeout in key state variable
      retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket
      has sent some data before the timeout, the spurious timeout
      is detected by tcp_try_undo_recovery() in tcp_process_loss()
      in tcp_ack().
      
      But if the socket has not sent any data yet, tcp_ack() does not
      execute the undo code since no data is acknowledged. The fix is to
      check such case explicitly after tcp_ack() during the ACK processing
      in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state
      in case the server closes the socket before handshake completes.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: lower congestion window on Fast Open SYNACK timeout · 8c3cfe19
      Yuchung Cheng authored
      A TCP sender uses a congestion window of 1 packet after the second SYN
      or SYNACK timeout, except with passive TCP Fast Open. This makes passive
      TFO too aggressive and unfair during congestion at handshake. This
      patch fixes this issue so TCP (fast open or not, passive or active)
      always conforms to the RFC6298.
      
      Note that tcp_enter_loss() is called only once during recurring
      timeouts.  This is because during handshake, high_seq and snd_una
      are the same so tcp_enter_loss() would incorrectly set the undo state
      variables multiple times.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: undo init congestion window on false SYNACK timeout · 336c39a0
      Yuchung Cheng authored
      Linux implements RFC6298 and uses an initial congestion window
      of 1 upon establishing the connection if the SYNACK packet is
      retransmitted 2 or more times. In cellular networks SYNACK timeouts
      are often spurious if the wireless radio was dormant or idle. Also,
      some network paths are longer than the default SYNACK timeout. In
      both cases falsely starting with a minimal cwnd is detrimental
      to performance.
      
      This patch avoids doing so when the final ACK's TCP timestamp
      indicates the original SYNACK was delivered. It remembers the
      original SYNACK timestamp when SYNACK timeout has occurred and
      re-uses the function to detect spurious SYN timeout conveniently.
      
      Note that a server may receive multiple SYNs from a client and
      immediately retransmit SYNACKs without any SYNACK timeout. This often
      happens when the client SYNs have timed out due to the wireless delay
      described above. In this case the server will still use the default
      initial congestion window (e.g. 10) because tp->undo_marker is reset in
      tcp_init_metrics(). This is an intentional design because packets
      are not lost but delayed.
      
      This patch only covers regular TCP passive open. Fast Open is
      supported in the next patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      336c39a0
    • Yuchung Cheng's avatar
      tcp: better SYNACK sent timestamp · 9e450c1e
      Yuchung Cheng authored
      Detecting a spurious SYNACK timeout using the timestamp option requires
      recording the exact SYNACK skb timestamp. Previously the SYNACK
      sent timestamp was taken slightly before the skb
      was transmitted. This patch uses the SYNACK skb transmission
      timestamp directly.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e450c1e
    • Yuchung Cheng's avatar
      tcp: undo initial congestion window on false SYN timeout · 7c1f0815
      Yuchung Cheng authored
      Linux implements RFC 6298 and uses an initial congestion window of 1
      upon establishing the connection if the SYN packet is retransmitted 2
      or more times. In cellular networks SYN timeouts are often spurious
      if the wireless radio was dormant or idle. Also, some network paths
      are longer than the default SYN timeout. Having a minimal cwnd in
      both cases is detrimental to TCP startup performance.
      
      This patch extends the TCP undo feature (RFC 3522 aka TCP Eifel) to
      detect spurious SYN timeouts via TCP timestamps. Since tp->retrans_stamp
      records the initial SYN timestamp instead of the first retransmission,
      we have to implement separate undo code. The detection
      also must happen before tcp_ack() as retrans_stamp is reset when
      the SYN is acknowledged.
      
      Note this patch covers both active regular and fast open.
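
The timestamp-based detection described above can be sketched roughly as follows. This is a minimal illustration, not the patch's code: the function name and parameters are hypothetical, and it assumes the check amounts to comparing the timestamp echoed by the peer against the recorded timestamp of the original SYN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function).
 *
 * syn_stamp:  timestamp recorded when the original SYN was sent
 *             (0 means "not recorded").
 * saw_tstamp: whether the incoming segment carried a timestamp option.
 * tsecr:      the timestamp value echoed back by the peer.
 *
 * If the peer echoes the timestamp of the original SYN, the original SYN
 * was delivered and the timeout that triggered the retransmission was
 * spurious, so the congestion window reduction can be undone.
 */
static bool syn_timeout_was_spurious(uint32_t syn_stamp, bool saw_tstamp,
				     uint32_t tsecr)
{
	return syn_stamp != 0 && saw_tstamp && tsecr == syn_stamp;
}
```

An echo of a later timestamp, or a segment without the timestamp option, leaves the conservative window in place, since the sender then cannot tell which SYN was answered.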
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c1f0815
    • Yuchung Cheng's avatar
      tcp: avoid unconditional congestion window undo on SYN retransmit · bc9f38c8
      Yuchung Cheng authored
      Previously if an active TCP open had a SYN timeout, it always undid the
      cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue
      would reset tp->retrans_stamp when the SYN is acked, which then fools
      tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is
      required to properly support undo for spurious SYN timeout.
      
      Fixing this is tricky -- for active TCP open tp->retrans_stamp
      records the time when the handshake starts, not the first
      retransmission time as the name may suggest. The simplest fix is
      for tcp_packet_delayed to ensure retrans_stamp is valid before
      comparing it with other timestamps.
      
      One side effect of this change affects active TCP Fast Open connections
      that incurred a SYN timeout. Upon receiving a SYN-ACK that only
      acknowledged the SYN, it would immediately retransmit unacknowledged
      data in tcp_ack() because the data is marked lost after SYN timeout.
      But the retransmission would have an incorrect ack sequence number
      since rcv_nxt has not been updated yet in
      tcp_rcv_synsent_state_process(), so the retransmission needs to be
      properly handled by tcp_rcv_fastopen_synack() like before.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc9f38c8
    • Gustavo A. R. Silva's avatar
      netdevsim: fix fall-through annotation · 6d1474a9
      Gustavo A. R. Silva authored
      Replace "pass through" with a proper "fall through" annotation
      in order to fix the following warning:
      
      drivers/net/netdevsim/bus.c: In function ‘new_device_store’:
      drivers/net/netdevsim/bus.c:170:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
         port_count = 1;
         ~~~~~~~~~~~^~~
      drivers/net/netdevsim/bus.c:172:2: note: here
        case 2:
        ^~~~
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This fix is part of the ongoing efforts to enable
      -Wimplicit-fallthrough.
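
For readers unfamiliar with the annotation, here is a small sketch of the pattern the warning asks for. The function and its parameters are hypothetical and only loosely modeled on the warning text above; it is not the code from drivers/net/netdevsim/bus.c. At warning level 3, GCC accepts a comment such as "fall through" placed directly before the next case label.

```c
#include <assert.h>

/*
 * Hypothetical example: "fields" is the number of values parsed from user
 * input, "parsed_port_count" is the second value when it was supplied.
 * Returns the port count to use, or -1 on invalid input.
 */
static int port_count_from_fields(int fields, int parsed_port_count)
{
	int port_count = parsed_port_count;

	switch (fields) {
	case 1:
		/* Only the first value was given: default to one port. */
		port_count = 1;
		/* fall through */
	case 2:
		return port_count > 0 ? port_count : -1;
	default:
		return -1;
	}
}
```

Without the comment, a compiler run with -Wimplicit-fallthrough=3 would flag the transition from case 1 into case 2 even though it is intended.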
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d1474a9
    • Gustavo A. R. Silva's avatar
      sfc: mcdi_port: Mark expected switch fall-through · 4a46a7c3
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      This patch fixes the following warning:
      
      drivers/net/ethernet/sfc/mcdi_port.c: In function ‘efx_mcdi_phy_decode_link’:
      ./include/linux/compiler.h:77:22: warning: this statement may fall through [-Wimplicit-fallthrough=]
       # define unlikely(x) __builtin_expect(!!(x), 0)
                            ^~~~~~~~~~~~~~~~~~~~~~~~~~
      ./include/asm-generic/bug.h:125:2: note: in expansion of macro ‘unlikely’
        unlikely(__ret_warn_on);     \
        ^~~~~~~~
      drivers/net/ethernet/sfc/mcdi_port.c:344:3: note: in expansion of macro ‘WARN_ON’
         WARN_ON(1);
         ^~~~~~~
      drivers/net/ethernet/sfc/mcdi_port.c:345:2: note: here
        case MC_CMD_FCNTL_OFF:
        ^~~~
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enable
      -Wimplicit-fallthrough.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a46a7c3
    • Moshe Shemesh's avatar
      devlink: Change devlink health locking mechanism · b587bdaf
      Moshe Shemesh authored
      The devlink health reporters create/destroy and user commands currently
      use the devlink->lock as a locking mechanism. Different reporters have
      different rules in the driver and are being created/destroyed during
      different stages of driver load/unload/running. So during execution of
      one reporter's recover, the flow can go through another reporter's
      destroy and create. Such a flow leads to a deadlock when trying to lock
      a mutex that is already held.
      
      With the new locking mechanism the different reporters share a mutex
      lock only to protect access to the shared reporters list.
      A refcount per reporter is added to protect reporters from being
      destroyed while they are in use.
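
The per-reporter refcount idea can be illustrated with a simplified user-space sketch. All names here are hypothetical and this is not the devlink implementation; the list mutex is omitted, and the sketch only shows how a reference taken by a user keeps the object alive until the last reference is dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reporter object; not the kernel's devlink structure. */
struct reporter {
	atomic_int refcount;	/* starts at 1: the reference held by the list */
	bool destroyed;		/* set when the last reference is dropped */
};

static void reporter_init(struct reporter *r)
{
	atomic_init(&r->refcount, 1);
	r->destroyed = false;
}

/* Called by a user (e.g. a recover flow) before working on the reporter. */
static void reporter_get(struct reporter *r)
{
	atomic_fetch_add(&r->refcount, 1);
}

/* Drops one reference; the object is torn down only on the last put. */
static void reporter_put(struct reporter *r)
{
	if (atomic_fetch_sub(&r->refcount, 1) == 1)
		r->destroyed = true;
}
```

In this sketch a destroy request simply drops the list's reference with reporter_put(), so a reporter that is still being used survives until its user calls reporter_put() as well.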
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b587bdaf
    • David S. Miller's avatar
      Merge branch 'aquantia-next' · 5be90f99
      David S. Miller authored
      Igor Russkikh says:
      
      ====================
      net: atlantic: Aquantia driver updates 2019-04
      
      This patchset contains various improvements:
      
      - Work targeting link up speedups: link interrupt introduced, some other
        logic changes to improve this.
      - FW operations secured with a mutex
      - Counters and statistics logic improved by Dmitry
      - Readout of chip temperature via hwmon interface implemented by
        Yana and Nikita.
      
      v4 changes:
      - remove drvinfo_exit noop
      - 64bit stats should be read out sequentially (lsw, then msw);
        declare 64bit read ops for that
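
The sequential 64bit readout mentioned in the v4 changes could look roughly like the sketch below. The accessor type, function names, and the assumption that the high word sits 4 bytes after the low word are all hypothetical; the actual driver ops may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessor: reads one 32-bit register at a byte offset. */
typedef uint32_t (*reg_read32_fn)(void *ctx, uint32_t offset);

/* Read a 64-bit counter as two 32-bit reads: low word first, then high. */
static uint64_t stat_read64(reg_read32_fn read32, void *ctx, uint32_t offset)
{
	uint64_t lsw = read32(ctx, offset);	/* least significant word */
	uint64_t msw = read32(ctx, offset + 4);	/* most significant word */

	return (msw << 32) | lsw;
}

/* Fake register file for demonstration: ctx points at an array of words. */
static uint32_t fake_read32(void *ctx, uint32_t offset)
{
	const uint32_t *regs = ctx;

	return regs[offset / 4];
}
```

Keeping the two reads inside one helper fixes the access order in a single place instead of relying on each caller to get it right.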
      
      v3 changes:
      - temp ops renamed to phy_temp ops
      - mutex commits squashed for better structure
      
      v2 changes:
      - use threaded irq for link state handling
      - rework hwmon via devm_hwmon_device_register_with_info
      Extra comments on review from Andrew:
      - direct device name pointer is used in hwmon registration.
        This causes hwmon device to derive possible interface name changes
      - Will consider sanity checks for the firmware mutex lock separately.
        Right now there is no single point where such a check could
        be easily added.
      - There is no way now to fetch and configure min/max/crit temperatures
        via FW. Will investigate this separately.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5be90f99