1. 16 Sep, 2019 40 commits
    • net: ena: enable the interrupt_moderation in driver_supported_features · bd21b0cc
      Arthur Kiyanovski authored
      Add driver_supported_features to host_info, a new API used to
      communicate to the device which features are supported by the driver.
      
      Add the interrupt_moderation bit to host_info->driver_supported_features
      and enable it to signal the device that this driver supports interrupt
      moderation properly.
      
      Reserved bits are for features to be implemented in the future.
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: reimplement set/get_coalesce() · b3db86dc
      Arthur Kiyanovski authored
      1. Remove old adaptive interrupt moderation code from set/get_coalesce()
      2. Add ena_update_rx_rings_intr_moderation() function for updating
         nonadaptive interrupt moderation intervals similarly to
         ena_update_tx_rings_intr_moderation().
      3. Remove checks of multiple unsupported received interrupt coalescing
         parameters. This makes the code cleaner and removes the need to
         update it every time a new coalescing parameter is invented.
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: switch to dim algorithm for rx adaptive interrupt moderation · 282faf61
      Arthur Kiyanovski authored
      Use the dim library for the rx adaptive interrupt moderation implementation
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it · 15619e72
      Arthur Kiyanovski authored
      Add intr_moder_rx_interval to struct ena_com_dev and use it as the
      location where the interrupt moderation rx interval is saved, instead
      of the interrupt moderation table.
      
      This is done as a first step before removing the old interrupt moderation
      code.
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ethtool-implement-Energy-Detect-Powerdown-support-via-phy-tunable' · 1b8da103
      David S. Miller authored
      Alexandru Ardelean says:
      
      ====================
      ethtool: implement Energy Detect Powerdown support via phy-tunable
      
      This changeset proposes a new control for PHY tunable to control Energy
      Detect Power Down.
      
      The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
      this feature is common across other PHYs (like EEE), and defining
      `ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
      
      EDPD works by putting the RX block into a lower power mode, except for
      the link-pulse detection circuits. The TX block is also put into a low
      power mode, but the PHY wakes up periodically to send link pulses, to
      avoid lock-ups in case the other side is also in EDPD mode.

      Currently, there are 2 PHY drivers that look like they could use this new
      PHY tunable feature: the `adin` and `micrel` PHYs.
      
      This series updates only the `adin` PHY driver to support this new feature,
      as this chip has been tested. A change for `micrel` can be proposed after a
      discussion of the PHY-tunable API is resolved.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable · 65d7be09
      Alexandru Ardelean authored
      This driver becomes the first user of the kernel's `ETHTOOL_PHY_EDPD`
      phy-tunable feature.
      EDPD is also enabled by default on PHY config_init, but can be disabled via
      the phy-tunable control.
      
      When enabling EDPD, it's also a good idea (for the ADIN PHYs) to enable
      periodic TX pulses, so that in case the other PHY is also in EDPD mode,
      there is no lock-up situation where both sides are waiting for the other
      to transmit.
      
      Via the phy-tunable control, TX pulses can be disabled if specifying 0
      `tx-interval` via ethtool.
      
      The ADIN PHY supports only fixed 1 second intervals; they cannot be
      configured. That is why the acceptable values are 1000 (1 second, in
      milliseconds), ETHTOOL_PHY_EDPD_DFLT_TX_MSECS and ETHTOOL_PHY_EDPD_NO_TX
      (which disables TX pulses).
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: implement Energy Detect Powerdown support via phy-tunable · 9f2f13f4
      Alexandru Ardelean authored
      The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
      this feature is common across other PHYs (like EEE), and defining
      `ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
      
      EDPD works by putting the RX block into a lower power mode, except for
      the link-pulse detection circuits. The TX block is also put into a low
      power mode, but the PHY wakes up periodically to send link pulses, to
      avoid lock-ups in case the other side is also in EDPD mode.

      Currently, there are 2 PHY drivers that look like they could use this new
      PHY tunable feature: the `adin` and `micrel` PHYs.
      
      The ADIN's datasheet mentions that TX pulses are sent at intervals of 1
      second by default, and that they can be disabled. For the Micrel KSZ9031
      PHY, the datasheet does not mention whether they can be disabled, but
      mentions that they can be modified.
      
      This change is structured similarly to the PHY tunable downshift
      control:
      * a `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` value is exposed to cover a default
        TX interval; some PHYs could specify a certain value that makes sense
      * `ETHTOOL_PHY_EDPD_NO_TX` would disable TX when EDPD is enabled
      * `ETHTOOL_PHY_EDPD_DISABLE` will disable EDPD
      
      As the `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` name indicates, the interval unit
      is 1 millisecond, which should cover a reasonable range of intervals:
       - from 1 millisecond, which does not sound like much of a power-saver
       - to ~65 seconds which is quite a lot to wait for a link to come up when
         plugging a cable
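
A minimal sketch of how a consumer of the tunable might interpret a value, written in Python for brevity. The three constant names come from this commit message; their numeric values are recalled from the uapi ethtool header and should be treated as assumptions, as should the 16-bit width (which is consistent with the ~65 second upper bound above). `describe_edpd` is a hypothetical helper, not a kernel function.

```python
# Special values named in the commit message. The numeric values are
# assumptions recalled from the uapi header; verify before relying on them.
ETHTOOL_PHY_EDPD_DFLT_TX_MSECS = 0xFFFF
ETHTOOL_PHY_EDPD_NO_TX = 0xFFFE
ETHTOOL_PHY_EDPD_DISABLE = 0


def describe_edpd(value: int) -> str:
    """Interpret a 16-bit EDPD tunable value (hypothetical helper)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("EDPD tunable must fit in 16 bits")
    if value == ETHTOOL_PHY_EDPD_DISABLE:
        return "EDPD disabled"
    if value == ETHTOOL_PHY_EDPD_NO_TX:
        return "EDPD enabled, TX pulses off"
    if value == ETHTOOL_PHY_EDPD_DFLT_TX_MSECS:
        return "EDPD enabled, PHY default TX interval"
    return f"EDPD enabled, TX pulse every {value} ms"


# Largest explicit interval left once the two sentinels are reserved.
MAX_EXPLICIT_MSECS = ETHTOOL_PHY_EDPD_NO_TX - 1
```

Under these assumptions the largest explicit interval is 65533 ms, which is where the "~65 seconds" figure comes from.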
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb” · 56a4e37e
      Markus Elfring authored
      The dev_kfree_skb() function also performs input parameter validation.
      Thus the tests around the shown calls are not needed.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'drop_monitor-Better-sanitize-notified-packets' · f432c2e3
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      drop_monitor: Better sanitize notified packets
      
      When working in 'packet' mode, drop monitor generates a notification
      with a potentially truncated payload of the dropped packet. The payload
      is copied from the MAC header, but I forgot to check that the MAC header
      was set, so do it now.
      
      Patch #1 sets the offsets to the various protocol layers in netdevsim,
      so that it will continue to work after the MAC header check is added to
      drop monitor in patch #2.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: Better sanitize notified packets · bef17466
      Ido Schimmel authored
      When working in 'packet' mode, drop monitor generates a notification
      with a potentially truncated payload of the dropped packet. The payload
      is copied from the MAC header, but I forgot to check that the MAC header
      was set, so do it now.
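
The check described above can be sketched in user-space Python. This is only a model: `Packet`, `build_payload` and the all-ones sentinel are illustrative assumptions standing in for the skb and its "MAC header was set" test, and treating a truncation length of 0 as "no truncation" is likewise assumed.

```python
from dataclasses import dataclass
from typing import Optional

# Sentinel meaning "offset never set". Assumed here; it models an all-ones
# 16-bit offset and is not taken from the patch itself.
MAC_HEADER_UNSET = 0xFFFF


@dataclass
class Packet:
    data: bytes
    mac_header: int = MAC_HEADER_UNSET


def build_payload(pkt: Packet, trunc_len: int) -> Optional[bytes]:
    """Return the (possibly truncated) payload starting at the MAC header,
    or None when the MAC header offset was never set."""
    if pkt.mac_header == MAC_HEADER_UNSET:
        return None  # skip instead of copying from a bogus offset
    payload = pkt.data[pkt.mac_header:]
    # A truncation length of 0 is treated as "do not truncate" (assumption).
    return payload[:trunc_len] if trunc_len else payload
```

Returning early when the offset is unset is one plausible handling; the point is only that the copy must not start from an offset nobody initialized.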
      
      Fixes: ca30707d ("drop_monitor: Add packet alert mode")
      Fixes: 5e58109b ("drop_monitor: Add support for packet alert mode for hardware drops")
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevsim: Set offsets to various protocol layers · 58a406de
      Ido Schimmel authored
      The driver periodically generates "trapped" UDP packets that it then
      passes on to devlink. Set the offsets to the various protocol layers.
      
      This is a prerequisite to the next patch, where drop monitor is taught
      to check that the offset to the MAC header was set.
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tc-taprio-offload-for-SJA1105-DSA' · db539cae
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      tc-taprio offload for SJA1105 DSA
      
      This is the third attempt to submit the tc-taprio offload model for
      inclusion in the networking tree. The sja1105 switch driver will provide
      the first implementation of the offload. Only the bare minimum is added:
      
      - The offload model and a DSA pass-through
      - The hardware implementation
      - The interaction with the netdev queues in the tagger code
      - Documentation
      
      What has been removed from previous attempts is support for
      PTP-as-clocksource in sja1105, as well as configuring the traffic class
      for management traffic.  These will be added as soon as the offload
      model is settled.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • docs: net: dsa: sja1105: Add info about the Time-Aware Scheduler · 7c95afa4
      Vladimir Oltean authored
      While not an exhaustive usage tutorial, this describes the details
      needed to build more complex scenarios.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload · 317ab5b8
      Vladimir Oltean authored
      This qdisc offload is the closest thing to what the SJA1105 supports in
      hardware for time-based egress shaping. The switch core really is built
      around SAE AS6802/TTEthernet (a TTTech standard) but can be made to
      operate similarly to IEEE 802.1Qbv with some constraints:
      
      - The gate control list is a global list for all ports. There are 8
        execution threads that iterate through this global list in parallel.
        I don't know why 8, there are only 4 front-panel ports.
      
      - Care must be taken by the user to make sure that two execution threads
        never get to execute a GCL entry simultaneously. I created an O(n^4)
        checker for this hardware limitation, which runs prior to accepting a
        taprio offload configuration as valid.
      
      - The spec says that if a GCL entry's interval is shorter than the frame
        length, you shouldn't send it (and end up in head-of-line blocking).
        Well, this switch does anyway.
      
      - The switch has no concept of ADMIN and OPER configurations. Because
        it's so simple, the TAS settings are loaded through the static config
        tables interface, so there isn't even place for any discussion about
        'graceful switchover between ADMIN and OPER'. You just reset the
        switch and upload a new OPER config.
      
      - The switch accepts multiple time sources for the gate events. Right
        now I am using the standalone clock source as opposed to PTP. So the
        base time parameter doesn't really do much. Support for the PTP clock
        source will be added in a future series.
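
To illustrate the "no two GCL entries at the same instant" constraint from the list above, here is a simplified sketch in Python. It is not the driver's checker: each entry is modeled only by its start offset within a per-port cycle, and a gcd test is swapped in to decide whether two periodic start times can ever coincide. All names are hypothetical.

```python
from math import gcd
from typing import List, Tuple

# A schedule is (cycle_time_ns, [entry start offsets in ns within the cycle]).
Schedule = Tuple[int, List[int]]


def entries_collide(start_a: int, cycle_a: int, start_b: int, cycle_b: int) -> bool:
    """Two periodic events start_a + i*cycle_a and start_b + j*cycle_b
    coincide for some i, j >= 0 iff gcd(cycle_a, cycle_b) divides the
    difference of their start offsets."""
    return (start_a - start_b) % gcd(cycle_a, cycle_b) == 0


def schedules_conflict(schedules: List[Schedule]) -> bool:
    """Check every pair of schedules and every pair of their entries."""
    for i, (cycle_a, starts_a) in enumerate(schedules):
        for cycle_b, starts_b in schedules[i + 1:]:
            for sa in starts_a:
                for sb in starts_b:
                    if entries_collide(sa, cycle_a, sb, cycle_b):
                        return True
    return False
```

The four nested loops (schedule pair times entry pair) give a feel for why a brute-force checker over ports and entries ends up quartic.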
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Advertise the 8 TX queues · 5f06c63b
      Vladimir Oltean authored
      This is a preparation patch for the tc-taprio offload (and potentially
      for other future offloads such as tc-mqprio).
      
      Instead of looking directly at skb->priority during xmit, let's get the
      netdev queue and the queue-to-traffic-class mapping, and put the
      resulting traffic class into the dsa_8021q PCP field. The switch is
      configured with a 1-to-1 PCP-to-ingress-queue-to-egress-queue mapping
      (see vlan_pmap in sja1105_main.c), so the effect is that we can inject
      into a front-panel's egress traffic class through VLAN tagging from
      Linux, completely transparently.
      
      Unfortunately the switch doesn't look at the VLAN PCP in the case of
      management traffic to/from the CPU (link-local frames at
      01-80-C2-xx-xx-xx or 01-1B-19-xx-xx-xx) so we can't alter the
      transmission queue of this type of traffic on a frame-by-frame basis. It
      is only selected through the "hostprio" setting which ATM is hardcoded in
      the driver to 7.
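
As a rough illustration of the queue-to-traffic-class-to-PCP path described above, the sketch below builds an 802.1Q TCI value in Python. The PCP position (top 3 bits) and VID width (low 12 bits) follow the standard tag layout; `build_tci`, the mapping table and the example VID are hypothetical and do not reflect the dsa_8021q VID encoding.

```python
from typing import Sequence

VLAN_PRIO_SHIFT = 13    # PCP occupies the top 3 bits of the 16-bit TCI
VLAN_VID_MASK = 0x0FFF  # VLAN ID occupies the low 12 bits


def build_tci(queue: int, queue_to_tc: Sequence[int], vid: int) -> int:
    """Map a TX queue to its traffic class and place it in the PCP field."""
    tc = queue_to_tc[queue]
    if not 0 <= tc <= 7:
        raise ValueError("traffic class must fit in the 3-bit PCP field")
    return (tc << VLAN_PRIO_SHIFT) | (vid & VLAN_VID_MASK)
```

With a 1-to-1 queue-to-class table, queue 3 and a made-up VID of 0x400 yield a TCI of 0x6400.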
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Add static config tables for scheduling · 7f1e4ba8
      Vladimir Oltean authored
      In order to support tc-taprio offload, the TTEthernet egress scheduling
      core registers must be made visible through the static interface.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Pass ndo_setup_tc slave callback to drivers · 47d23af2
      Vladimir Oltean authored
      DSA currently handles shared block filters (for the classifier-action
      qdisc) in the core due to what I believe are simply pragmatic reasons:
      hiding the complexity from drivers and offering a simple API for port
      mirroring.
      
      Extend the dsa_slave_setup_tc function by passing all other qdisc
      offloads to the driver layer, where the driver may choose what it
      implements and how. DSA is simply a pass-through in this case.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taprio: Add support for hardware offloading · 9c66d156
      Vinicius Costa Gomes authored
      This allows taprio to offload the schedule enforcement to capable
      network cards, resulting in more precise windows and less CPU usage.
      
      The gate mask acts on traffic classes (groups of queues of same
      priority), as specified in IEEE 802.1Q-2018, and following the existing
      taprio and mqprio semantics.
      It is up to the driver to perform conversion between tc and individual
      netdev queues if for some reason it needs to make that distinction.
      
      Full offload is requested from the network interface by specifying
      "flags 2" in the tc qdisc creation command, which in turn corresponds to
      the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.
      
      The important detail here is the clockid which is implicitly /dev/ptpN
      for full offload, and hence not configurable.
      
      A reference counting API is added to support the use case where Ethernet
      drivers need to keep the taprio offload structure locally (i.e. they are
      a multi-port switch driver, and configuring a port depends on the
      settings of other ports as well). The refcount_t variable is kept in a
      private structure (__tc_taprio_qopt_offload) and not exposed to drivers.
      
      In the future, the private structure might also be expanded with a
      backpointer to taprio_sched *q, to implement the notification system
      described in the patch (of when admin became oper, or an error occurred,
      etc, so the offload can be monitored with 'tc qdisc show').
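
A small Python sketch of how a gate mask acting on traffic classes can be evaluated against a cyclic schedule. It assumes the cycle time is simply the sum of the entry intervals and that the query time is not before the base time; `open_traffic_classes` and the entry tuples are hypothetical, and only the flag value is taken from the message above.

```python
from typing import List, Tuple

# Flag value quoted in the commit message ("flags 2").
TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD = 0x2

# One schedule entry: (gate_mask, interval_ns). Bit N set means traffic
# class N may transmit during that interval.
Entry = Tuple[int, int]


def open_traffic_classes(entries: List[Entry], base_time: int, now: int) -> List[int]:
    """Return the traffic classes whose gate is open at time `now`."""
    cycle_time = sum(interval for _, interval in entries)
    offset = (now - base_time) % cycle_time
    for gate_mask, interval in entries:
        if offset < interval:
            return [tc for tc in range(8) if gate_mask & (1 << tc)]
        offset -= interval
    return []  # not reached for a non-empty schedule
```

A driver that schedules per queue rather than per class would still need its own class-to-queue translation on top of this, as the message notes.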
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: clarify where phylink should be used · 67e80b99
      Russell King authored
      Update the phylink documentation to make it clear that phylink is
      designed to be used on the MAC-facing side of the link, rather than
      between an SFP and PHY.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bnxt_en-error-recovery-follow-up-patches' · 0a75709b
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: error recovery follow-up patches.
      
      A follow-up patchset for the recently added health and error recovery
      feature.  The first fix is to prevent .ndo_set_rx_mode() from proceeding
      when reset is in progress.  The 2nd fix is for the firmware coredump
      command.  The 3rd and 4th patches update the error recovery process
      slightly to add a state that polls and waits for the firmware to be down.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Add a new BNXT_FW_RESET_STATE_POLL_FW_DOWN state. · 4037eb71
      Vasundhara Volam authored
      This new state is required when firmware indicates that the error
      recovery process requires polling for firmware state to be completely
      down before initiating reset.  For example, firmware may take some
      time to collect the crash dump before it is down and ready to be
      reset.
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Update firmware interface spec. to 1.10.0.100. · 72e0c9f9
      Michael Chan authored
      Some error recovery updates to the spec., among other minor changes.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Increase timeout for HWRM_DBG_COREDUMP_XX commands · 57a8730b
      Vasundhara Volam authored
      Firmware coredump messages take much longer than standard messages,
      so increase the timeout accordingly.
      
      Fixes: 6c5657d0 ("bnxt_en: Add support for ethtool get dump.")
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Don't proceed in .ndo_set_rx_mode() when device is not in open state. · 268d0895
      Michael Chan authored
      Check the BNXT_STATE_OPEN flag instead of netif_running() in
      bnxt_set_rx_mode().  If the driver is going through any reset, such
      as firmware reset or even TX timeout, it may not be ready to set the RX
      mode and may crash.  The new rx mode settings will be picked up when
      the device is opened again later.
      
      Fixes: 230d1f0d ("bnxt_en: Handle firmware reset.")
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Add snd_wnd to TCP_INFO · 8f7baad7
      Thomas Higdon authored
      Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
      performance problems --
      > (1) Usually when we're diagnosing TCP performance problems, we do so
      > from the sender, since the sender makes most of the
      > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
      > From the sender-side the thing that would be most useful is to see
      > tp->snd_wnd, the receive window that the receiver has advertised to
      > the sender.
      
      This serves the purpose of adding an additional __u32 to avoid the
      would-be hole caused by the addition of the tcpi_rcv_ooopack field.
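
The "would-be hole" can be reproduced with a simplified stand-in layout using Python's ctypes. This is not the real struct tcp_info: it keeps only one 64-bit member (which forces 8-byte alignment on common 64-bit ABIs) followed by the new 32-bit fields, and the member names are shortened for illustration.

```python
import ctypes


class TailWithOneU32(ctypes.Structure):
    """Simplified stand-in: a 64-bit member followed by one 32-bit field."""
    _fields_ = [
        ("bytes_retrans", ctypes.c_uint64),
        ("rcv_ooopack", ctypes.c_uint32),
    ]


class TailWithTwoU32(ctypes.Structure):
    """Same layout with a second 32-bit field occupying the padding."""
    _fields_ = [
        ("bytes_retrans", ctypes.c_uint64),
        ("rcv_ooopack", ctypes.c_uint32),
        ("snd_wnd", ctypes.c_uint32),
    ]


def tail_padding(struct_type) -> int:
    """Bytes of padding after the last declared field."""
    last_name = struct_type._fields_[-1][0]
    last = getattr(struct_type, last_name)
    return ctypes.sizeof(struct_type) - (last.offset + last.size)
```

On a typical 64-bit build both stand-ins have the same size, so the second 32-bit field costs nothing and leaves no trailing padding.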
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Add TCP_INFO counter for packets received out-of-order · f9af2dbb
      Thomas Higdon authored
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdio: switch to using gpiod_get_optional() · 40ba6a12
      Dmitry Torokhov authored
      The MDIO device reset line is optional and now that gpiod_get_optional()
      returns a proper value when GPIO support is compiled out, there is no
      reason to use fwnode_get_named_gpiod(), which I plan to hide away.

      Let's switch to using the more standard gpiod_get_optional() and
      gpiod_set_consumer_name() to keep the nice "PHY reset" label.

      Also there is no reason to only try to fetch the reset GPIO when we have
      an OF node; gpiolib can fetch GPIO data from other firmware as well.
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 28f2c362
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2019-09-16
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Now that initial BPF backend for gcc has been merged upstream, enable
         BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
         bpf_sysctl.file_pos, from Ilya.
      
      2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
         related to recent work on exposing BTF info through sysfs, from Andrii.
      
      3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
         headroom to be added twice, from Ciara.
      
      4) Refactoring work to convert sock opt tests into test_progs framework
         in BPF kselftests, from Stanislav.
      
      5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.
      
      6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix accessing bpf_sysctl.file_pos on s390 · d895a0f1
      Ilya Leoshkevich authored
      "ctx:file_pos sysctl:read write ok" fails on s390 with "Read value  !=
      nux". This is because verifier rewrites a complete 32-bit
      bpf_sysctl.file_pos update to a partial update of the first 32 bits of
      64-bit *bpf_sysctl_kern.ppos, which is not correct on big-endian
      systems.
      
      Fix by using an offset on big-endian systems.
      
      Ditto for bpf_sysctl.file_pos reads. Currently the test does not detect
      a problem there, since it expects to see 0, which it gets with high
      probability in error cases, so change it to seek to offset 3 and expect
      3 in bpf_sysctl.file_pos.
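
The endianness problem can be shown with plain byte manipulation. The sketch below (Python, hypothetical helper name) writes a 32-bit value at a chosen byte offset inside a zeroed 64-bit slot and reads the slot back as a 64-bit integer in the same byte order.

```python
def store_pos(value: int, byteorder: str, offset: int) -> int:
    """Write a 32-bit value at `offset` into a zeroed 64-bit slot and
    return the resulting 64-bit integer."""
    slot = bytearray(8)
    slot[offset:offset + 4] = value.to_bytes(4, byteorder)
    return int.from_bytes(slot, byteorder)
```

Writing at offset 0 updates the low half on little-endian but the high half on big-endian; on big-endian the low 32 bits live at offset 4, which is the kind of offset adjustment the fix applies.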
      
      Fixes: e1550bfe ("bpf: Add file_pos field to bpf_sysctl ctx")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20190816105300.49035-1-iii@linux.ibm.com/
    • xdp: Fix race in dev_map_hash_update_elem() when replacing element · af58e7ee
      Toke Høiland-Jørgensen authored
      syzbot found a crash in dev_map_hash_update_elem(), when replacing an
      element with a new one. Jesper correctly identified the cause of the crash
      as a race condition between the initial lookup in the map (which is done
      before taking the lock), and the removal of the old element.
      
      Rather than just add a second lookup into the hashmap after taking the
      lock, fix this by reworking the function logic to take the lock before the
      initial lookup.
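
The lock-then-lookup ordering can be sketched with a toy map in Python. This is an illustration of the ordering only, not the kernel data structure; `DevMapHash` and its methods are hypothetical.

```python
import threading
from typing import Dict, Optional


class DevMapHash:
    """Toy map illustrating the lock-then-lookup ordering (not kernel code)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: Dict[int, str] = {}

    def update_elem(self, key: int, dev: str) -> Optional[str]:
        """Insert or replace `key`; return the element that was replaced."""
        with self._lock:
            # The lookup happens under the lock, so the element found here
            # cannot be removed by another updater before it is replaced.
            old = self._items.get(key)
            self._items[key] = dev
            return old

    def lookup(self, key: int) -> Optional[str]:
        with self._lock:
            return self._items.get(key)


m = DevMapHash()
first = m.update_elem(1, "eth0")   # nothing to replace yet
second = m.update_elem(1, "eth1")  # replaces the earlier element
```

Doing the lookup before taking the lock would leave a window in which the element just found could disappear, which matches the race described above.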
      
      Fixes: 6f9d451a ("xdp: Add devmap_hash map type for looking up devices by hashed index")
      Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'bpf-af-xdp-unaligned-fixes' · a4fa6e16
      Daniel Borkmann authored
      Ciara Loftus says:
      
      ====================
      This patch set contains some fixes for AF_XDP zero copy in the i40e and
      ixgbe drivers as well as a fix for the 'xdpsock' sample application when
      running in unaligned mode.
      
      Patches 1 and 2 fix a regression for the i40e and ixgbe drivers which
      caused the umem headroom to be added to the xdp handle twice, resulting in
      an incorrect value being received by the user for the case where the umem
      headroom is non-zero.
      
      Patch 3 fixes an issue with the xdpsock sample application whereby the
      start of the tx packet data (offset) was not being set correctly when the
      application was being run in unaligned mode.
      
      This patch set has been applied against commit a2c11b03 ("kcm: use
      BPF_PROG_RUN")
      ====================
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: fix xdpsock l2fwd tx for unaligned mode · 5a712e13
      Ciara Loftus authored
      Preserve the offset of the address of the received descriptor, and include
      it in the address set for the tx descriptor, so the kernel can correctly
      locate the start of the packet data.
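
A sketch of why the offset has to travel with the address, assuming the unaligned-mode encoding in which the upper 16 bits of a descriptor address carry the data offset and the lower 48 bits carry the buffer base. The constant names and values are recalled from the uapi if_xdp header and should be treated as assumptions; the helper functions are hypothetical.

```python
# Constants as recalled from the uapi if_xdp header (assumptions).
XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48
XSK_UNALIGNED_BUF_ADDR_MASK = (1 << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1


def extract_addr(addr: int) -> int:
    """Base address of the buffer (low 48 bits)."""
    return addr & XSK_UNALIGNED_BUF_ADDR_MASK


def extract_offset(addr: int) -> int:
    """Offset of the packet data from the base (high 16 bits)."""
    return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT


def data_addr(addr: int) -> int:
    """Where the packet data actually starts."""
    return extract_addr(addr) + extract_offset(addr)
```

If a tx descriptor were filled with only the base address, the data would appear to start at the base rather than at base plus offset, so the offset bits need to be kept in the address written to the tx descriptor.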
      
      Fixes: 03895e63 ("samples/bpf: add buffer recycling for unaligned chunks to xdpsock")
      Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • ixgbe: fix xdp handle calculations · 2e78fc62
      Ciara Loftus authored
      Commit 7cbbf9f1 ("ixgbe: fix xdp handle calculations") reintroduced
      the addition of the umem headroom to the xdp handle in the ixgbe_zca_free,
      ixgbe_alloc_buffer_slow_zc and ixgbe_alloc_buffer_zc functions. However,
      the headroom is already added to the handle in the function
      ixgbe_run_xdp_zc. This commit removes the latter addition and fixes the
      case where the headroom is non-zero.
      
      Fixes: 7cbbf9f1 ("ixgbe: fix xdp handle calculations")
      Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • i40e: fix xdp handle calculations · 168dfc3a
      Ciara Loftus authored
      Commit 4c5d9a7f ("i40e: fix xdp handle calculations") reintroduced
      the addition of the umem headroom to the xdp handle in the i40e_zca_free,
      i40e_alloc_buffer_slow_zc and i40e_alloc_buffer_zc functions. However,
      the headroom is already added to the handle in the function i40e_run_xdp_zc.
      This commit removes the latter addition and fixes the case where the
      headroom is non-zero.
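
The effect of adding the headroom twice can be modeled with simple arithmetic. The functions below are hypothetical Python stand-ins, not driver code; they only capture "headroom applied once at allocation" versus "applied again on the receive path".

```python
def alloc_handle(chunk_addr: int, headroom: int) -> int:
    """Handle as set up at buffer allocation: headroom already included."""
    return chunk_addr + headroom


def run_xdp_handle_buggy(handle: int, headroom: int, data_offset: int) -> int:
    """Adds the headroom a second time on the receive path."""
    return handle + headroom + data_offset


def run_xdp_handle_fixed(handle: int, data_offset: int) -> int:
    """Only adds the offset of the data within the buffer."""
    return handle + data_offset
```

With a headroom of zero both variants agree, which is why the problem only shows up for a non-zero headroom.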
      
      Fixes: 4c5d9a7f ("i40e: fix xdp handle calculations")
      Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: add bpf-gcc support · 4ce150b6
      Ilya Leoshkevich authored
      Now that binutils and gcc support for BPF is upstream, make use of it in
      BPF selftests using alu32-like approach. Share as much as possible of
      CFLAGS calculation with clang.
      
      Fixes only obvious issues, leaving more complex ones for later:
      - Use gcc-provided bpf-helpers.h instead of manually defining the
        helpers, change bpf_helpers.h include guard to avoid conflict.
      - Include <linux/stddef.h> for __always_inline.
      - Add $(OUTPUT)/../usr/include to include path in order to use local
        kernel headers instead of system kernel headers when building with O=.
      
      In order to activate the bpf-gcc support, one needs to configure
      binutils and gcc with --target=bpf and make them available in $PATH. In
      particular, gcc must be installed as `bpf-gcc`, which is the default.
      
      Right now with binutils 25a2915e8dba and gcc r275589 only a handful of
      tests work:
      
      	# ./test_progs_bpf_gcc
      	# Summary: 7/39 PASSED, 1 SKIPPED, 98 FAILED
      
      The reasons for those failures are as follows:
      
      - Build errors:
        - `error: too many function arguments for eBPF` for __always_inline
          functions read_str_var and read_map_var - must be inlining issue,
          and for process_l3_headers_v6, which relies on optimizing away
          function arguments.
        - `error: indirect call in function, which are not supported by eBPF`
          where there are no obvious indirect calls in the source code, e.g.
          in __encap_ipip_none.
        - `error: field 'lock' has incomplete type` for fields of `struct
          bpf_spin_lock` type - bpf_spin_lock is re#defined by bpf-helpers.h,
          so its usage is sensitive to order of #includes.
        - `error: eBPF stack limit exceeded` in sysctl_tcp_mem.
      - Load errors:
        - Missing object files due to above build errors.
        - `libbpf: failed to create map (name: 'test_ver.bss')`.
        - `libbpf: object file doesn't contain bpf program`.
        - `libbpf: Program '.text' contains unrecognized relo data pointing to
          section 0`.
        - `libbpf: BTF is required, but is missing or corrupted` - no BTF
          support in gcc yet.
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Jose E. Marchesi <jose.marchesi@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4ce150b6
    • Alexandru Ardelean's avatar
      net: stmmac: socfpga: re-use the `interface` parameter from platform data · 5f109d45
      Alexandru Ardelean authored
      The socfpga sub-driver defines an `interface` field in the `socfpga_dwmac`
      struct and parses it on init.
      
      The shared `stmmac_probe_config_dt()` function also parses this from the
      device-tree and makes it available on the returned `plat_data` (which is
      the same data available via `netdev_priv()`).
      
      All that's needed now is to dig that information out, via some
      `dev_get_drvdata()` && `netdev_priv()` calls and re-use it.
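
      The lookup chain can be sketched in plain C as below. All `*_model`
      types and helpers are hypothetical userspace stand-ins for the kernel
      structures and for `dev_get_drvdata()` / `netdev_priv()`; the sketch
      assumes the driver data of the device is the net device:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for plat_data, the private data and the netdev. */
struct plat_model   { int interface; };
struct priv_model   { struct plat_model *plat; };
struct netdev_model { struct priv_model priv; };
struct device_model { void *drvdata; };

/* Analogue of dev_get_drvdata(): returns the pointer stored on the device. */
static void *model_get_drvdata(const struct device_model *dev)
{
	return dev->drvdata;
}

/* Analogue of netdev_priv(): returns the private area of the net device. */
static struct priv_model *model_netdev_priv(struct netdev_model *ndev)
{
	return &ndev->priv;
}

/* Re-use the already parsed interface instead of keeping a second copy. */
static int model_get_interface(const struct device_model *dev)
{
	struct netdev_model *ndev = model_get_drvdata(dev);

	assert(ndev != NULL);
	return model_netdev_priv(ndev)->plat->interface;
}
```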
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f109d45
    • David S. Miller's avatar
      Merge branch 'More-fixes-for-unlocked-cls-hardware-offload-API-refactoring' · 95cf6674
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      More fixes for unlocked cls hardware offload API refactoring
      
      Two fixes for my "Refactor cls hardware offload API to support
      rtnl-independent drivers" series and refactoring patch that implements
      infrastructure necessary for the fixes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95cf6674
    • Vlad Buslov's avatar
      net: sched: use get_dev() action API in flow_action infra · 470d5060
      Vlad Buslov authored
      When filling in the hardware intermediate representation, tc_setup_flow_action()
      directly obtains, checks and takes a reference to the dev used by the mirred action,
      instead of using the act->ops->get_dev() API created specifically for this
      purpose. In order to remove code duplication, refactor the flow_action infra to
      use the action API when obtaining the mirred action target dev. Extend get_dev()
      with an additional argument that is used to provide the dev destructor to the
      user.
      
      Fixes: 5a6ff4b1 ("net: sched: take reference to action dev before calling offloads")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      470d5060
    • Vlad Buslov's avatar
      net: sched: take reference to psample group in flow_action infra · 4a5da47d
      Vlad Buslov authored
      With the recent patch set that removed the rtnl lock dependency from the cls hardware
      offload API, the rtnl lock is only taken when reading action data and can be
      released after action-specific data is parsed into the intermediate
      representation. However, the sample action psample group is passed by pointer
      without obtaining a reference to it first, which makes it possible to
      concurrently overwrite the action and deallocate the object pointed to by
      the psample_group pointer after the rtnl lock is released but before the driver
      has finished using the pointer.
      
      To prevent such race condition, obtain reference to psample group while it
      is used by flow_action infra. Extend psample API with function
      psample_group_take() that increments psample group reference counter.
      Extend struct tc_action_ops with new get_psample_group() API. Implement the
      API for action sample using psample_group_take() and already existing
      psample_group_put() as a destructor. Use it in tc_setup_flow_action() to
      take reference to psample group pointed to by entry->sample.psample_group
      and release it in tc_cleanup_flow_action().
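
      The take/put pairing can be illustrated with a single-threaded userspace
      sketch. `group_model` and its helpers are hypothetical analogues of the
      psample group, psample_group_take() and psample_group_put(); the locking
      the kernel needs is deliberately left out:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted psample group. */
struct group_model {
	int refcount;
	int *freed_flag; /* set to 1 when the object is released */
};

static struct group_model *group_model_new(int *freed_flag)
{
	struct group_model *g = malloc(sizeof(*g));

	if (!g)
		return NULL;
	g->refcount = 1; /* reference owned by the action */
	g->freed_flag = freed_flag;
	*freed_flag = 0;
	return g;
}

/* Analogue of psample_group_take(): pin the object for a new user. */
static void group_model_take(struct group_model *g)
{
	g->refcount++;
}

/* Analogue of psample_group_put(): free on the last reference. */
static void group_model_put(struct group_model *g)
{
	assert(g->refcount > 0);
	if (--g->refcount == 0) {
		*g->freed_flag = 1;
		free(g);
	}
}
```

      In this model, overwriting the action drops only the action's own
      reference, so the object stays valid until the offload path calls put.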
      
      Disable bh when taking psample_groups_lock. The lock is now taken while
      holding action tcf_lock that is used by data path and requires bh to be
      disabled, so doing the same for psample_groups_lock is necessary to
      preserve SOFTIRQ-irq-safety.
      
      Fixes: 918190f5 ("net: sched: flower: don't take rtnl lock for cls hw offloads API")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a5da47d
    • Vlad Buslov's avatar
      net: sched: extend flow_action_entry with destructor · 1158958a
      Vlad Buslov authored
      Generalize flow_action_entry cleanup by extending the structure with
      pointer to destructor function. Set the destructor in
      tc_setup_flow_action(). Refactor tc_cleanup_flow_action() to call
      entry->destructor() instead of using switch that dispatches by entry->id
      and manually executes cleanup.
      
      This refactoring is necessary for following patches in this series that
      require destructor to use tc_action->ops callbacks that can't be easily
      obtained in tc_cleanup_flow_action().
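
      A minimal sketch of the destructor-based cleanup, with hypothetical
      names: `entry_model` stands in for flow_action_entry, and the
      `destructor_priv` field is an assumption made for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for flow_action_entry. */
struct entry_model {
	int id;                         /* action type, no longer used for cleanup */
	void (*destructor)(void *priv); /* set when the entry is filled in */
	void *destructor_priv;          /* argument passed to the destructor */
};

/* Example destructor: drops one reference from a plain counter. */
static void put_counter(void *priv)
{
	(*(int *)priv)--;
}

/* Cleanup calls whatever destructor the entry carries, instead of
 * switching on entry->id and open-coding the release per action type. */
static void cleanup_entries(struct entry_model *entries, size_t n)
{
	assert(entries != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (entries[i].destructor)
			entries[i].destructor(entries[i].destructor_priv);
	}
}
```

      Entries without resources leave the destructor unset and are skipped.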
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1158958a