1. 02 Jan, 2020 1 commit
    • Jesper Dangaard Brouer's avatar
      page_pool: handle page recycle for NUMA_NO_NODE condition · 44768dec
      Jesper Dangaard Brouer authored
      The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
      not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
      
      The goal of the NUMA changes in commit d5394610 ("page_pool: Don't
      recycle non-reusable pages"), were to have RX-pages that belongs to the
      same NUMA node as the CPU processing RX-packet during softirq/NAPI. As
      illustrated by the performance measurements.
      
      This patch moves the NAPI checks out of fast-path, and at the same time
      solves the NUMA_NO_NODE issue.
      
      First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
      will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used
      as the the preferred nid.  It is only in rare situations, where
      e.g. NUMA zone runs dry, that page gets doesn't get allocated from
      preferred nid.  The page_pool API allows drivers to control the nid
      themselves via controlling pool->p.nid.
      
      This patch moves the NAPI check to when alloc cache is refilled, via
      dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
      pages from remote NUMA into the ptr_ring, as the dequeue/consume step
      will check the NUMA node. All current drivers using page_pool will
      alloc/refill RX-ring from same CPU running softirq/NAPI process.
      
      Drivers that control the nid explicitly, also use page_pool_update_nid
      when changing nid runtime.  To speed up transision to new nid the alloc
      cache is now flushed on nid changes.  This force pages to come from
      ptr_ring, which does the appropate nid check.
      
      For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
      node, we accept that transitioning the alloc cache doesn't happen
      immediately. The preferred nid change runtime via consulting
      numa_mem_id() based on the CPU processing RX-packets.
      
      Notice, to avoid stressing the page buddy allocator and avoid doing too
      much work under softirq with preempt disabled, the NUMA check at
      ptr_ring dequeue will break the refill cycle, when detecting a NUMA
      mismatch. This will cause a slower transition, but its done on purpose.
      
      Fixes: d5394610 ("page_pool: Don't recycle non-reusable pages")
      Reported-by: default avatarLi RongQing <lirongqing@baidu.com>
      Reported-by: default avatarYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      44768dec
  2. 01 Jan, 2020 1 commit
    • David S. Miller's avatar
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · fe23d634
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      1GbE Intel Wired LAN Driver Updates 2019-12-31
      
      This series contains updates to e1000e, igb and igc only.
      
      Robert Beckett provide an igb change to assist in keeping packets from
      being dropped due to receive descriptor ring being full when receive
      flow control is enabled.  Create a separate function to setup SRRCTL to
      ease in reuse and ensure that setting of the drop enable bit only if
      receive flow control is not enabled.
      
      Sasha adds support for scatter gather support in igc.  Improve the
      direct memory address mapping flow by optimizing/simplifying and more
      clear.  Update igc to use pci_release_mem_regions() instead of
      pci_release_selected_regions().  Clean up function header comments to
      align with the actual code.  Adds support for 64 bit DMA access, to help
      handle socket buffer fragments in high memory.  Adds legacy power
      management support in igc by implementing suspend, resume,
      runtime_suspend/resume, and runtime_idle callbacks.  Clean up references
      to Serdes interface in igc since that interface is not supported for
      i225 devices.
      
      Alex replaces the pr_info calls with netdev_info in all cases related to
      netdev link state, as suggested by Joe Perches.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe23d634
  3. 31 Dec, 2019 30 commits
  4. 30 Dec, 2019 3 commits
    • Eric Dumazet's avatar
      tcp_cubic: refactor code to perform a divide only when needed · f278b99c
      Eric Dumazet authored
      Neal Cardwell suggested to not change ca->delay_min
      and apply the ack delay cushion only when Hystart ACK train
      is still under consideration. This should avoid a 64bit
      divide unless needed.
      
      Tested:
      
      40Gbit(mlx4) testbed (with sch_fq as packet scheduler)
      
      $ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14815
        16280
        15293
        15563
        11574
        15145
        14789
        18548
        16972
        12520
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1396               0.0
      $ dmesg | tail -10
      [ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
      [ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
      [ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
      [ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
      [ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
      [ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
      [ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
      [ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
      [ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
      [ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152
      
      10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)
      
      $ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
         7050
         7065
         7100
         6900
         7202
         7263
         7189
         6869
         7463
         7034
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       3199               0.0
      $ dmesg | tail -10
      [  176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
      [  179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
      [  181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
      [  183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
      [  185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
      [  187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
      [  190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
      [  192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
      [  194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
      [  196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399
      
      Fixes: 42f3a8aa ("tcp_cubic: tweak Hystart detection for short RTT flows")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarNeal Cardwell <ncardwell@google.com>
      Link: https://www.spinics.net/lists/netdev/msg621886.html
      Link: https://www.spinics.net/lists/netdev/msg621797.htmlAcked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f278b99c
    • Rahul Lakkireddy's avatar
      cxgb4/cxgb4vf: fix flow control display for auto negotiation · 0caeaf6a
      Rahul Lakkireddy authored
      As per 802.3-2005, Section Two, Annex 28B, Table 28B-2 [1], when
      _only_ Rx pause is enabled, both symmetric and asymmetric pause
      towards local device must be enabled. Also, firmware returns the local
      device's flow control pause params as part of advertised capabilities
      and negotiated params as part of current link attributes. So, fix up
      ethtool's flow control pause params fetch logic to read from acaps,
      instead of linkattr.
      
      [1] https://standards.ieee.org/standard/802_3-2005.html
      
      Fixes: c3168cab ("cxgb4/cxgbvf: Handle 32-bit fw port capabilities")
      Signed-off-by: default avatarSurendra Mobiya <surendra@chelsio.com>
      Signed-off-by: default avatarRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0caeaf6a
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · ba402810
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner.
      
      2) Document ingress hook in netdevice, also from Lukas.
      
      3) Remove htons() in tunnel metadata port netlink attributes,
         from Xin Long.
      
      4) Missing erspan netlink attribute validation also from Xin Long.
      
      5) Missing erspan version in tunnel, from Xin Long.
      
      6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN}
         Patch from Xin Long.
      
      7) Missing nla_nest_cancel() in tunnel netlink dump path,
         from Xin Long.
      
      8) Remove two exported conntrack symbols with no clients,
         from Florian Westphal.
      
      9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian.
      
      10) Add nft_meta_pkttype helper for loopback, also from Florian.
      
      11) Add nft_meta_socket uid helper, from Florian Westphal.
      
      12) Add nft_meta_cgroup helper, from Florian.
      
      13) Add nft_meta_ifkind helper, from Florian.
      
      14) Group all interface related meta selector, from Florian.
      
      15) Add nft_prandom_u32() helper, from Florian.
      
      16) Add nft_meta_rtclassid helper, from Florian.
      
      17) Add support for matching on the slave device index,
          from Florian.
      
      This batch, among other things, contains updates for the netfilter
      tunnel netlink interface: This extension is still incomplete and lacking
      proper userspace support which is actually my fault, I did not find the
      time to go back and finish this. This update is breaking tunnel UAPI in
      some aspects to fix it but do it better sooner than never.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba402810
  5. 29 Dec, 2019 5 commits
    • Linus Torvalds's avatar
      Linux 5.5-rc4 · fd698849
      Linus Torvalds authored
      fd698849
    • David S. Miller's avatar
      Merge branch 'mlxsw-fixes' · 3faf6eda
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Couple of fixes
      
      This patch set contains two fixes for mlxsw. Please consider both for
      stable.
      
      Patch #1 from Amit fixes a wrong check during MAC validation when
      creating router interfaces (RIFs). Given a particular order of
      configuration this can result in the driver refusing to create new RIFs.
      
      Patch #2 fixes a wrong trap configuration in which VRRP packets and
      routing exceptions were policed by the same policer towards the CPU. In
      certain situations this can prevent VRRP packets from reaching the CPU.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3faf6eda
    • Ido Schimmel's avatar
      mlxsw: spectrum: Use dedicated policer for VRRP packets · acca789a
      Ido Schimmel authored
      Currently, VRRP packets and packets that hit exceptions during routing
      (e.g., MTU error) are policed using the same policer towards the CPU.
      This means, for example, that misconfiguration of the MTU on a routed
      interface can prevent VRRP packets from reaching the CPU, which in turn
      can cause the VRRP daemon to assume it is the Master router.
      
      Fix this by using a dedicated policer for VRRP packets.
      
      Fixes: 11566d34 ("mlxsw: spectrum: Add VRRP traps")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reported-by: default avatarAlex Veber <alexve@mellanox.com>
      Tested-by: default avatarAlex Veber <alexve@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      acca789a
    • Amit Cohen's avatar
      mlxsw: spectrum_router: Skip loopback RIFs during MAC validation · 314bd842
      Amit Cohen authored
      When a router interface (RIF) is created the MAC address of the backing
      netdev is verified to have the same MSBs as existing RIFs. This is
      required in order to avoid changing existing RIF MAC addresses that all
      share the same MSBs.
      
      Loopback RIFs are special in this regard as they do not have a MAC
      address, given they are only used to loop packets from the overlay to
      the underlay.
      
      Without this change, an error is returned when trying to create a RIF
      after the creation of a GRE tunnel that is represented by a loopback
      RIF. 'rif->dev->dev_addr' points to the GRE device's local IP, which
      does not share the same MSBs as physical interfaces. Adding an IP
      address to any physical interface results in:
      
      Error: mlxsw_spectrum: All router interface MAC addresses must have the
      same prefix.
      
      Fix this by skipping loopback RIFs during MAC validation.
      
      Fixes: 74bc9939 ("mlxsw: spectrum_router: Veto unsupported RIF MAC addresses")
      Signed-off-by: default avatarAmit Cohen <amitc@mellanox.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      314bd842
    • Linus Torvalds's avatar
      Merge tag 'riscv/for-v5.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · a99efa00
      Linus Torvalds authored
      Pull RISC-V fixes from Paul Walmsley:
       "One important fix for RISC-V:
      
         - Redirect any incoming syscall with an ID less than -1 to
           sys_ni_syscall, rather than allowing them to fall through into the
           syscall handler.
      
        and two minor build fixes:
      
         - Export __asm_copy_{from,to}_user() from where they are defined.
           This fixes a build error triggered by some randconfigs.
      
         - Export flush_icache_all(). I'd resisted this before, since
           historically we didn't want modules to be able to flush the I$
           directly; but apparently everyone else is doing it now"
      
      * tag 'riscv/for-v5.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: export flush_icache_all to modules
        riscv: reject invalid syscalls below -1
        riscv: fix compile failure with EXPORT_SYMBOL() & !MMU
      a99efa00