1. 12 May, 2015 4 commits
    • mm/net: Rename and move page fragment handling from net/ to mm/ · b63ae8ca
      Alexander Duyck authored
      This change moves the __alloc_page_frag functionality out of the networking
      stack and into the page allocation portion of mm.  The idea is to help make
      this maintainable by placing it with other page allocation functions.
      
      Since we are moving it from skbuff.c to page_alloc.c I have also renamed
      the basic defines and structure from netdev_alloc_cache to page_frag_cache
      to reflect that this is now part of a different kernel subsystem.
      
      I have also added a simple __free_page_frag function which can handle
      freeing the frags based on the skb->head pointer.  The model for this is
      based off of __free_pages since we don't actually need to deal with all of
      the cases that put_page handles.  I incorporated the virt_to_head_page call
      and compound_order into the function as it actually allows for a significant
      size reduction by reducing code duplication.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Store virtual address instead of page in netdev_alloc_cache · 0e392508
      Alexander Duyck authored
      This change makes it so that we store the virtual address of the page
      in the netdev_alloc_cache instead of the page pointer.  The idea behind
      this is to avoid multiple calls to page_address since the virtual address
      is required for every access, but the page pointer is only needed at
      allocation or reset of the page.
      
      While I was at it I also reordered the netdev_alloc_cache structure a bit
      so that the size is always 16 bytes by dropping size in the case where
      PAGE_SIZE is greater than or equal to 32KB.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
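The idea of caching the virtual address can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct frag_cache`, `frag_alloc` and `FRAG_BUF_SIZE` are hypothetical names, `malloc` stands in for the page allocator, and the countdown offset is an assumption about how fragments are carved. The point is only that the fast path touches `va` and `offset`, and the backing buffer is looked at solely on refill.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAG_BUF_SIZE 32768u /* hypothetical size of one backing buffer */

/* Hypothetical userspace stand-in for the cache described above. */
struct frag_cache {
    void    *va;     /* cached virtual address of the backing buffer */
    uint32_t offset; /* bytes still free, counted down from the top */
};

/*
 * Carve fragsz bytes out of the cached buffer, refilling when it runs out.
 * The old buffer is deliberately not freed here: in this sketch it simply
 * stays alive, whereas a real allocator would reference count it and
 * release it once the last fragment is gone.
 */
static inline void *frag_alloc(struct frag_cache *fc, uint32_t fragsz)
{
    assert(fc != NULL);

    if (fragsz > FRAG_BUF_SIZE)
        return NULL;

    if (fc->va == NULL || fc->offset < fragsz) {
        void *buf = malloc(FRAG_BUF_SIZE); /* only place the backing store is touched */
        if (buf == NULL)
            return NULL;
        fc->va = buf;
        fc->offset = FRAG_BUF_SIZE;
    }

    /* Fast path: pure pointer arithmetic on the cached address. */
    fc->offset -= fragsz;
    return (char *)fc->va + fc->offset;
}
```

Starting from a zeroed `struct frag_cache`, two consecutive 2048-byte requests return adjacent fragments from the top of the same buffer.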
    • igb: Don't use NETDEV_FRAG_PAGE_MAX_SIZE in descriptor calculation · 2ee52ad4
      Alexander Duyck authored
      This change updates igb so that it will correctly perform the descriptor
      count calculation.  Previously it was taking NETDEV_FRAG_PAGE_MAX_SIZE
      into account which isn't really correct since a different value is used to
      determine the size of the pages used for TCP.  That is actually determined
      by SKB_FRAG_PAGE_ORDER.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Use cached copy of pfmemalloc to avoid accessing page · 9451980a
      Alexander Duyck authored
      While testing I found that the testing for pfmemalloc in build_skb was
      rather expensive.  I found the issue to be two-fold.  First we have to get
      from the virtual address to the head page and that comes at the cost of
      something like 11 cycles.  Then there is the cost for reading pfmemalloc out
      of the head page which can be cache cold due to the fact that
      put_page_testzero is likely invalidating the cache-line on one or more
      CPUs as the fragments can be shared.
      
      To avoid this extra expense I have added a pfmemalloc member to the
      netdev_alloc_cache.  I then pushed pieces of __alloc_rx_skb into
      __napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to
      make use of the cached pfmemalloc value.  The result is that my perf traces
      show a reduction from 9.28% overhead to 3.7% for the code covered by
      build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with
      the packet being dropped instead of being handed to napi_gro_receive.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 May, 2015 20 commits
  3. 10 May, 2015 14 commits
    • codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet authored
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP enabled egress port simply has a queue occupancy threshold
      above which ECT packets get CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps experimenting with DCTCP in a setup without DCTCP compliant
      fabric.
      
      In the following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      Queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are maintained to an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
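The marking rule described in this commit message (CE-mark ECT packets whose sojourn time exceeds ce_threshold) can be sketched as follows. This is an illustrative model only: `struct pkt` and `ce_threshold_mark` are hypothetical names, and the real qdisc operates on skbs and its own time units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified packet: only the fields the rule needs. */
struct pkt {
    uint64_t enqueue_ns; /* time the packet entered the queue */
    bool     ect;        /* sender negotiated ECN (ECT codepoint set) */
    bool     ce;         /* Congestion Experienced mark */
};

/*
 * Apply the threshold rule at dequeue time: if the packet waited longer
 * than ce_threshold_ns and is ECN capable, set CE instead of dropping.
 * Returns true when a mark was applied (the caller could count it, in
 * the spirit of the ce_mark counter in the tc output).
 */
static inline bool ce_threshold_mark(struct pkt *p, uint64_t now_ns,
                                     uint64_t ce_threshold_ns)
{
    assert(p != NULL && now_ns >= p->enqueue_ns);

    uint64_t sojourn_ns = now_ns - p->enqueue_ns;

    if (sojourn_ns > ce_threshold_ns && p->ect) {
        p->ce = true;
        return true;
    }
    return false;
}
```

With a 1 ms threshold, an ECT packet that waited 1.5 ms is marked, while one that waited 694 us (as in the second queue of the tc output) is left untouched, as is any non-ECT packet.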
    • ethernet: qualcomm: use spi instead of spi_device · cf9d0dcc
      Varka Bhadram authored
      All spi based drivers have an instance of struct spi_device
      as spi. This patch renames spi_device to spi to synchronize
      with all the drivers.
      Signed-off-by: Varka Bhadram <varkab@cdac.in>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'pktgen-next' · 3e3b3468
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      The following series introduce some pktgen changes
      
      Patch01:
       Cleanup my own work when I introduced NO_TIMESTAMP.
      
      Patch02:
       Took over patch from Alexei, and addressed my own concerns, as Alexei
       is too busy with other work, and this will provide an easy tool for
       measuring ingress path performance, which is a hot topic ATM.
      
       Changes were primarily user interface related.  Introduced a separate
       "xmit_mode" setting, instead of stealing one of the dev flags like
       Alexei did.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pktgen: introduce xmit_mode '<start_xmit|netif_receive>' · 62f64aed
      Alexei Starovoitov authored
      Introduce xmit_mode 'netif_receive' for pktgen which generates the
      packets using familiar pktgen commands, but feeds them into
      netif_receive_skb() instead of ndo_start_xmit().
      
      Default mode is called 'start_xmit'.
      
      It is designed to test netif_receive_skb and ingress qdisc
      performance only. Make sure to understand how it works before
      using it for other rx benchmarking.
      
      Sample script 'pktgen.sh':
      #!/bin/bash
      function pgset() {
        local result
      
        echo $1 > $PGDEV
      
        result=`cat $PGDEV | fgrep "Result: OK:"`
        if [ "$result" = "" ]; then
          cat $PGDEV | fgrep Result:
        fi
      }
      
      [ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
      ETH=$1
      
      PGDEV=/proc/net/pktgen/kpktgend_0
      pgset "rem_device_all"
      pgset "add_device $ETH"
      
      PGDEV=/proc/net/pktgen/$ETH
      pgset "xmit_mode netif_receive"
      pgset "pkt_size 60"
      pgset "dst 198.18.0.1"
      pgset "dst_mac 90:e2:ba:ff:ff:ff"
      pgset "count 10000000"
      pgset "burst 32"
      
      PGDEV=/proc/net/pktgen/pgctrl
      echo "Running... ctrl^C to stop"
      pgset "start"
      echo "Done"
      cat /proc/net/pktgen/$ETH
      
      Usage:
      $ sudo ./pktgen.sh eth2
      ...
      Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
        43033682pps 20656Mb/sec (20656167360bps) errors: 10000000
      
      Raw netif_receive_skb speed should be ~43 million packets
      per second on 3.7GHz x86 and 'perf report' should look like:
        37.69%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
        25.81%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
         7.22%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
         5.68%  kpktgend_0   [pktgen]          [k] pktgen_thread_worker
      
      If fib_table_lookup is seen on top, it means skb was processed
      by the stack. To benchmark netif_receive_skb only make sure
      that 'dst_mac' of your pktgen script is different from
      receiving device mac and it will be dropped by ip_rcv.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
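The figures on the "Result:" lines of this commit message can be cross-checked with a little arithmetic. The helpers below are a hypothetical sketch (`rate_pps` and `rate_bps` are not pktgen functions): packets per second from a packet count and an elapsed time in microseconds, and bits per second from pps and the packet size. Recomputing pps from the printed 232376 usec gives roughly 43.03 million, slightly off from the reported value, presumably because the printed elapsed time is rounded to whole microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Packets per second from a packet count and elapsed microseconds. */
static inline uint64_t rate_pps(uint64_t packets, uint64_t elapsed_usec)
{
    /* Guard the multiplication below against 64-bit overflow. */
    assert(packets <= UINT64_MAX / 1000000ULL);

    if (elapsed_usec == 0)
        return 0;
    return packets * 1000000ULL / elapsed_usec;
}

/* Bits per second: each packet carries pkt_size_bytes * 8 bits. */
static inline uint64_t rate_bps(uint64_t pps, uint32_t pkt_size_bytes)
{
    return pps * 8ULL * pkt_size_bytes;
}
```

For the sample run, `rate_bps(43033682, 60)` reproduces the 20656167360bps shown in the output, which is the 20656Mb/sec figure after dividing by one million.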
    • pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliant · f1f00d8f
      Jesper Dangaard Brouer authored
      Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
      with a negation of the flag like !NO_TIMESTAMP.
      
      Also document the option flag NO_TIMESTAMP.
      
      Fixes: afb84b62 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netns-scalability' · 4d95b72f
      David S. Miller authored
      Nicolas Dichtel says:
      
      ====================
      netns: ease netlink use with a lot of netns
      
      This idea was informally discussed in Ottawa / netdev0.1. The goal is to
      ease the use/scalability of netns, from a userland point of view.
      Today, users need to open one netlink socket per family and per netns.
      Thus, when the number of netns increases (for example 5K or more), the
      number of sockets needed to manage them grows a lot.
      
      The goal of this series is to be able to monitor netlink events, for a
      specified family, for a set of netns, with only one netlink socket. For
      this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID.
      When this option is set on a netlink socket, this socket will receive
      netlink notifications from all netns that have a nsid assigned into the
      netns where the socket has been opened.
      The nsid is sent to userland via ancillary data.
      
      Here is an example with a patched iproute2. vxlan10 is created in the
      current netns (netns0, nsid 0) and then moved to another netns (netns1,
      nsid 1):
      
      $ ip netns exec netns0 ip monitor all-nsid label
      [nsid 0][NSID]nsid 1 (iproute2 netns name: netns1)
      [nsid 0][NEIGH]??? lladdr 00:00:00:00:00:00 REACHABLE,PERMANENT
      [nsid 0][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff
      [nsid 0][LINK]Deleted 5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff
      [nsid 1][NSID]nsid 0 (iproute2 netns name: netns0)
      [nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      [nsid 1][ADDR]5: vxlan10    inet 192.168.0.249/24 brd 192.168.0.255 scope global vxlan10
             valid_lft forever preferred_lft forever
      [nsid 1][ROUTE]local 192.168.0.249 dev vxlan10  table local  proto kernel  scope host  src 192.168.0.249
      [nsid 1][ROUTE]ff00::/8 dev vxlan10  table local  metric 256  pref medium
      [nsid 1][ROUTE]2001:123::/64 dev vxlan10  proto kernel  metric 256  pref medium
      [nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      [nsid 1][ROUTE]broadcast 192.168.0.255 dev vxlan10  table local  proto kernel  scope link  src 192.168.0.249
      [nsid 1][ROUTE]192.168.0.0/24 dev vxlan10  proto kernel  scope link  src 192.168.0.249
      [nsid 1][ROUTE]broadcast 192.168.0.0 dev vxlan10  table local  proto kernel  scope link  src 192.168.0.249
      [nsid 1][ROUTE]fe80::/64 dev vxlan10  proto kernel  metric 256  pref medium
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: allow to listen "all" netns · 59324cf3
      Nicolas Dichtel authored
      More accurately, listen to all netns that have a nsid assigned into the netns
      where the netlink socket is opened.
      For this purpose, a netlink socket option is added:
      NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
      socket will receive netlink notifications from all netns that have a nsid
      assigned into the netns where the socket has been opened. The nsid is sent
      to userland via ancillary data.
      
      With this patch, a daemon needs only one socket to listen many netns. This
      is useful when the number of netns is high.
      
      Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
      the field nsid is valid or not. skb->cb is initialized to 0 on skb
      allocation, thus we are sure that we will never send a nsid 0 by error to
      the userland.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
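From userland, the option described in this commit message might be used roughly as sketched below. The helper names `enable_all_nsid` and `nsid_from_cmsg` are hypothetical. The sketch assumes the option is set at the SOL_NETLINK level and that the nsid arrives as an `int` in a control message of the same level and type; the fallback values in the `#ifndef` blocks are likewise assumptions for older headers and should be checked against the installed linux/netlink.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Fallbacks for older headers; values are assumptions, verify locally. */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_LISTEN_ALL_NSID
#define NETLINK_LISTEN_ALL_NSID 8
#endif

/*
 * Ask for notifications from every netns that has a nsid assigned in the
 * netns where this netlink socket was opened.
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static inline int enable_all_nsid(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
                      &one, sizeof(one));
}

/*
 * Walk the ancillary data of a received message and extract the nsid.
 * Returns 1 and fills *nsid when present, 0 when the message carries none.
 * A separate "found" result is needed because 0 is a valid nsid.
 */
static inline int nsid_from_cmsg(struct msghdr *msg, int *nsid)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && nsid != NULL);

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_NETLINK &&
            cmsg->cmsg_type == NETLINK_LISTEN_ALL_NSID &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(int))) {
            memcpy(nsid, CMSG_DATA(cmsg), sizeof(int));
            return 1;
        }
    }
    return 0;
}
```

A daemon would call `enable_all_nsid()` once on its single netlink socket, then pass each `struct msghdr` filled in by `recvmsg()` to `nsid_from_cmsg()` to learn which netns a notification came from.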
    • netlink: rename private flags and states · cc3a572f
      Nicolas Dichtel authored
      These flags and states have the same prefix (NETLINK_) as netlink socket
      options. To avoid confusion and to be able to name a flag like a socket
      option, let's use another prefix: NETLINK_[S|F]_.
      
      Note: a comment has been fixed, it was talking about
      NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: use a spin_lock to protect nsid management · 95f38411
      Nicolas Dichtel authored
      Before this patch, nsid were protected by the rtnl lock. The goal of this
      patch is to be able to find a nsid without needing to hold the rtnl lock.
      
      The next patch will introduce a netlink socket option to listen to all
      netns that have a nsid assigned into the netns where the socket is opened.
      Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
      avoid a recursive lock (nsid are notified via rtnl). This was the main
      reason for the previous patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: notify new nsid outside __peernet2id() · 3138dbf8
      Nicolas Dichtel authored
      There is no functional change with this patch. It will ease the refactoring
      of the locking system that protects nsids and the support of the netlink
      socket option NETLINK_LISTEN_ALL_NSID.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: rename peernet2id() to peernet2id_alloc() · 7a0877d4
      Nicolas Dichtel authored
      In a following commit, a new function will be introduced to only look up
      a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
      existing function is renamed.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: always provide the id to rtnl_net_fill() · cab3c8ec
      Nicolas Dichtel authored
      The goal of this commit is to prepare the rework of the locking that
      protects nsids.
      After this patch, rtnl_net_notifyid() will no longer call __peernet2id(),
      i.e. no idr_* operation in this function.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: returns always an id in __peernet2id() · 109582af
      Nicolas Dichtel authored
      All callers of this function expect a nsid, not an error.
      Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers
      don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
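The two ideas in this part of the series (a lookup that always returns an id or a "not assigned" sentinel, and a separate variant that allocates on a miss) can be modelled with a small userspace sketch. Everything here is hypothetical: the fixed-size table replaces the kernel's idr, `peer_to_id` and `peer_to_id_alloc` are illustrative names, the lowest-free-slot policy is an arbitrary choice, and `NSID_NOT_ASSIGNED` merely stands in for NETNSA_NSID_NOT_ASSIGNED.

```c
#include <assert.h>
#include <stddef.h>

#define NSID_NOT_ASSIGNED (-1) /* stand-in for NETNSA_NSID_NOT_ASSIGNED */
#define MAX_PEERS 8            /* hypothetical table size */

/* Hypothetical id table: the array index is the nsid. */
struct nsid_table {
    const void *peer[MAX_PEERS];
};

/*
 * Lookup only: never allocates. Callers get either a valid id or the
 * sentinel, so they have no error code to translate.
 */
static inline int peer_to_id(const struct nsid_table *t, const void *peer)
{
    assert(t != NULL && peer != NULL);

    for (int id = 0; id < MAX_PEERS; id++)
        if (t->peer[id] == peer)
            return id;
    return NSID_NOT_ASSIGNED;
}

/* Lookup, and on a miss assign the lowest free id (sketch policy). */
static inline int peer_to_id_alloc(struct nsid_table *t, const void *peer)
{
    int id = peer_to_id(t, peer);

    if (id != NSID_NOT_ASSIGNED)
        return id;

    for (id = 0; id < MAX_PEERS; id++) {
        if (t->peer[id] == NULL) {
            t->peer[id] = peer;
            return id;
        }
    }
    return NSID_NOT_ASSIGNED; /* table full */
}
```

Because 0 is a valid id in this model as well, callers compare against the sentinel rather than testing the result for truthiness.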
    • Merge tag 'linux-can-next-for-4.2-20150506' of... · 43996fdd
      David S. Miller authored
      Merge tag 'linux-can-next-for-4.2-20150506' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can-next 2015-05-06
      
      this is a pull request of seven patches for net-next/master.
      
      Andreas Gröger contributes two patches for the janz-ican3 driver. In
      the first patch, the documentation for already existing sysfs entries
      is added, the second patch adds support for another module/firmware
      variant. A patch by Shawn Landden makes the padding in the struct
      can_frame explicit. The next 4 patches target the flexcan driver, the
      first one is by David Jander adding some documentation, the remaining
      three by me add more documentation and two small code cleanups.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 09 May, 2015 2 commits