1. 02 Jan, 2015 13 commits
  2. 01 Jan, 2015 7 commits
  3. 31 Dec, 2014 20 commits
    • David S. Miller's avatar
    • David S. Miller's avatar
    • David S. Miller's avatar
      Merge branch 'fib_trie-next' · e495f78d
      David S. Miller authored
      Alexander Duyck says:
      
      ====================
      fib_trie: Reduce time spent in fib_table_lookup by 35 to 75%
      
      These patches are meant to address several performance issues I have seen
      in the fib_trie implementation, and fib_table_lookup specifically.  With
      these changes in place I have seen a reduction of up to 35 to 75% for the
      total time spent in fib_table_lookup depending on the type of search being
      performed.
      
      On a VM running in my Corei7-4930K system with a trie of maximum depth of 7
      this resulted in a reduction of over 370ns per packet in the total time to
      process packets received from an ixgbe interface and route them to a dummy
      interface.  This represents a failed lookup in the local trie followed by
      a successful search in the main trie.
      
      				Baseline	Refactor
        ixgbe->dummy routing		1.20Mpps	2.21Mpps
        ------------------------------------------------------------
        processing time per packet		835ns		453ns
        fib_table_lookup		50.1%	418ns	25.0%	113ns
        check_leaf.isra.9		 7.9%	 66ns	   --	 --
        ixgbe_clean_rx_irq		 5.3%	 44ns	 9.8%	 44ns
        ip_route_input_noref		 2.9%	 25ns	 4.6%	 21ns
        pvclock_clocksource_read	 2.6%	 21ns	 4.6%	 21ns
        ip_rcv			 2.6%	 22ns	 4.0%	 18ns
      
      In the simple case of receiving a frame and dropping it before it can reach
      the socket layer I saw a reduction of 40ns per packet.  This represents a
      trip through the local trie with the correct leaf found with no need for
      any backtracing.
      
      				Baseline	Refactor
        ixgbe->local receive		2.65Mpps	2.96Mpps
        ------------------------------------------------------------
        processing time per packet		377ns		337ns
        fib_table_lookup		25.1%	 95ns	25.8%	 87ns
        ixgbe_clean_rx_irq		 8.7%	 33ns	 9.0%	 30ns
        check_leaf.isra.9		 7.2%	 27ns	   --	 --
        ip_rcv			 5.7%	 21ns	 6.5%	 22ns
      
      These changes have resulted in several functions being inlined such as
      check_leaf and fib_find_node, but due to the code simplification the
      overall size of the code has been reduced.
      
         text	   data	    bss	    dec	    hex	filename
        16932	    376	     16	  17324	   43ac	net/ipv4/fib_trie.o - before
        15259	    376	      8	  15643	   3d1b	net/ipv4/fib_trie.o - after
      
      Changes since RFC:
        Replaced this_cpu_ptr with correct call to this_cpu_inc in patch 1
        Changed test for leaf_info mismatch to (key ^ n->key) & li->mask_plen in patch 10
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e495f78d
    • Alexander Duyck's avatar
      fib_trie: Add tracking value for suffix length · 5405afd1
      Alexander Duyck authored
      This change adds a tracking value for the maximum suffix length of all
      prefixes stored in any given tnode.  With this value we can determine if we
      need to backtrace or not based on if the suffix is greater than the pos
      value.
      
      By doing this we can reduce the CPU overhead for lookups in the local table
      as many of the prefixes there are 32b long and have a suffix length of 0
      meaning we can immediately backtrace to the root node without needing to
      test any of the nodes between it and where we ended up.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5405afd1
    • Alexander Duyck's avatar
      fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child · 21d1f11d
      Alexander Duyck authored
      For some reason the compiler doesn't seem to understand that when we are in
      a loop that runs from tnode_child_length - 1 to 0 we don't expect the value
      of tn->bits to change.  As such every call to tnode_get_child was rerunning
      tnode_chile_length which ended up consuming quite a bit of space in the
      resultant assembly code.
      
      I have gone though and verified that in all cases where tnode_get_child
      is used we are either winding though a fixed loop from tnode_child_length -
      1 to 0, or are in a fastpath case where we are verifying the value by
      either checking for any remaining bits after shifting index by bits and
      testing for leaf, or by using tnode_child_length.
      
      size net/ipv4/fib_trie.o
      Before:
         text	   data	    bss	    dec	    hex	filename
        15506	    376	      8	  15890	   3e12	net/ipv4/fib_trie.o
      
      After:
         text	   data	    bss	    dec	    hex	filename
        14827	    376	      8	  15211	   3b6b	net/ipv4/fib_trie.o
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      21d1f11d
    • Alexander Duyck's avatar
      fib_trie: inflate/halve nodes in a more RCU friendly way · 12c081a5
      Alexander Duyck authored
      This change pulls the node_set_parent functionality out of put_child_reorg
      and instead leaves that to the function to take care of as well.  By doing
      this we can fully construct the new cluster of tnodes and all of the
      pointers out of it before we start routing pointers into it.
      
      I am suspecting this will likely fix some concurency issues though I don't
      have a good test to show as such.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      12c081a5
    • Alexander Duyck's avatar
      fib_trie: Push tnode flushing down to inflate/halve · fc86a93b
      Alexander Duyck authored
      This change pushes the tnode freeing down into the inflate and halve
      functions.  It makes more sense here as we have a better grasp of what is
      going on and when a given cluster of nodes is ready to be freed.
      
      I believe this may address a bug in the freeing logic as well.  For some
      reason if the freelist got to a certain size we would call
      synchronize_rcu().  I'm assuming that what they meant to do is call
      synchronize_rcu() after they had handed off that much memory via
      call_rcu().  As such that is what I have updated the behavior to be.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc86a93b
    • Alexander Duyck's avatar
      fib_trie: Push assignment of child to parent down into inflate/halve · ff181ed8
      Alexander Duyck authored
      This change makes it so that the assignment of the tnode to the parent is
      handled directly within whatever function is currently handling the node be
      it inflate, halve, or resize.  By doing this we can avoid some of the need
      to set NULL pointers in the tree while we are resizing the subnodes.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ff181ed8
    • Alexander Duyck's avatar
      fib_trie: Add functions should_inflate and should_halve · f05a4819
      Alexander Duyck authored
      This change pulls the logic for if we should inflate/halve the nodes out
      into separate functions.  It also addresses what I believe is a bug where 1
      full node is all that is needed to keep a node from ever being halved.
      
      Simple script to reproduce the issue:
      	modprobe dummy;	ifconfig dummy0 up
      	for i in `seq 0 255`; do ifconfig dummy0:$i 10.0.${i}.1/24 up; done
      	ifconfig dummy0:256 10.0.255.33/16 up
      	for i in `seq 0 254`; do ifconfig dummy0:$i down; done
      
      Results from /proc/net/fib_triestat
      Before:
      	Local:
      		Aver depth:     3.00
      		Max depth:      4
      		Leaves:         17
      		Prefixes:       18
      		Internal nodes: 11
      		  1: 8  2: 2  10: 1
      		Pointers: 1048
      	Null ptrs: 1021
      	Total size: 11  kB
      After:
      	Local:
      		Aver depth:     3.41
      		Max depth:      5
      		Leaves:         17
      		Prefixes:       18
      		Internal nodes: 12
      		  1: 8  2: 3  3: 1
      		Pointers: 36
      	Null ptrs: 8
      	Total size: 3  kB
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f05a4819
    • Alexander Duyck's avatar
      fib_trie: Move resize to after inflate/halve · cf3637bb
      Alexander Duyck authored
      This change consists of a cut/paste of resize to behind inflate and halve
      so that I could remove the two function prototypes.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cf3637bb
    • Alexander Duyck's avatar
      fib_trie: Push rcu_read_lock/unlock to callers · 345e9b54
      Alexander Duyck authored
      This change is to start cleaning up some of the rcu_read_lock/unlock
      handling.  I realized while reviewing the code there are several spots that
      I don't believe are being handled correctly or are masking warnings by
      locally calling rcu_read_lock/unlock instead of calling them at the correct
      level.
      
      A common example is a call to fib_get_table followed by fib_table_lookup.
      The rcu_read_lock/unlock ought to wrap both but there are several spots where
      they were not wrapped.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      345e9b54
    • Alexander Duyck's avatar
      fib_trie: Use unsigned long for anything dealing with a shift by bits · 98293e8d
      Alexander Duyck authored
      This change makes it so that anything that can be shifted by, or compared
      to a value shifted by bits is updated to be an unsigned long.  This is
      mostly a precaution against an insanely huge address space that somehow
      starts coming close to the 2^32 root node size which would require
      something like 1.5 billion addresses.
      
      I chose unsigned long instead of unsigned long long since I do not believe
      it is possible to allocate a 32 bit tnode on a 32 bit system as the memory
      consumed would be 16GB + 28B which exceeds the addressible space for any
      one process.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      98293e8d
    • Alexander Duyck's avatar
      fib_trie: Update meaning of pos to represent unchecked bits · e9b44019
      Alexander Duyck authored
      This change moves the pos value to the other side of the "bits" field.  By
      doing this it actually simplifies a significant amount of code in the trie.
      
      For example when halving a tree we know that the bit lost exists at
      oldnode->pos, and if we inflate the tree the new bit being add is at
      tn->pos.  Previously to find those bits you would have to subtract pos and
      bits from the keylength or start with a value of (1 << 31) and then shift
      that.
      
      There are a number of spots throughout the code that benefit from this.  In
      the case of the hot-path searches the main advantage is that we can drop 2
      or more operations from the search path as we no longer need to compute the
      value for the index to be shifted by and can instead just use the raw pos
      value.
      
      In addition the tkey_extract_bits is now defunct and can be replaced by
      get_index since the two operations were doing the same thing, but now
      get_index does it much more quickly as it is only an xor and shift versus a
      pair of shifts and a subtraction.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e9b44019
    • Alexander Duyck's avatar
      fib_trie: Optimize fib_table_insert · 836a0123
      Alexander Duyck authored
      This patch updates the fib_table_insert function to take advantage of the
      changes made to improve the performance of fib_table_lookup.  As a result
      the code should be smaller and run faster then the original.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      836a0123
    • Alexander Duyck's avatar
      fib_trie: Optimize fib_find_node · 939afb06
      Alexander Duyck authored
      This patch makes use of the same features I made use of for
      fib_table_lookup to streamline fib_find_node.  The resultant code should be
      smaller and run faster than the original.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      939afb06
    • Alexander Duyck's avatar
      fib_trie: Optimize fib_table_lookup to avoid wasting time on loops/variables · 9f9e636d
      Alexander Duyck authored
      This patch is meant to reduce the complexity of fib_table_lookup by reducing
      the number of variables to the bare minimum while still keeping the same if
      not improved functionality versus the original.
      
      Most of this change was started off by the desire to rid the function of
      chopped_off and current_prefix_length as they actually added very little to
      the function since they only applied when computing the cindex.  I was able
      to replace them mostly with just a check for the prefix match.  As long as
      the prefix between the key and the node being tested was the same we know
      we can search the tnode fully versus just testing cindex 0.
      
      The second portion of the change ended up being a massive reordering.
      Originally the calls to check_leaf were up near the start of the loop, and
      the backtracing and descending into lower levels of tnodes was later.  This
      didn't make much sense as the structure of the tree means the leaves are
      always the last thing to be tested.  As such I reordered things so that we
      instead have a loop that will delve into the tree and only exit when we
      have either found a leaf or we have exhausted the tree.  The advantage of
      rearranging things like this is that we can fully inline check_leaf since
      there is now only one reference to it in the function.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9f9e636d
    • Alexander Duyck's avatar
      fib_trie: Merge leaf into tnode · adaf9816
      Alexander Duyck authored
      This change makes it so that leaf and tnode are the same struct.  As a
      result there is no need for rt_trie_node anymore since everyting can be
      merged into tnode.
      
      On 32b systems this results in the leaf being 4 bytes larger, however I
      don't know if that is really an issue as this and an eariler patch that
      added bits & pos have increased the size from 20 to 28.  If I am not
      mistaken slub/slab allocate on power of 2 sizes so 20 was likely being
      rounded up to 32 anyway.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      adaf9816
    • Alexander Duyck's avatar
      fib_trie: Merge tnode_free and leaf_free into node_free · 37fd30f2
      Alexander Duyck authored
      Both the leaf and the tnode had an rcu_head in them, but they had them in
      slightly different places.  Since we now have them in the same spot and
      know that any node with bits == 0 is a leaf and the rest are either vmalloc
      or kmalloc tnodes depending on the value of bits it makes it easy to combine
      the functions and reduce overhead.
      
      In addition I have taken advantage of the rcu_head pointer to go ahead and
      put together a simple linked list instead of using the tnode pointer as
      this way we can merge either type of structure for freeing.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      37fd30f2
    • Alexander Duyck's avatar
      fib_trie: Make leaf and tnode more uniform · 64c9b6fb
      Alexander Duyck authored
      This change makes some fundamental changes to the way leaves and tnodes are
      constructed.  The big differences are:
      1.  Leaves now populate pos and bits indicating their full key size.
      2.  Trie nodes now mask out their lower bits to be consistent with the leaf
      3.  Both structures have been reordered so that rt_trie_node now consisists
          of a much larger region including the pos, bits, and rcu portions of
          the tnode structure.
      
      On 32b systems this will result in the leaf being 4B larger as the pos and
      bits values were added to a hole created by the key as it was only 4B in
      length.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      64c9b6fb
    • Alexander Duyck's avatar
      fib_trie: Update usage stats to be percpu instead of global variables · 8274a97a
      Alexander Duyck authored
      The trie usage stats were currently being shared by all threads that were
      calling fib_table_lookup.  As a result when multiple threads were
      performing lookups simultaneously the trie would begin to cache bounce
      between those threads.
      
      In order to prevent this I have updated the usage stats to use a set of
      percpu variables.  By doing this we should be able to avoid the cache
      bouncing and still make use of these stats.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8274a97a