    bnx2: bnx2_tx_int() optimizations · d62fda08
    Eric Dumazet authored
    When using bnx2 in a high transmit load, bnx2_tx_int() cost is pretty high.
    
    There are two reasons.
    
    One is an expensive call to bnx2_get_hw_tx_cons(bnapi) for each freed skb.

    The other is CPU stalls when accessing skb_is_gso(skb) /
    skb_shinfo(skb)->nr_frags, because of two cache line misses
    (one to get skb->end/head to compute skb_shinfo(skb),
     one to get is_gso/nr_frags).
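
    To illustrate the two dependent loads, here is a minimal userspace sketch.
    The model_* names and the two-field layout are hypothetical stand-ins, not
    the kernel's struct sk_buff or struct skb_shared_info. It assumes the
    variant where skb->end is an offset from skb->head.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared info placed after the packet data. */
struct model_shinfo {
    unsigned short nr_frags;
    unsigned short gso_size;
};

/* Hypothetical stand-in for the skb header: only the fields used here. */
struct model_skb {
    unsigned char *head;   /* start of the data buffer */
    unsigned int end;      /* offset of the end of the data area from head */
};

/* Allocate a zeroed buffer with room for the shared info after the data. */
static int model_skb_alloc(struct model_skb *skb, unsigned int data_len)
{
    skb->head = calloc(1, data_len + sizeof(struct model_shinfo));
    skb->end = data_len;
    return skb->head ? 0 : -1;
}

static void model_skb_free(struct model_skb *skb)
{
    free(skb->head);
    skb->head = NULL;
}

/* First load: head and end live in the skb header (cache line #1). */
static struct model_shinfo *model_skb_shinfo(const struct model_skb *skb)
{
    assert(skb != NULL && skb->head != NULL);
    return (struct model_shinfo *)(skb->head + skb->end);
}

/* Second, dependent load: the fields live in the data buffer (cache line #2). */
static int model_skb_is_gso(const struct model_skb *skb)
{
    return model_skb_shinfo(skb)->gso_size != 0;
}

static unsigned short model_skb_nr_frags(const struct model_skb *skb)
{
    return model_skb_shinfo(skb)->nr_frags;
}
```

    The address of the second load is not known until the first one completes,
    so the two misses cannot overlap.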
    
    This patch :
    
    1) avoids calling bnx2_get_hw_tx_cons(bnapi) too many times.
    
    2) makes bnx2_start_xmit() cache is_gso & nr_frags into sw_tx_bd descriptor.
       This uses a little bit more ram (256 longs per device on x86), but helps a lot.
    
    3) uses a prefetch(&skb->end) to speed up dev_kfree_skb(), bringing in
       the cache line that will be needed in skb_release_data()
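
    The three points above can be sketched in a userspace model. This is not
    the driver code: the model_* names, the ring size and the policy of
    re-reading the hardware index only once the software index has caught up
    are assumptions made for illustration, and __builtin_prefetch() (a
    GCC/Clang builtin) stands in for the kernel's prefetch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 256u   /* assumed ring size for this sketch */

/* Hypothetical software descriptor: caches what the completion path needs. */
struct model_sw_tx_bd {
    void *skb;                 /* opaque packet handle */
    unsigned short is_gso;     /* cached at transmit time */
    unsigned short nr_frags;   /* cached at transmit time */
};

struct model_tx_ring {
    struct model_sw_tx_bd bd[MODEL_RING_SIZE];
    uint16_t sw_prod;              /* next free descriptor */
    uint16_t sw_cons;              /* next descriptor to reclaim */
    uint16_t hw_cons;              /* stand-in for the index the NIC reports */
    unsigned long hw_cons_reads;   /* how often hw_cons was sampled */
};

/* Stand-in for the expensive hardware index read; counts its calls. */
static uint16_t model_get_hw_tx_cons(struct model_tx_ring *r)
{
    r->hw_cons_reads++;
    return r->hw_cons;
}

/* Transmit side: record is_gso/nr_frags while the packet is still hot. */
static void model_start_xmit(struct model_tx_ring *r, void *skb,
                             unsigned short is_gso, unsigned short nr_frags)
{
    struct model_sw_tx_bd *bd = &r->bd[r->sw_prod % MODEL_RING_SIZE];

    bd->skb = skb;
    bd->is_gso = is_gso;
    bd->nr_frags = nr_frags;
    /* One descriptor for the linear part plus one per fragment. */
    r->sw_prod = (uint16_t)(r->sw_prod + 1 + nr_frags);
}

/* Completion side: returns the number of packets reclaimed. */
static int model_tx_int(struct model_tx_ring *r, int budget)
{
    uint16_t hw_cons = model_get_hw_tx_cons(r);   /* sampled once up front */
    uint16_t sw_cons = r->sw_cons;
    int tx_pkt = 0;

    while (sw_cons != hw_cons) {
        struct model_sw_tx_bd *bd = &r->bd[sw_cons % MODEL_RING_SIZE];

        assert(bd->skb != NULL);
        __builtin_prefetch(bd->skb);   /* warm the line the free path needs */

        /* Uses the cached nr_frags; the packet itself is not dereferenced. */
        sw_cons = (uint16_t)(sw_cons + 1 + bd->nr_frags);
        bd->skb = NULL;                /* a real driver would free the skb */

        if (++tx_pkt == budget)
            break;
        if (sw_cons == hw_cons)        /* caught up: only now re-read */
            hw_cons = model_get_hw_tx_cons(r);
    }
    r->sw_cons = sw_cons;
    return tx_pkt;
}
```

    In this model, reclaiming a batch of N packets samples the hardware index
    twice (once on entry, once after catching up) instead of once per packet.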
    
    The result is a 5 % bandwidth increase in benchmarks involving UDP or TCP
    receive & transmit, when a cpu is dedicated to ksoftirqd for bnx2.

    bnx2_tx_int() goes from 3.33 % cpu to 0.5 % cpu in oprofile.
    
    Note: skb_dma_unmap() is still very expensive (2.9 % of cpu, while it does
    nothing on x86_32), but this is for another patch, not related to bnx2.

    Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>