• Andrew Morton's avatar
    [PATCH] ia32 IRQ distribution rework · 08f16f8f
    Andrew Morton authored
    Patch from "Kamble, Nitin A" <nitin.a.kamble@intel.com>
    
    Hello All,
    
      We were looking at the performance impact of the IRQ routing from
    the 2.5.52 Linux kernel. This email includes some of our findings
    about the way the interrupts are getting moved in the 2.5.52 kernel.
    Also there is discussion and a patch for a new implementation. Let
    me know what you think at nitin.a.kamble@intel.com
    
    Current implementation:
    ======================
    We have found that the existing implementation works well on IA32
    SMP systems with light load of interrupts. Also we noticed that it
    is not working that well under heavy interrupt load conditions on
    these SMP systems. The observations are:
    
    * Interrupt load of each IRQ is getting balanced on CPUs independent
    of load of other IRQs. Also the current implementation moves the
    IRQs randomly. This works well when the interrupt load is light. But
    we start seeing imbalance of interrupt load with existence of
    multiple heavy interrupt sources. Frequently multiple heavily loaded
    IRQs gets moved to a single CPU while other CPUs stay very lightly
    loaded. To achieve a good interrupts load balance, it is important to
    consider the load of all the interrupts together.
        This further can be explained with an example of 4 CPUs and 4
    heavy interrupt sources. With the existing random movement approach,
    the chance of each of these heavy interrupt sources moving to separate
    CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/16. It means 13/16 = 81.25% of
    the time the situation is, some CPUs are very lightly loaded and some
    are loaded with multiple heavy interrupts. This causes the interrupt
    load imbalance and results in less performance. In a case of 2 CPUs
    and 2 heavily loaded interrupt sources, this imbalance happens
    1/2 = 50% of the times. This issue becomes more and more severe with
    increasing number of heavy interrupt sources.
    
    * Another interesting observation is: We cannot see the imbalance
    of the interrupt load from /proc/interrupts. (/proc/interrupts shows
    the cumulative load of interrupts on all CPUs.) If the interrupt load
    is imbalanced and this imbalance is getting rotated among CPUs
    continuously, then /proc/interrupts will still show that the interrupt
    load is going to processors very evenly. Currently at the frequency
    (HZ/50) at which IRQs are moved across CPUs, it is not possible to
    see any interrupt load imbalance happening.
    
    * We have also found that, in certain cases the static IRQ binding
    performs better than the existing kernel distribution of interrupt
    load. The reason is, in a well-balanced interrupt load situations,
    these interrupts are unnecessarily getting frequently moved across
    CPUs. This adds an extra overhead; also it takes off the CPU cache
    warmth benefits.
      This came out from the performance measurements done on a 4-way HT
    (8 logical processors) Pentium 4 Xeon system running 8 copies of
    netperf. The 4 NICs in the system taking different IRQs generated
    sizable interrupt load with the help of connected clients.
    
    Here the netperf transactions/sec throughput numbers observed are:
    
    IRQs nicely manually bound to CPUs: 56.20K
    The current kernel implementation of IRQ movement: 50.05K
     -----------------------
     The static binding of IRQs has performed 12.28% better than the
    current IRQ movement implemented in the kernel.
    
    * The current implementation does not distinguish siblings from the
    HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to
    balance the interrupt load with respect to processor packages first,
    and then among logical CPUs inside processor packages.
      For example if we have 2 heavy interrupt sources and 2 processor
    packages (4 logical CPUs); Assigning both the heavy interrupt sources
    in different processor packages is better, it will use different
    execution resources from the different processor packages.
    
    
    
    New revised implementation:
    ==========================
    We also have been working on a new implementation. The following
    points are in main focus.
    
    * At any moment heavily loaded IRQs are distributed to different
    CPUs to achieve as much balance as possible.
    
    * Lightly loaded interrupt sources are ignored from the load
    balancing, as they do not cause considerable imbalance.
    
    * When the heavy interrupt sources are balanced, they are not moved
    around. This also helps in keeping the CPU caches warm.
    
    * It has been made HT aware. While distributing the load, the load
    on a processor package to which the logical CPUs belong to is also
    considered.
    
    * In the situations of few (lesser than num_cpus) heavy interrupt
    sources, it is not possible to balance them evenly. In such case
    the existing code has been reused to move the interrupts. The
    randomness from the original code has been removed.
    
    * The time interval for redistribution has been made flexible. It
    varies as the system interrupt load changes.
    
    * A new kernel_thread is introduced to do the load balancing
    calculations for all the interrupt sources. It keeps the balanace_maps
    ready for interrupt handlers, keeping the overhead in the interrupt
    handling to minimum.
    
    * It allows the disabling of the IRQ distribution from the boot loader
    command line, if anybody wants to do it for any reason.
    
    * The algorithm also takes into account the static binding of
    interrupts to CPUs that user imposes from the
    /proc/irq/{n}/smp_affinity interface.
    
    
    Throughput numbers with the netperf setup for the new implementation:
    
    Current kernel IRQ balance implementation: 50.02K transactions/sec
    The new IRQ balance implementation: 56.01K transactions/sec
     ---------------------
      The performance improvement on P4 Xeon of 11.9% is observed.
    
    The new IRQ balance implementation also shows little performance
    improvement on P6 (Pentium II, III) systems.
    
    On a P6 system the netperf throughput numbers are:
    Current kernel IRQ balance implementation: 36.96K transactions/sec
    The new IRQ balance implementation: 37.65K transactions/sec
     ---------------------
    Here the performance improvement on P6 system of about 2% is observed.
    
    
     ---------------------
    
    Andrew Theurer <habanero@us.ibm.com> did some testing of this patch on a quad
    P4:
    
    
    I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2
    kernel.  NetBench measures SMB/CIFS performance by using several SMB
    clients  (in this case 44 Windows 2000 systems), sending SMB requests to a
    Linux  server running Samba 2.2.3a+sendfile.  Result is in throughput,
    Mbps.   Generally the network traffic on the server is 60% recv, 40% tx.
    
    I believe we have very similar systems.  Mine is a 4 x 1.6 GHz, 1 MB L3 P4
    Xeon with 4 GB DDR memory (3.2 GB/sec I believe).  The chipset is "Summit".
     I also have more than one Intel e1000 adapters.
    
    I decided to run a few configurations, first with just one adapter, with
    and  without HT support in the kernel (acpi=off), then add another adapter
    and  test again with/without HT.
    
    Here are the results:
    
    4P, no HT, 1 x e1000, no kirq:	1214 Mbps, 4% idle
    4P, no HT, 1 x e1000, kirq:		1223 Mbps, 4% idle,		+0.74%
    
    I suppose we didn't see much of an improvement here because we never run
    into  the situation where more than one interrupt with a high rate is
    routed to a  single CPU on irq_balance.
    
    4P, HT, 1 x e1000, no kirq:	1214 Mbps, 25% idle
    4P, HT, 1 x e1000, kirq:	1220 Mbps, 30% idle,			+0.49%
    
    Again, not much of a difference just yet, but lots of idle time.  We may
    have  reached the limit at which one logical CPU can process interrupts for
    an  e1000 adapter.  There are other things I can probably do to help this,
    like  int delay, and NAPI, which I will get to eventually.
    
    4P, HT, 2 x e1000, no kirq:	1269 Mbps, 23% idle
    4P, HT, 2 x e1000, kirq:	1329 Mbps, 18% idle			+4.7%
    
    OK, almost 5% better!  Probably has to do with a couple of things; the fact
    that your code does not route two different interrupts to the same
    core/different logical cpus (quite obvious by looking at /proc/interrupts),
    and that more than one interrupt does not go to the same cpu if possible.
    I  suspect irq_balance did some of those [bad] things some of the time, and
    we  observed a bottleneck in int processing that was lower than with kirq.
    
    I don't think all of the idle time is because of a int processing
    bottleneck.   I'm just not sure what it is yet :)  Hopefully something will
    become obvious  to me...
    
    Overall I like the way it works, and I believe it can be tweaked to work
    with  NUMA when necessary.  I hope to have access to a specweb system on a
    NUMA box  soon, so we can verify that.
    08f16f8f
kernel-parameters.txt 28.5 KB