    commit cf6f7853
    Author: Ingo Molnar

    the IRQ balancing feature is based on the following requirements:
    - irq handlers should be cache-affine to a large degree, without the
      explicit use of /proc/irq/*/smp_affinity.
    
    - idle CPUs should be preferred over busy CPUs when directing IRQs towards
      them.
    
    - the distribution of IRQs should be random, to avoid all IRQs going to
      the same CPU, and to prevent 'heavy' IRQs from loading certain CPUs
      unfairly relative to CPUs that handle 'light' IRQs. The IRQ system has
      no knowledge of how 'heavy' an IRQ handler is in terms of CPU cycles.
    
    here is the design and implementation:
    
    - we make per-IRQ decisions about where the IRQ will go next. Right now
      there is a fastpath and a slowpath; the real work happens in the
      slowpath, while the fastpath is very lightweight (see the sketch
      below).
    
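    [ a minimal userspace sketch of the fastpath/slowpath split described
      above; the struct and helper names (irq_balance, IRQ_TICK_PERIOD,
      irq_next_cpu, pick_new_target) and the countdown-based trigger are
      assumptions made for illustration, not the actual irq.c code: ]

        #include <stdlib.h>

        #define NR_CPUS 4
        #define IRQ_TICK_PERIOD 1000    /* interrupts handled before rebalancing */

        struct irq_balance {
                int target_cpu;         /* CPU currently receiving this IRQ */
                int count;              /* countdown to the next slowpath run */
        };

        /* stand-in for the slowpath target choice (see the move() sketch below) */
        static int pick_new_target(int cpu)
        {
                return (cpu + ((rand() & 1) ? 1 : NR_CPUS - 1)) % NR_CPUS;
        }

        /* called once per interrupt: the common case only decrements a counter */
        static int irq_next_cpu(struct irq_balance *b)
        {
                if (--b->count > 0)
                        return b->target_cpu;   /* fastpath: stay affine */

                b->count = IRQ_TICK_PERIOD;     /* 'IRQ tick' expired */
                b->target_cpu = pick_new_target(b->target_cpu); /* slowpath */
                return b->target_cpu;
        }
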
    - [ i decided not to measure IRQ handler overhead via RDTSC - it ends up
        being very messy, and if we want to be 100% fair then we also need to
        measure softirq overhead, and since there is no 1:1 relationship
        between softirq load and hardirq load, it's impossible to do
        correctly. So the IRQ balancer achieves fairness via randomness. ]
    
    - we stay affine on the micro timescale, and we load the CPUs fairly on
      the macro timescale. The IO-APIC's lowest-priority distribution method
      rotated IRQs between CPUs once per IRQ, which was the worst possible
      approach for cache-affinity.
    
    - to achieve fairness and to avoid lock-step situations some real
      randomness is needed. The IRQs wander randomly within the allowed CPU
      group, in a Brownian-motion fashion. This is what the 'move()'
      function accomplishes: the IRQ moves one step forward or one step
      backward in the allowed CPU mask (see the sketch below). [ Note that
      this achieves a level of NUMA affinity as well, since nearby CPUs are
      more likely to be NUMA-affine. ]
    
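    [ a rough standalone sketch of the random one-step walk over the
      allowed CPU mask; the move() signature, the plain unsigned long mask
      and the NR_CPUS of 8 are assumptions made for illustration, not the
      exact irq.c code: ]

        #include <stdio.h>
        #include <stdlib.h>

        #define NR_CPUS 8

        /* step one position forward or backward from 'cpu', skipping CPUs
           that are not in 'allowed_mask' and wrapping around at the ends */
        static int move(int cpu, unsigned long allowed_mask)
        {
                int step = (rand() & 1) ? 1 : -1;       /* random direction */
                int next = cpu;

                do {
                        next = (next + step + NR_CPUS) % NR_CPUS;
                } while (!(allowed_mask & (1UL << next)) && next != cpu);

                return next;
        }

        int main(void)
        {
                unsigned long allowed = 0x0f;   /* CPUs 0-3 allowed */
                int hits[NR_CPUS] = { 0 };
                int cpu = 0;

                /* simulate many slowpath steps */
                for (int i = 0; i < 100000; i++) {
                        cpu = move(cpu, allowed);
                        hits[cpu]++;
                }
                for (int i = 0; i < NR_CPUS; i++)
                        printf("cpu%d: %d\n", i, hits[i]);
                return 0;
        }

    [ over many steps the hits spread roughly evenly across the allowed
      CPUs, while consecutive slowpath moves still land on nearby CPUs. ]
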
    - the irq balancer has some knowledge of 'how idle' a single CPU is.
      The idle task updates the idle_timestamp. Since this update is in the
      idle-to-be codepath, it does not increase the latency of idle-wakeup,
      and the overhead should be zero in all cases that matter. The
      idle-balancing happens the following way: when searching for the next
      target CPU after an 'IRQ tick' has expired, we first search the 'idle
      enough' CPUs in the allowed set. If this does not succeed, we search
      all CPUs (see the sketch below).
    
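    [ a small sketch of the two-pass target search; the idle_timestamp
      array is named in the text, but IDLE_CUTOFF, pick_target() and the
      way the random choice among candidates is folded in are assumptions
      made for illustration: ]

        #include <stdlib.h>

        #define NR_CPUS 8
        #define IDLE_CUTOFF 2   /* 'idle enough': idled within 2 ticks */

        static unsigned long jiffies;                   /* current tick count */
        static unsigned long idle_timestamp[NR_CPUS];   /* set by the idle task */

        static int idle_enough(int cpu)
        {
                return jiffies - idle_timestamp[cpu] <= IDLE_CUTOFF;
        }

        /* pick the next target, preferring idle CPUs from the allowed set */
        static int pick_target(unsigned long allowed_mask)
        {
                int candidates[NR_CPUS], n = 0;

                /* first pass: 'idle enough' CPUs within the allowed set */
                for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                        if (!(allowed_mask & (1UL << cpu)))
                                continue;
                        if (idle_enough(cpu))
                                candidates[n++] = cpu;
                }

                /* second pass: no CPU was idle enough, search all CPUs */
                if (!n)
                        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                                candidates[n++] = cpu;

                return candidates[rand() % n];
        }
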
    - the patch is fully compatible with the /proc/irq/*/smp_affinity
      interface as well; everything works as expected.
    
    note that the current implementation could be expressed equivalently in
    terms of timer-interrupt-driven IRQ redirection. But i wanted to get some
    real feedback before removing the possibility of making finer-grained
    decisions - and the per-IRQ overhead is very small anyway.