    [PATCH] profile: 512x Altix timer interrupt livelock fix · ad02973d
    William Lee Irwin III authored
    I've been informed that /proc/profile livelocks some systems in the timer
    interrupt, usually at boot.  The following patch attempts to amortize the
    atomic operations done on the profile buffer to address this stability
    concern.  This patch has nothing to do with performance; kernels using
    periodic timer interrupts are under realtime constraints to complete
    whatever work they perform within timer interrupts before the next timer
    interrupt arrives lest they livelock, performing no work whatsoever apart
    from servicing timer interrupts.  The latency of the cacheline bounce for
    prof_buffer contributes to the time spent in the timer interrupt, hence it
    must be amortized when remote access latencies or deviations from fair
    exclusive cacheline acquisition may cause cacheline bounces to take longer
    than the interval between timer ticks.
    
    What this patch does is create a pair of per-cpu open-addressed
    hashtables, indexed by profile buffer slot, whose entries hold the number
    of pending profile buffer hits for that slot.  When a hashtable overflows,
    its pairs of profile buffer slots and hit counts are iterated over and
    accounted to the global profile buffer.  Since zero is a legitimate
    profile buffer slot, a zero hit count (rather than a zero slot number) is
    what marks an unused hashtable entry.  The hashtable is furthermore
    protected from flush IPIs by interrupt disablement.
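
    As a rough sketch of the per-cpu bookkeeping described above, assuming
    the rest of profile.c's context (the names NR_PROFILE_HIT,
    cpu_profile_hits, cpu_profile_flip, and profile_note_hit() are
    illustrative assumptions, and the probing scheme in the actual code may
    differ), the hit path looks roughly like:

    	struct profile_hit {
    		u32 pc, hits;		/* buffer slot and pending hit count */
    	};
    	#define NR_PROFILE_HIT	(PAGE_SIZE / sizeof(struct profile_hit))

    	static DEFINE_PER_CPU(int, cpu_profile_flip);
    	static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits);

    	/* Called from the timer interrupt with interrupts disabled. */
    	static void profile_note_hit(unsigned long slot)
    	{
    		int flip = __get_cpu_var(cpu_profile_flip);
    		struct profile_hit *hits = __get_cpu_var(cpu_profile_hits)[flip];
    		int i;

    		/* Open addressing: probe from the slot's home bucket. */
    		for (i = 0; i < NR_PROFILE_HIT; i++) {
    			int j = (slot + i) & (NR_PROFILE_HIT - 1);

    			if (hits[j].hits && hits[j].pc == slot) {
    				hits[j].hits++;	/* existing entry: just count */
    				return;
    			}
    			if (!hits[j].hits) {	/* zero hits marks a free entry */
    				hits[j].pc = slot;
    				hits[j].hits = 1;
    				return;
    			}
    		}

    		/*
    		 * Table full: account every pending (slot, count) pair to
    		 * the global buffer, then record the current hit directly.
    		 */
    		for (i = 0; i < NR_PROFILE_HIT; i++) {
    			if (hits[i].hits)
    				atomic_add(hits[i].hits, &prof_buffer[hits[i].pc]);
    			hits[i].pc = hits[i].hits = 0;
    		}
    		atomic_inc(&prof_buffer[slot]);
    	}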
    
    In order to flush the pending profile hits for read_profile(), this patch
    flips between the pairs of per-cpu hashtables by signalling all cpus to
    flip via IPI at the time of read_profile(), and then does all the work of
    flushing the profile hits from the older per-cpu hashtables in the context
    of the caller of read_profile().  Exclusion is provided by a semaphore
    ensuring that only one caller of profile_flip_buffers() may execute at a
    time, and interrupt disablement prevents buffer flip IPIs from altering
    the hashtables or flip state while an update is in progress.  The flip
    state is per-cpu so that remote cpus need only disable interrupts locally
    for synchronization, which is both simple and busywait-free for remote
    cpus.  The flip states all change in tandem when some cpu requests that
    the hashtables be flipped, and the requester waits for the completion of
    smp_call_function() for notification that all cpus have finished flipping
    between their hashtables.  The IPI handler merely toggles the flip state
    (which is an array index) between 0 and 1.
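
    A minimal sketch of that sequence, again with illustrative names (the
    profile_flip_mutex identifier and the use of on_each_cpu(), which wraps
    smp_call_function() plus a local call, are assumptions of this sketch
    rather than a description of the exact code):

    	static DECLARE_MUTEX(profile_flip_mutex);

    	/*
    	 * IPI handler: runs on each cpu with interrupts disabled and merely
    	 * toggles which of the two hashtables the timer interrupt fills.
    	 */
    	static void __profile_flip_buffers(void *unused)
    	{
    		int cpu = smp_processor_id();

    		per_cpu(cpu_profile_flip, cpu) = !per_cpu(cpu_profile_flip, cpu);
    	}

    	/* Called from read_profile(); runs in the reader's context. */
    	static void profile_flip_buffers(void)
    	{
    		int i, cpu;

    		down(&profile_flip_mutex);	/* one flusher at a time */
    		on_each_cpu(__profile_flip_buffers, NULL, 0, 1);
    		for_each_online_cpu(cpu) {
    			struct profile_hit *hits = per_cpu(cpu_profile_hits,
    				cpu)[!per_cpu(cpu_profile_flip, cpu)];

    			/* Drain the now-idle hashtable into the global buffer. */
    			for (i = 0; i < NR_PROFILE_HIT; i++) {
    				if (!hits[i].hits)
    					continue;
    				atomic_add(hits[i].hits, &prof_buffer[hits[i].pc]);
    				hits[i].pc = hits[i].hits = 0;
    			}
    		}
    		up(&profile_flip_mutex);
    	}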
    
    This is expected to be a much stronger amortization than merely reducing
    the frequency of profile buffer access by a factor of the size of the
    hashtable, because numerous hits may be held for each of its entries.
    Before the patch, every profile hit was an atomic increment of the global
    buffer, so each flush interval would have cost a number of atomic
    increments equal to the sum of the hits held for each entry in the
    hashtable; after the patch, the same interval costs a number of
    atomic_add()'s equal to the number of entries in the per-cpu hashtable.
    The savings are nondeterministic, but as the profile hits tend to be
    concentrated in a very small number of profile buffer slots during any
    given timing interval, each entry is likely to absorb a very large number
    of atomic increments.  This amortization of atomic increments does not
    depend on the hash function, only on the sharp peakedness of the
    distribution of profile buffer hits.
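
    Put as an illustrative relation (using the NR_PROFILE_HIT name from the
    sketch above, which is an assumption of that sketch rather than a
    measured quantity):

    	amortization factor = (hits accumulated since the last flush)
    	                      / (atomic_add()'s performed by the flush)
    	                    = (hits accumulated since the last flush)
    	                      / NR_PROFILE_HIT

    so the more sharply peaked the hit distribution, the more hits accumulate
    in existing entries before the table fills, and the larger the factor
    becomes.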
    
    This algorithm has two advantages over full-size per-cpu profile buffers.
    The first is that the space footprint is much smaller.  Per-cpu profile
    buffers would increase the space requirements by a factor of
    num_online_cpus(), whereas this algorithm only requires one page per cpu.
    The second is that reading the profile state is much faster, because the
    state that must be traversed is exactly the space described above, and the
    relative reduction in size concomitantly reduces the time required for a
    read operation.
    
    I also took the liberty of adding some commentary to the comments at the
    beginning of the file reflecting the major work done on profile.c in recent
    months and describing what the file implements.
    
    The reporters of this issue have verified that this resolves their timer
    interrupt livelock on 512x Altixen.  In my own testing on 4x logical
    x86-64, this patch saw a rate of about 18 flushes per minute under load, or
    about one flush every 3 seconds, for about 38.4 atomic accesses to the
    profile buffer per second per cpu in one of the algorithm's worst cases,
    about 3.84% of the number of atomic profile buffer accesses per second per
    cpu as a normal kernel would commit.  This represents a twenty-six-fold
    increase in the scalability on SMP systems with 4KB PAGE_SIZE, i.e.  with a
    4KB PAGE_SIZE, the number of atomic profile buffer accesses per second per
    cpu is reduced by a factor of 26, thereby increasing the number of cpus a
    system must have before it would experience a timer interrupt livelock by a
    factor of 26, with the proviso that cacheline bounces must take the same
    amount of time to service.  This increase in the scalability of the kernel
    is expected to be much larger for ia64, which has a large PAGE_SIZE,
    because the distribution of profile buffer hits is so sharply peaked that
    doubling the hashtable size will much more than double the amortization
    factor.  In fact, only 19 flushes were observed on a 64x Altix over an
    approximately 10 minute AIM7 run, and 1 flush on a 512x Altix over the
    course of an entire AIM7 run, for truly vast effective amortization
    factors.
    
    A prior version of this patch, which did not include the node-local
    hashtable allocation and bounded collision chains, has been successfully
    tested on 64x and 512x ia64 vs. 2.6.9-rc2, 8x ia64 vs. 2.6.9-rc2-mm1, 4x
    x86-64 vs. 2.6.9-rc2-mm1, and 6x sparc64 vs. 2.6.9-rc2-mm1.  This patch
    minus the hashtable initialization fix has been successfully tested on 2x
    ppc64, 2x alpha, 8x ia64, 6x sparc64, and 4x x86-64, all vs.
    2.6.9-rc2-mm1.  This precise version of the patch has been successfully
    tested on 8x ia32 against 2.6.9-rc2-mm1 and 6x sparc64 vs. both
    2.6.9-rc2-mm1 and 2.6.9-rc2-mm2.
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>