    perf_counter: record time running and time enabled for each counter · 53cfbf59
    Paul Mackerras authored
    Impact: new functionality
    
    Currently, if there are more counters enabled than can fit on the CPU,
    the kernel will multiplex the counters on to the hardware using
    round-robin scheduling.  That isn't too bad for sampling counters, but
    for counting counters it means that the value read from a counter
    represents some unknown fraction of the true count of events that
    occurred while the counter was enabled.
    
    This commit remedies the situation by keeping track of how long each
    counter is enabled, and how long it is actually on the CPU and counting
    events.  These times are recorded in nanoseconds using the task clock
    for per-task counters and the cpu clock for per-cpu counters.
    
    These values can be supplied to userspace on a read from the counter.
    Userspace requests that they be supplied after the counter value by
    setting the PERF_FORMAT_TOTAL_TIME_ENABLED and/or
    PERF_FORMAT_TOTAL_TIME_RUNNING bits in the hw_event.read_format field
    when creating the counter.  (There is no way to change the read format
    after the counter is created, though it would be possible to add some
    way to do that.)
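
As a rough userspace sketch of consuming such a read: the helper below
decodes a buffer of 64-bit words into a count plus the optional times.
The FMT_* bit values, the struct and the function name are hypothetical,
and the ordering (enabled time before running time when both are
requested) is an assumption of this sketch, not something stated above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values for this sketch; the real ones come from perf_counter.h. */
#define FMT_TOTAL_TIME_ENABLED (1u << 0)
#define FMT_TOTAL_TIME_RUNNING (1u << 1)

/* Hypothetical container for one decoded read. */
struct counter_reading {
	uint64_t count;
	uint64_t time_enabled;	/* valid only if has_enabled is set */
	uint64_t time_running;	/* valid only if has_running is set */
	int has_enabled;
	int has_running;
};

/*
 * Decode n 64-bit words read from a counter according to read_format.
 * Assumes the count comes first, followed by the requested times.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int decode_reading(const uint64_t *buf, size_t n, uint64_t read_format,
			  struct counter_reading *out)
{
	size_t i = 0;

	out->has_enabled = (read_format & FMT_TOTAL_TIME_ENABLED) != 0;
	out->has_running = (read_format & FMT_TOTAL_TIME_RUNNING) != 0;
	out->time_enabled = 0;
	out->time_running = 0;

	/* One word for the count plus one per requested time. */
	if (n < 1u + (size_t)out->has_enabled + (size_t)out->has_running)
		return -1;

	out->count = buf[i++];
	if (out->has_enabled)
		out->time_enabled = buf[i++];
	if (out->has_running)
		out->time_running = buf[i++];
	return 0;
}
```

Because the format is fixed when the counter is created, a caller would
pass the same read_format value it used at creation time.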
    
    Using this information it is possible for userspace to scale the count
    it reads from the counter to get an estimate of the true count:
    
    true_count_estimate = count * total_time_enabled / total_time_running
    
    This also lets userspace detect the situation where the counter never
    got to go on the cpu: total_time_running == 0.
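
A minimal userspace sketch of that scaling, including the
total_time_running == 0 check.  The function name is hypothetical, and
the use of double arithmetic (to avoid overflowing the 64-bit product
count * total_time_enabled) is a choice of this sketch; it limits
precision to about 53 bits.

```c
#include <stdint.h>

/*
 * Estimate the true event count for a multiplexed counter.
 * Returns 0 and stores the rounded estimate, or -1 if the counter
 * never got onto the CPU (total_time_running == 0), in which case
 * no estimate is possible.
 */
static int scale_count(uint64_t count, uint64_t total_time_enabled,
		       uint64_t total_time_running, uint64_t *estimate)
{
	double scaled;

	if (total_time_running == 0)
		return -1;

	/* count * enabled / running, computed in floating point. */
	scaled = (double)count * (double)total_time_enabled /
		 (double)total_time_running;
	*estimate = (uint64_t)(scaled + 0.5);	/* round to nearest */
	return 0;
}
```

For example, a count of 300 read from a counter that was enabled for
200 ns but running for only 100 ns gives an estimate of 600.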
    
    This functionality has been requested by the PAPI developers, and will
    be generally needed for interpreting the count values from counting
    counters correctly.
    
    In the implementation, this keeps 5 time values (in nanoseconds) for
    each counter: total_time_enabled and total_time_running are used when
    the counter is in state OFF or ERROR and for reporting back to
    userspace.  When the counter is in state INACTIVE or ACTIVE, it is the
    tstamp_enabled, tstamp_running and tstamp_stopped values that are
    relevant, and total_time_enabled and total_time_running are determined
    from them.  (tstamp_stopped is only used in INACTIVE state.)  The
    reason for doing it like this is that it means that only counters
    being enabled or disabled at sched-in and sched-out time need to be
    updated.  There are no new loops that iterate over all counters to
    update total_time_enabled or total_time_running.
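
The timestamp scheme can be illustrated with a small userspace model.
This is a sketch of one way the bookkeeping described above could work,
not the kernel code itself: the struct, the state constants and the
function names are hypothetical, and "now" stands for whichever clock
(task or cpu) applies to the counter.

```c
#include <stdint.h>

enum counter_state { ST_ERROR = -2, ST_OFF = -1, ST_INACTIVE = 0, ST_ACTIVE = 1 };

struct counter_times {
	enum counter_state state;
	uint64_t total_time_enabled;	/* authoritative in OFF/ERROR */
	uint64_t total_time_running;
	uint64_t tstamp_enabled;	/* used in INACTIVE/ACTIVE */
	uint64_t tstamp_running;
	uint64_t tstamp_stopped;	/* used in INACTIVE only */
};

/* Derive the totals from the timestamps; no-op in OFF or ERROR. */
static void update_times(struct counter_times *c, uint64_t now)
{
	uint64_t run_end;

	if (c->state != ST_INACTIVE && c->state != ST_ACTIVE)
		return;
	c->total_time_enabled = now - c->tstamp_enabled;
	run_end = (c->state == ST_INACTIVE) ? c->tstamp_stopped : now;
	c->total_time_running = run_end - c->tstamp_running;
}

/* OFF -> INACTIVE: back-date the timestamps by the totals so far.
 * Unsigned wraparound is harmless; later subtractions undo it. */
static void counter_enable(struct counter_times *c, uint64_t now)
{
	if (c->state != ST_OFF)
		return;
	c->state = ST_INACTIVE;
	c->tstamp_enabled = now - c->total_time_enabled;
	c->tstamp_running = now - c->total_time_running;
	c->tstamp_stopped = now;
}

/* INACTIVE -> ACTIVE: skip over the time spent off the CPU. */
static void counter_sched_in(struct counter_times *c, uint64_t now)
{
	if (c->state != ST_INACTIVE)
		return;
	c->tstamp_running += now - c->tstamp_stopped;
	c->state = ST_ACTIVE;
}

/* ACTIVE -> INACTIVE: remember when the counter stopped counting. */
static void counter_sched_out(struct counter_times *c, uint64_t now)
{
	if (c->state != ST_ACTIVE)
		return;
	c->tstamp_stopped = now;
	c->state = ST_INACTIVE;
}

/* INACTIVE/ACTIVE -> OFF: freeze the totals. */
static void counter_disable(struct counter_times *c, uint64_t now)
{
	update_times(c, now);
	if (c->state == ST_INACTIVE || c->state == ST_ACTIVE)
		c->state = ST_OFF;
}
```

In this model each transition touches only the counter that is changing
state; a counter that stays INACTIVE or ACTIVE needs no update until
its totals are actually read.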
    
    This also keeps separate child_total_time_running and
    child_total_time_enabled fields that get added in when reporting the
    totals to userspace.  They are separate fields so that they can be
    atomic.  We don't want to use atomics for total_time_running,
    total_time_enabled etc., because then we would have to use atomic
    sequences to update them, which are slower than regular arithmetic and
    memory accesses.
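
The split between plain and atomic fields might look roughly like the
following userspace analogue, written with C11 atomics rather than the
kernel's atomic types.  The struct and function names are hypothetical,
and the sketch assumes the plain totals are only ever updated by one
context at a time.

```c
#include <stdatomic.h>
#include <stdint.h>

struct counter_totals {
	/* Plain fields: cheap updates, assumed to have a single writer. */
	uint64_t total_time_enabled;
	uint64_t total_time_running;
	/* Atomic fields: child counters may add to these concurrently. */
	_Atomic uint64_t child_total_time_enabled;
	_Atomic uint64_t child_total_time_running;
};

/* Add a finished child's times into the parent's child totals. */
static void fold_child_times(struct counter_totals *parent,
			     uint64_t child_enabled, uint64_t child_running)
{
	atomic_fetch_add(&parent->child_total_time_enabled, child_enabled);
	atomic_fetch_add(&parent->child_total_time_running, child_running);
}

/* Values reported to userspace: own totals plus child totals. */
static uint64_t report_time_enabled(struct counter_totals *c)
{
	return c->total_time_enabled +
	       atomic_load(&c->child_total_time_enabled);
}

static uint64_t report_time_running(struct counter_totals *c)
{
	return c->total_time_running +
	       atomic_load(&c->child_total_time_running);
}
```

Only the child totals pay the cost of atomic read-modify-write
operations; the frequently updated per-counter totals stay ordinary
64-bit arithmetic.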
    
    It is possible to measure total_time_running by adding a task_clock
    counter to each group of counters, and total_time_enabled can be
    measured approximately with a top-level task_clock counter (though
    inaccuracies will creep in if you need to disable and enable groups
    since it is not possible in general to disable/enable the top-level
    task_clock counter simultaneously with another group).  However, that
    adds extra overhead - I measured around 15% increase in the context
    switch latency reported by lat_ctx (from lmbench) when a task_clock
    counter was added to each of 2 groups, and around 25% increase when a
    task_clock counter was added to each of 4 groups.  (In both cases a
    top-level task_clock counter was also added.)
    
    In contrast, the code added in this commit gives better information
    with no overhead that I could measure (in fact in some cases I
    measured lower times with this code, but the differences were all less
    than one standard deviation).
    
    [ v2: address review comments by Andrew Morton. ]
    Signed-off-by: Paul Mackerras <paulus@samba.org>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Orig-LKML-Reference: <18890.6578.728637.139402@cargo.ozlabs.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
perf_counter.h 12.4 KB