    locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
    The rtmutex code is the only user of __HAVE_ARCH_CMPXCHG, and we have a few
    other users of cmpxchg() which do not care about __HAVE_ARCH_CMPXCHG. This
    define was first introduced in 23f78d4a ("[PATCH] pi-futex: rt mutex core"),
    which went into v2.6.18. The generic cmpxchg was introduced later in 068fbad2
    ("Add cmpxchg_local to asm-generic for per cpu atomic operations"), which
    went into v2.6.25.
    Back then something was required to get rtmutex working with the fast
    path on architectures without a native cmpxchg, and this define seems to be
    the result.
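
    To illustrate, the gate in kernel/locking/rtmutex.c looks roughly like
    this (a simplified sketch, not the verbatim source):

        #if defined(__HAVE_ARCH_CMPXCHG) && !defined(CONFIG_DEBUG_RT_MUTEXES)
        /* fast path: take/release the lock with a single cmpxchg() on ->owner */
        # define rt_mutex_cmpxchg(l,c,n)	(cmpxchg(&l->owner, c, n) == c)
        #else
        /* without __HAVE_ARCH_CMPXCHG the "cmpxchg" always fails, so every
         * lock/unlock goes through the slowpath */
        # define rt_mutex_cmpxchg(l,c,n)	(0)
        #endif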
    
    It popped up recently on rt-users because ARM (v6+) does not define
    __HAVE_ARCH_CMPXCHG (even though it implements cmpxchg), which results in
    slower locking performance in the fast path.
    To put some numbers on it: preempt-RT, am335x, 10 loops of 100000
    invocations of rt_spin_lock() + rt_spin_unlock() ("total" is the average
    time of the 10 loops for the 100000 invocations, "loop" is
    "total / 100000 * 1000", i.e. ns per invocation). A sketch of the
    measurement loop follows the table:
    
         cmpxchg |    slowpath used  ||    cmpxchg used
                 |   total   | loop  ||   total    | loop
         --------|-----------|-------||------------|-------
         ARMv6   | 9129.4 us | 91 ns ||  3311.9 us |  33 ns
         generic | 9360.2 us | 94 ns || 10834.6 us | 108 ns
         ----------------------------||--------------------
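
    The measurement loop itself is essentially the following (a hypothetical
    sketch of the test, not the exact module that was used; ktime-based
    timing and the variable names are assumptions):

        static DEFINE_SPINLOCK(test_lock);
        ktime_t start;
        s64 total_us;
        int i;

        /* one of the 10 loops: time 100000 lock+unlock pairs */
        start = ktime_get();
        for (i = 0; i < 100000; i++) {
                rt_spin_lock(&test_lock);
                rt_spin_unlock(&test_lock);
        }
        total_us = ktime_us_delta(ktime_get(), start);
        /* "loop" = total_us / 100000 * 1000, i.e. ns per lock+unlock pair */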
    
    Forcing it to the generic cmpxchg() made things worse for the slowpath and
    even worse in the cmpxchg() path. It boils down to 14 ns more per lock+unlock
    in a cache-hot loop, so it might not be that much in the real world.
    The last test was a substitute for a pre-ARMv6 machine, but then I was able
    to perform the comparison on imx28, which is ARMv5 and therefore always
    uses the generic cmpxchg implementation. And the numbers:
    
                  |   total     | loop
         -------- |-----------  |--------
         slowpath | 263937.2 us | 2639 ns
         cmpxchg  |  16934.2 us |  169 ns
         --------------------------------
    
    The numbers are larger since the machine is slower in general. However,
    letting rtmutex use cmpxchg() instead of the slowpath seems to improve things.
    
    Since from the ARM point of view (tested on am335x + imx28) always using
    cmpxchg() in rt_mutex_lock() + rt_mutex_unlock() makes sense, I would drop
    the define.
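
    With the define gone, the gate would then depend only on the debug option,
    roughly:

        /* sketch of the intended end state: only CONFIG_DEBUG_RT_MUTEXES
         * disables the cmpxchg() fast path */
        #ifndef CONFIG_DEBUG_RT_MUTEXES
        # define rt_mutex_cmpxchg(l,c,n)	(cmpxchg(&l->owner, c, n) == c)
        #else
        # define rt_mutex_cmpxchg(l,c,n)	(0)
        #endif
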
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: will.deacon@arm.com
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/20150225175613.GE6823@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>