    cpumask: re-introduce constant-sized cpumask optimizations · 596ff4a0
    Linus Torvalds authored
    Commit aa47a7c2 ("lib/cpumask: deprecate nr_cpumask_bits") resulted
    in the cpumask operations potentially becoming hugely less efficient,
    because suddenly the cpumask was always considered to be variable-sized.
    
    The optimization was then later added back in a limited form by commit
    6f9c07be ("lib/cpumask: add FORCE_NR_CPUS config option"), but that
    FORCE_NR_CPUS option is not useful in a generic kernel and more of a
    special case for embedded situations with fixed hardware.
    
    Instead, just re-introduce the optimization, with some changes.
    
    Instead of depending on CPUMASK_OFFSTACK being false, and then always
    using the full constant cpumask width, this introduces three different
    cpumask "sizes":
    
     - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids.
    
       This is used for situations where we should use the exact size.
    
     - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it
       fits in a single word and the bitmap operations thus end up able
       to trigger the "small_const_nbits()" optimizations.
    
       This is used for the operations that have optimized single-word
       cases that get inlined, notably the bit find and scanning functions.
    
     - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it
       is a sufficiently small constant that makes simple "copy" and
       "clear" operations more efficient.
    
       This is arbitrarily set at four words or less.
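
    The three sizes above can be sketched in user space roughly as
    follows.  This is a minimal illustration, not the kernel header:
    the NR_CPUS default of 64, the nr_cpu_ids value of 8 and the
    ULONG_MAX-based BITS_PER_LONG definition are assumptions made for
    the demo.

```c
#include <limits.h>

/* Assumption for the sketch: derive the word size from ULONG_MAX. */
#if ULONG_MAX > 0xffffffffUL
#define BITS_PER_LONG 64
#else
#define BITS_PER_LONG 32
#endif

/* Assumed build-time CPU limit; the kernel gets this from its config. */
#ifndef NR_CPUS
#define NR_CPUS 64
#endif

/* Runtime CPU count; fixed here purely for illustration. */
static unsigned int nr_cpu_ids = 8;

/* Exact size: always the runtime value. */
#define nr_cpumask_bits nr_cpu_ids

#if NR_CPUS <= BITS_PER_LONG
/* Mask fits in one word: both sizes are compile-time constants. */
#define small_cpumask_bits ((unsigned int)NR_CPUS)
#define large_cpumask_bits ((unsigned int)NR_CPUS)
#elif NR_CPUS <= 4 * BITS_PER_LONG
/* Up to four words: only the "large" size stays constant. */
#define small_cpumask_bits nr_cpu_ids
#define large_cpumask_bits ((unsigned int)NR_CPUS)
#else
/* Bigger than that: everything falls back to the runtime value. */
#define small_cpumask_bits nr_cpu_ids
#define large_cpumask_bits nr_cpu_ids
#endif
```

    With these assumed values on a 64-bit build, the small and large
    sizes are both the constant 64, while the exact size stays at the
    runtime value of 8.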
    
    As an example of this situation, without this fixed-size optimization,
    cpumask_clear() will generate code like
    
            movl    nr_cpu_ids(%rip), %edx
            addq    $63, %rdx
            shrq    $3, %rdx
            andl    $-8, %edx
            callq   memset@PLT
    
    on x86-64, because it would calculate the "exact" number of longwords
    that need to be cleared.
    
    In contrast, with this patch, using an NR_CPUS of 64 (which is quite a
    reasonable value to use), the above becomes a single
    
    	movq $0,cpumask
    
    instruction instead, because instead of caring to figure out exactly how
    many CPUs the system has, it just knows that the cpumask will be a
    single word and can just clear it all.
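
    The difference between the two code sequences can be illustrated
    with a small user-space sketch.  The struct and the demo_* function
    names are hypothetical and only model the idea; whether a compiler
    actually reduces the constant-sized memset() to a single store
    depends on the compiler and optimization level.

```c
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical compile-time CPU limit for this sketch. */
#define DEMO_NR_CPUS 64

struct demo_cpumask {
	unsigned long bits[BITS_TO_LONGS(DEMO_NR_CPUS)];
};

/*
 * Runtime-sized clear: the byte count is computed from a variable,
 * so the compiler generally has to emit the arithmetic and a real
 * memset() call.
 */
static void demo_clear_exact(struct demo_cpumask *m, unsigned int nr_cpu_ids)
{
	memset(m->bits, 0, BITS_TO_LONGS(nr_cpu_ids) * sizeof(unsigned long));
}

/*
 * Constant-sized clear: the byte count is known at compile time, so
 * an optimizing compiler is free to turn this into plain stores.
 */
static void demo_clear_const(struct demo_cpumask *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}
```

    Both functions leave the first word of the mask zeroed; the
    constant-sized variant simply clears the whole (small) mask without
    asking how many CPUs are actually present.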
    
    Note that this does end up tightening the rules a bit from the original
    version in another way: operations that set bits in the cpumask are now
    limited to the actual nr_cpu_ids limit, whereas we used to do the
    nr_cpumask_bits thing almost everywhere in the cpumask code.
    
    But if you just clear bits, or scan for bits, we can use the simpler
    compile-time constants.
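
    A rough single-word sketch of why that split is safe: as long as
    the setters never touch bits at or above nr_cpu_ids, a scan over
    the full constant width cannot find a CPU that does not exist.
    The demo_* names and the values 32 and 6 are assumptions for
    illustration only.

```c
/* Hypothetical compile-time limit; fits in one unsigned long. */
#define DEMO_NR_CPUS 32

/* Runtime CPU count; fixed here purely for illustration. */
static unsigned int demo_nr_cpu_ids = 6;

/*
 * Setter: limited to the runtime count, so bits at or above
 * demo_nr_cpu_ids are never set.  Assumes demo_nr_cpu_ids is
 * smaller than the number of bits in an unsigned long.
 */
static void demo_setall(unsigned long *word)
{
	*word = (1UL << demo_nr_cpu_ids) - 1;
}

/*
 * Scanner: uses the compile-time width.  Returns the next set bit
 * at or after 'start', or DEMO_NR_CPUS (which is >= demo_nr_cpu_ids)
 * when there is none.
 */
static unsigned int demo_next_cpu(unsigned long word, unsigned int start)
{
	unsigned int cpu;

	for (cpu = start; cpu < DEMO_NR_CPUS; cpu++)
		if (word & (1UL << cpu))
			return cpu;
	return DEMO_NR_CPUS;
}
```

    A caller that treats any result >= demo_nr_cpu_ids as "no more
    CPUs" gets the same answer whether the scan stops at the runtime
    count or at the constant width.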
    
    In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()'
    which were not useful, and which fundamentally have to be limited to
    'nr_cpu_ids'.  Better remove them now than have somebody introduce use
    of them later.
    
    Of course, on x86-64 with MAXSMP there is no sane small compile-time
    constant for the cpumask sizes, and we end up using the actual CPU bits,
    and will generate the above kind of horrors regardless.  Please don't
    use MAXSMP unless you really expect to have machines with thousands of
    cores.
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>