    cpu/hotplug: Cure the cpusets trainwreck · b22afcdf
    Thomas Gleixner authored
    Alexey and Joshua tried to solve a cpusets related hotplug problem which is
    user space visible and results in unexpected behaviour for some time after
    a CPU has been plugged in and the corresponding uevent was delivered.
    
    cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
    workqueue. This is done because the cpusets code already has a lock
    nesting of cgroup_mutex -> cpu_hotplug_lock. A synchronous callback or
    waiting for the work to finish with cpu_hotplug_lock held can and will
    deadlock because that results in the reverse lock order.
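
    For illustration, a rough sketch (the names follow the cpuset code, but
    the bodies are condensed and only meant to show the delegation pattern)
    of why the work is bounced to a workqueue instead of being flushed:

      #include <linux/workqueue.h>

      /*
       * Worker: rebuilds the cpumasks etc. It takes cgroup_mutex and the
       * rebuild path acquires cpu_hotplug_lock (read), i.e. the lock order
       * is cgroup_mutex -> cpu_hotplug_lock.
       */
      static void cpuset_hotplug_workfn(struct work_struct *work)
      {
              /* condensed: cgroup_mutex -> ... -> cpu_hotplug_lock */
      }

      static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);

      /*
       * Called from the hotplug path with cpu_hotplug_lock held. The work
       * is only scheduled here; calling flush_work(&cpuset_hotplug_work)
       * at this point would wait for a worker which wants
       * cgroup_mutex -> cpu_hotplug_lock while cpu_hotplug_lock is already
       * held, i.e. the reverse lock order.
       */
      void cpuset_update_active_cpus(void)
      {
              schedule_work(&cpuset_hotplug_work);
      }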
    
    As a consequence the uevent can be delivered before cpusets have consistent
    state which means that a user space invocation of sched_setaffinity() to
    move a task to the plugged CPU fails up to the point where the scheduled
    work has been processed.
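
    The failure mode can be reproduced from user space with a small test
    program along these lines (a hypothetical illustration, not part of the
    kernel change): react to the CPU online uevent and immediately try to
    pin a task to the new CPU; during the window described above the
    syscall fails:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <errno.h>

      /* Try to move the calling task to the freshly onlined CPU right
       * after the uevent for it has been seen.
       */
      static int pin_to_new_cpu(int cpu)
      {
              cpu_set_t set;

              CPU_ZERO(&set);
              CPU_SET(cpu, &set);

              /* Fails (typically with EINVAL) as long as the cpuset
               * hotplug work has not run yet and the new CPU is therefore
               * not part of the effective cpumask of the task's cpuset.
               */
              if (sched_setaffinity(0, sizeof(set), &set)) {
                      fprintf(stderr, "CPU%d not usable yet: %s\n",
                              cpu, strerror(errno));
                      return -1;
              }
              return 0;
      }

      int main(int argc, char **argv)
      {
              if (argc < 2)
                      return 1;
              return pin_to_new_cpu(atoi(argv[1])) ? 1 : 0;
      }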
    
    The same is true for CPU unplug, but that does not create user observable
    failure (yet).
    
    It's still inconsistent to claim that an operation is finished before it
    actually is and that's the real issue at hand. uevents just make it
    reliably observable.
    
    Obviously the problem should be fixed in cpusets/cgroups, but untangling
    that is pretty much impossible because according to the changelog of the
    commit which introduced this 8 years ago:
    
     3a5a6d0c ("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
    
    the lock order cgroup_mutex -> cpu_hotplug_lock is a design decision and
    the whole code is built around that.
    
    So bite the bullet and invoke the relevant cpuset function, which waits for
    the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
    only when tasks are not frozen by suspend/hibernate because that would
    obviously wait forever.
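
    A sketch of where the wait ends up, assuming the helper name used here
    mirrors the actual patch; it is invoked at the end of _cpu_up() and
    _cpu_down() after cpus_write_unlock() has dropped cpu_hotplug_lock:

      #include <linux/cpuset.h>

      /*
       * Called at the end of _cpu_up()/_cpu_down() after cpus_write_unlock()
       * has dropped cpu_hotplug_lock, with cpu_add_remove_lock still held
       * by the caller.
       */
      static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
      {
              /*
               * During suspend/hibernate tasks are frozen, so the cpuset
               * worker could never run and the flush would wait forever.
               */
              if (!tasks_frozen)
                      cpuset_wait_for_hotplug();
      }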
    
    Waiting there with cpu_add_remove_lock, which is protecting the present
    and possible CPU maps, held is not a problem at all because neither work
    queues nor cpusets/cgroups have any lockchains related to that lock.
    
    Waiting in the hotplug machinery is not problematic either because there
    are already state callbacks which wait for hardware queues to drain. It
    makes the operations slightly slower, but hotplug is slow anyway.
    
    This ensures that state is consistent before returning from a hotplug
    up/down operation. It's still inconsistent during the operation, but that's
    a different story.
    
    Add a large comment which explains why this is done and why this is not a
    dumping ground for the hack of the day to work around half-thought-out
    locking schemes. Also document the implications vs. hotplug operations and
    serialization or the lack of it.
    
    Thanks to Alexey and Joshua for analyzing why this temporary
    sched_setaffinity() failure happened.
    
    Fixes: 3a5a6d0c ("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
    Reported-by: Alexey Klimov <aklimov@redhat.com>
    Reported-by: Joshua Baker <jobaker@redhat.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Alexey Klimov <aklimov@redhat.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de