• Tejun Heo's avatar
    cgroup: Use separate src/dst nodes when preloading css_sets for migration · 07fd5b6c
    Tejun Heo authored
    Each cset (css_set) is pinned by its tasks. When we're moving tasks around
    across csets for a migration, we need to hold the source and destination
    csets to ensure that they don't go away while we're moving tasks about. This
    is done by linking cset->mg_preload_node on either the
    mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
    same cset->mg_preload_node for both the src and dst lists was deemed okay as
    a cset can't be both the source and destination at the same time.
    
    Unfortunately, this overloading becomes problematic when multiple tasks are
    involved in a migration and some of them are identity noop migrations while
    others are actually moving across cgroups. For example, this can happen with
    the following sequence on cgroup1:
    
     #1> mkdir -p /sys/fs/cgroup/misc/a/b
     #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
     #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
     #4> PID=$!
     #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
     #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
    
    the process including the group leader back into a. In this final migration,
    non-leader threads would be doing identity migration while the group leader
    is doing an actual one.
    
    After #3, let's say the whole process was in cset A, and that after #4, the
    leader moves to cset B. Then, during #6, the following happens:
    
     1. cgroup_migrate_add_src() is called on B for the leader.
    
     2. cgroup_migrate_add_src() is called on A for the other threads.
    
     3. cgroup_migrate_prepare_dst() is called. It scans the src list.
    
     4. It notices that B wants to migrate to A, so it tries to A to the dst
        list but realizes that its ->mg_preload_node is already busy.
    
     5. and then it notices A wants to migrate to A as it's an identity
        migration, it culls it by list_del_init()'ing its ->mg_preload_node and
        putting references accordingly.
    
     6. The rest of migration takes place with B on the src list but nothing on
        the dst list.
    
    This means that A isn't held while migration is in progress. If all tasks
    leave A before the migration finishes and the incoming task pins it, the
    cset will be destroyed leading to use-after-free.
    
    This is caused by overloading cset->mg_preload_node for both src and dst
    preload lists. We wanted to exclude the cset from the src list but ended up
    inadvertently excluding it from the dst list too.
    
    This patch fixes the issue by separating out cset->mg_preload_node into
    ->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
    preloadings don't interfere with each other.
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Reported-by: default avatarMukesh Ojha <quic_mojha@quicinc.com>
    Reported-by: default avatarshisiyuan <shisiyuan19870131@gmail.com>
    Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
    Link: https://www.spinics.net/lists/cgroups/msg33313.html
    Fixes: f817de98 ("cgroup: prepare migration path for unified hierarchy")
    Cc: stable@vger.kernel.org # v3.16+
    07fd5b6c
cgroup.c 176 KB