1. 05 Feb, 2020 4 commits
    • x86/resctrl: Fix a deadlock due to inaccurate reference · cc071b7c
      Xiaochen Shen authored
      commit 334b0f4e upstream.
      
      There is a race condition which results in a deadlock when rmdir and
      mkdir execute concurrently:
      
      $ ls /sys/fs/resctrl/c1/mon_groups/m1/
      cpus  cpus_list  mon_data  tasks
      
      Thread 1: rmdir /sys/fs/resctrl/c1
      Thread 2: mkdir /sys/fs/resctrl/c1/mon_groups/m1
      
      3 locks held by mkdir/48649:
       #0:  (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
       #1:  (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c13b>] filename_create+0x7b/0x170
       #2:  (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70
      
      4 locks held by rmdir/48652:
       #0:  (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
       #1:  (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c3cf>] do_rmdir+0x13f/0x1e0
       #2:  (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffb4c86d5d>] vfs_rmdir+0x4d/0x120
       #3:  (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70
      
      Thread 1 is deleting control group "c1". Holding rdtgroup_mutex,
      kernfs_remove() removes all kernfs nodes under directory "c1"
      recursively, then waits for sub kernfs node "mon_groups" to drop active
      reference.
      
      Thread 2 is trying to create a subdirectory "m1" in the "mon_groups"
      directory. The wrapper kernfs_iop_mkdir() takes an active reference to
      the "mon_groups" directory but the code drops the active reference to
      the parent directory "c1" instead.
      
      As a result, Thread 1 is blocked waiting for the active reference to
      drop and never releases rdtgroup_mutex, while Thread 2 is blocked
      trying to acquire rdtgroup_mutex.
      
      Thread 1 (rdtgroup_rmdir)   Thread 2 (rdtgroup_mkdir)
      (rmdir /sys/fs/resctrl/c1)  (mkdir /sys/fs/resctrl/c1/mon_groups/m1)
      -------------------------   -------------------------
                                  kernfs_iop_mkdir
                                    /*
                                     * kn: "m1", parent_kn: "mon_groups",
                                     * prgrp_kn: parent_kn->parent: "c1",
                                     *
                                     * "mon_groups", parent_kn->active++: 1
                                     */
                                    kernfs_get_active(parent_kn)
      kernfs_iop_rmdir
        /* "c1", kn->active++ */
        kernfs_get_active(kn)
      
        rdtgroup_kn_lock_live
          atomic_inc(&rdtgrp->waitcount)
          /* "c1", kn->active-- */
          kernfs_break_active_protection(kn)
          mutex_lock
      
        rdtgroup_rmdir_ctrl
          free_all_child_rdtgrp
            sentry->flags = RDT_DELETED
      
          rdtgroup_ctrl_remove
            rdtgrp->flags = RDT_DELETED
            kernfs_get(kn)
            kernfs_remove(rdtgrp->kn)
              __kernfs_remove
                /* "mon_groups", sub_kn */
                atomic_add(KN_DEACTIVATED_BIAS, &sub_kn->active)
                kernfs_drain(sub_kn)
                  /*
                   * sub_kn->active == KN_DEACTIVATED_BIAS + 1,
                   * waiting on sub_kn->active to drop, but it
                   * never drops in Thread 2 which is blocked
                   * on getting rdtgroup_mutex.
                   */
      Thread 1 hangs here ---->
                  wait_event(sub_kn->active == KN_DEACTIVATED_BIAS)
                  ...
                                    rdtgroup_mkdir
                                      rdtgroup_mkdir_mon(parent_kn, prgrp_kn)
                                        mkdir_rdt_prepare(parent_kn, prgrp_kn)
                                          rdtgroup_kn_lock_live(prgrp_kn)
                                            atomic_inc(&rdtgrp->waitcount)
                                            /*
                                             * "c1", prgrp_kn->active--
                                             *
                                             * The active reference on "c1" is
                                             * dropped, but not matching the
                                             * actual active reference taken
                                             * on "mon_groups", thus causing
                                             * Thread 1 to wait forever while
                                             * holding rdtgroup_mutex.
                                             */
                                            kernfs_break_active_protection(
                                                                     prgrp_kn)
                                            /*
                                             * Trying to get rdtgroup_mutex
                                             * which is held by Thread 1.
                                             */
      Thread 2 hangs here ---->             mutex_lock
                                            ...
      
      The problem is that the creation of a subdirectory in the "mon_groups"
      directory incorrectly releases the active protection of its parent
      directory instead of itself before it starts waiting for rdtgroup_mutex.
      This is triggered by the rdtgroup_mkdir() flow calling
      rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() with the kernfs node of the
      parent control group ("c1") as argument. They should be called with the
      kernfs node of "mon_groups" instead. What is currently missing is that
      kn->priv of "mon_groups" is NULL instead of pointing to the rdtgrp, so
      it cannot be used for this yet.
      
      Fix it by pointing kn->priv to rdtgrp when "mon_groups" is created. It
      can then be passed to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
      instead, which operate on the same rdtgroup structure but take and drop
      the active reference of the "mon_groups" kernfs node, preventing the
      deadlock.
      The same changes are also made to the "mon_data" directories.
      
      This results in some unused function parameters that will be cleaned up
      in a follow-up patch; the focus here is on the fix only, in support of
      backporting efforts.
      
      Backporting notes:
      
      Since upstream commit fa7d9493 ("x86/resctrl: Rename and move rdt
      files to a separate directory"), the file
      arch/x86/kernel/cpu/intel_rdt_rdtgroup.c has been renamed and moved to
      arch/x86/kernel/cpu/resctrl/rdtgroup.c.
      Apply the change against file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
      for older stable trees.
      
      Fixes: c7d9aac6 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
      Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1578500886-21771-4-git-send-email-xiaochen.shen@intel.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup · 95a41c7b
      Xiaochen Shen authored
      commit 074fadee upstream.
      
      There is a race condition in the following scenario which results in a
      use-after-free issue when reading a monitoring file and deleting the
      parent ctrl_mon group concurrently:
      
      Thread 1 calls atomic_inc() to take a refcount of rdtgrp and then calls
      kernfs_break_active_protection() to drop the active reference of the
      kernfs node in rdtgroup_kn_lock_live().
      
      In Thread 2, kernfs_remove() is a blocking routine. It waits on all sub
      kernfs nodes to drop the active reference when removing all subtree
      kernfs nodes recursively. Thread 2 could block on kernfs_remove() until
      Thread 1 calls kernfs_break_active_protection(). Only after
      kernfs_remove() completes can the refcount of rdtgrp be trusted.
      
      Before Thread 1 calls atomic_inc() and kernfs_break_active_protection(),
      Thread 2 could call kfree() when the refcount of rdtgrp (sentry) is 0
      instead of 1 due to the race.
      
      In Thread 1, in rdtgroup_kn_unlock(), referring to earlier rdtgrp memory
      (rdtgrp->waitcount) which was already freed in Thread 2 results in a
      use-after-free issue.
      
      Thread 1 (rdtgroup_mondata_show)  Thread 2 (rdtgroup_rmdir)
      --------------------------------  -------------------------
      rdtgroup_kn_lock_live
        /*
         * kn active protection until
         * kernfs_break_active_protection(kn)
         */
        rdtgrp = kernfs_to_rdtgroup(kn)
                                        rdtgroup_kn_lock_live
                                          atomic_inc(&rdtgrp->waitcount)
                                          mutex_lock
                                        rdtgroup_rmdir_ctrl
                                          free_all_child_rdtgrp
                                            /*
                                             * sentry->waitcount should be 1
                                             * but is 0 now due to the race.
                                             */
                                            kfree(sentry)*[1]
        /*
         * Only after kernfs_remove()
         * completes, the refcount of
         * rdtgrp could be trusted.
         */
        atomic_inc(&rdtgrp->waitcount)
        /* kn->active-- */
        kernfs_break_active_protection(kn)
                                          rdtgroup_ctrl_remove
                                            rdtgrp->flags = RDT_DELETED
                                            /*
                                             * Blocking routine, wait for
                                             * all sub kernfs nodes to drop
                                             * active reference in
                                             * kernfs_break_active_protection.
                                             */
                                            kernfs_remove(rdtgrp->kn)
                                        rdtgroup_kn_unlock
                                          mutex_unlock
                                          atomic_dec_and_test(
                                                      &rdtgrp->waitcount)
                                          && (flags & RDT_DELETED)
                                            kernfs_unbreak_active_protection(kn)
                                            kfree(rdtgrp)
        mutex_lock
      mon_event_read
      rdtgroup_kn_unlock
        mutex_unlock
        /*
         * Use-after-free: refer to earlier rdtgrp
         * memory which was freed in [1].
         */
        atomic_dec_and_test(&rdtgrp->waitcount)
        && (flags & RDT_DELETED)
          /* kn->active++ */
          kernfs_unbreak_active_protection(kn)
          kfree(rdtgrp)
      
      Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in
      rdtgroup_rmdir_ctrl() to ensure it sees an accurate refcount of rdtgrp.
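
      A simplified sketch of that reordering in rdtgroup_rmdir_ctrl() (not the
      verbatim patch; the argument lists are assumed):

        rdtgroup_ctrl_remove(kn, rdtgrp);  /* kernfs_remove() drains active refs */
        free_all_child_rdtgrp(rdtgrp);     /* child waitcounts can now be trusted */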
      
      Backporting notes:
      
      Since upstream commit fa7d9493 ("x86/resctrl: Rename and move rdt
      files to a separate directory"), the file
      arch/x86/kernel/cpu/intel_rdt_rdtgroup.c has been renamed and moved to
      arch/x86/kernel/cpu/resctrl/rdtgroup.c.
      Apply the change against file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
      for older stable trees.
      
      Fixes: f3cbeaca ("x86/intel_rdt/cqm: Add rmdir support")
      Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1578500886-21771-3-git-send-email-xiaochen.shen@intel.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/resctrl: Fix use-after-free when deleting resource groups · 1b006f8c
      Xiaochen Shen authored
      commit b8511ccc upstream.
      
      A resource group (rdtgrp) contains a reference count (rdtgrp->waitcount)
      that indicates how many waiters expect this rdtgrp to exist. Waiters
      could be waiting on rdtgroup_mutex or some work sitting on a task's
      workqueue for when the task returns from kernel mode or exits.
      
      The deletion of a rdtgrp is intended to have two phases:
      
        (1) while holding rdtgroup_mutex the necessary cleanup is done and
        rdtgrp->flags is set to RDT_DELETED,
      
        (2) after releasing the rdtgroup_mutex, the rdtgrp structure is freed
        only if there are no waiters and its flag is set to RDT_DELETED. Upon
        gaining access to rdtgroup_mutex or rdtgrp, a waiter is required to
        check for the RDT_DELETED flag, as in the sketch after this list.
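
      A simplified sketch of phase (2) as performed by a waiter, e.g. in
      rdtgroup_kn_unlock() (illustration only, not the verbatim kernel code):

        mutex_unlock(&rdtgroup_mutex);
        if (atomic_dec_and_test(&rdtgrp->waitcount) &&
            (rdtgrp->flags & RDT_DELETED)) {
                /* last waiter, and the group was marked deleted: free it */
                kfree(rdtgrp);
        }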
      
      When unmounting the resctrl file system or deleting ctrl_mon groups,
      all of the subdirectories are removed and the data structure of rdtgrp
      is forcibly freed without checking rdtgrp->waitcount. If at this point
      there was a waiter on rdtgrp then a use-after-free issue occurs when the
      waiter starts running and accesses the rdtgrp structure it was waiting
      on.
      
      See the kfree() calls at [1], [2] and [3] in these two call paths in the
      following scenarios:
      (1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp()
      (2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()
      
      There are several scenarios that result in a use-after-free issue, as
      described in the following:
      
      Scenario 1:
      -----------
      In Thread 1, rdtgroup_tasks_write() adds a task_work callback
      move_myself(). If move_myself() is scheduled to execute after Thread 2
      rdt_kill_sb() is finished, referring to earlier rdtgrp memory
      (rdtgrp->waitcount) which was already freed in Thread 2 results in
      a use-after-free issue.
      
      Thread 1 (rdtgroup_tasks_write)        Thread 2 (rdt_kill_sb)
      -------------------------------        ----------------------
      rdtgroup_kn_lock_live
        atomic_inc(&rdtgrp->waitcount)
        mutex_lock
      rdtgroup_move_task
        __rdtgroup_move_task
          /*
           * Take an extra refcount, so rdtgrp cannot be freed
           * before the call back move_myself has been invoked
           */
          atomic_inc(&rdtgrp->waitcount)
          /* Callback move_myself will be scheduled for later */
          task_work_add(move_myself)
      rdtgroup_kn_unlock
        mutex_unlock
        atomic_dec_and_test(&rdtgrp->waitcount)
        && (flags & RDT_DELETED)
                                             mutex_lock
                                             rmdir_all_sub
                                               /*
                                                * sentry and rdtgrp are freed
                                                * without checking refcount
                                                */
                                               free_all_child_rdtgrp
                                                 kfree(sentry)*[1]
                                               kfree(rdtgrp)*[2]
                                             mutex_unlock
      /*
       * Callback is scheduled to execute
       * after rdt_kill_sb is finished
       */
      move_myself
        /*
         * Use-after-free: refer to earlier rdtgrp
         * memory which was freed in [1] or [2].
         */
        atomic_dec_and_test(&rdtgrp->waitcount)
        && (flags & RDT_DELETED)
          kfree(rdtgrp)
      
      Scenario 2:
      -----------
      In Thread 1, rdtgroup_tasks_write() adds a task_work callback
      move_myself(). If move_myself() is scheduled to execute after Thread 2
      rdtgroup_rmdir() is finished, referring to earlier rdtgrp memory
      (rdtgrp->waitcount) which was already freed in Thread 2 results in
      a use-after-free issue.
      
      Thread 1 (rdtgroup_tasks_write)        Thread 2 (rdtgroup_rmdir)
      -------------------------------        -------------------------
      rdtgroup_kn_lock_live
        atomic_inc(&rdtgrp->waitcount)
        mutex_lock
      rdtgroup_move_task
        __rdtgroup_move_task
          /*
           * Take an extra refcount, so rdtgrp cannot be freed
           * before the call back move_myself has been invoked
           */
          atomic_inc(&rdtgrp->waitcount)
          /* Callback move_myself will be scheduled for later */
          task_work_add(move_myself)
      rdtgroup_kn_unlock
        mutex_unlock
        atomic_dec_and_test(&rdtgrp->waitcount)
        && (flags & RDT_DELETED)
                                             rdtgroup_kn_lock_live
                                               atomic_inc(&rdtgrp->waitcount)
                                               mutex_lock
                                             rdtgroup_rmdir_ctrl
                                               free_all_child_rdtgrp
                                                 /*
                                                  * sentry is freed without
                                                  * checking refcount
                                                  */
                                                 kfree(sentry)*[3]
                                               rdtgroup_ctrl_remove
                                                 rdtgrp->flags = RDT_DELETED
                                             rdtgroup_kn_unlock
                                               mutex_unlock
                                               atomic_dec_and_test(
                                                           &rdtgrp->waitcount)
                                               && (flags & RDT_DELETED)
                                                 kfree(rdtgrp)
      /*
       * Callback is scheduled to execute
       * after rdtgroup_rmdir is finished
       */
      move_myself
        /*
         * Use-after-free: refer to earlier rdtgrp
         * memory which was freed in [3].
         */
        atomic_dec_and_test(&rdtgrp->waitcount)
        && (flags & RDT_DELETED)
          kfree(rdtgrp)
      
      If CONFIG_DEBUG_SLAB=y, slab corruption on kmalloc-2k can be observed
      as in the following dumps. Note that "0x6b" is POISON_FREE after kfree().
      The corrupted bytes "0x6a" and "0x64" at offset 0x424 correspond to the
      waitcount member of struct rdtgroup which was freed:
      
        Slab corruption (Not tainted): kmalloc-2k start=ffff9504c5b0d000, len=2048
        420: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkjkkkkkkkkkkk
        Single bit error detected. Probably bad RAM.
        Run memtest86+ or a similar memory test tool.
        Next obj: start=ffff9504c5b0d800, len=2048
        000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
        010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
      
        Slab corruption (Not tainted): kmalloc-2k start=ffff9504c58ab800, len=2048
        420: 6b 6b 6b 6b 64 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkdkkkkkkkkkkk
        Prev obj: start=ffff9504c58ab000, len=2048
        000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
        010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
      
      Fix this by taking the reference count (waitcount) of rdtgrp into account
      in the two call paths that currently do not do so:

        (1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp()
        (2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()

      Instead of always freeing the resource group, it is only freed if there
      are no waiters on it. If there are waiters, the resource group has its
      flags set to RDT_DELETED and it is left to the last waiter to free it
      when that waiter starts running and finds that the resource group has
      been removed (rdtgrp->flags & RDT_DELETED) in the meantime.
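
      A simplified sketch of that check in free_all_child_rdtgrp() (illustration
      only, not the verbatim patch):

        if (atomic_read(&sentry->waitcount) != 0)
                sentry->flags = RDT_DELETED;   /* last waiter frees it later */
        else
                kfree(sentry);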
      
      Backporting notes:
      
      Since upstream commit fa7d9493 ("x86/resctrl: Rename and move rdt
      files to a separate directory"), the file
      arch/x86/kernel/cpu/intel_rdt_rdtgroup.c has been renamed and moved to
      arch/x86/kernel/cpu/resctrl/rdtgroup.c.
      
      Apply the change against file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
      in older stable trees.
      
      Fixes: f3cbeaca ("x86/intel_rdt/cqm: Add rmdir support")
      Fixes: 60cf5e10 ("x86/intel_rdt: Add mkdir to resctrl file system")
      Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1578500886-21771-2-git-send-email-xiaochen.shen@intel.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vfs: fix do_last() regression · 8d7a5100
      Al Viro authored
      commit 6404674a upstream.
      
      Brown paperbag time: fetching ->i_uid/->i_mode really should've been
      done from nd->inode.  I even suggested that, but the reason for it
      slipped through the cracks and I went for dir->d_inode instead - it made
      for a more "obvious" patch.
      
      Analysis:
      
       - at the entry into do_last() and all the way to step_into(): dir (aka
         nd->path.dentry) is known not to have been freed; so's nd->inode and
         it's equal to dir->d_inode unless we are already doomed to -ECHILD.
         inode of the file to get opened is not known.
      
       - after step_into(): inode of the file to get opened is known; dir
         might be pointing to freed memory/be negative/etc.
      
       - at the call of may_create_in_sticky(): guaranteed to be out of RCU
         mode; inode of the file to get opened is known and pinned; dir might
         be garbage.
      
      The last was the reason for the original patch.  Except that at the
      do_last() entry we can be in RCU mode and it is possible that
      nd->path.dentry->d_inode has already changed under us.
      
      In that case we are going to fail with -ECHILD, but we need to be
      careful; nd->inode is pointing to a valid struct inode and it's the same
      as nd->path.dentry->d_inode in the "won't fail with -ECHILD" case, so we
      should use that.
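
      A sketch of the idea (not the verbatim patch): the values handed to
      may_create_in_sticky() are taken from nd->inode rather than from
      nd->path.dentry->d_inode:

        umode_t dir_mode = nd->inode->i_mode;  /* valid even in RCU mode */
        kuid_t  dir_uid  = nd->inode->i_uid;
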
      Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
      Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
      Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      Fixes: d0cb5018 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 01 Feb, 2020 36 commits