1. 10 Sep, 2024 2 commits
  2. 04 Sep, 2024 1 commit
  3. 30 Aug, 2024 14 commits
  4. 20 Aug, 2024 3 commits
    • cgroup/cpuset: remove use_parent_ecpus of cpuset · 3c2acae8
      Chen Ridong authored
      use_parent_ecpus is used to track whether the children are using the
      parent's effective_cpus. When a parent's effective_cpus is changed
      due to changes in a child partition's effective_xcpus, any child
      using parent's effective_cpus must call update_cpumasks_hier. However,
      if a child is not a valid partition, it is sufficient to determine
      whether to call update_cpumasks_hier based on whether the child's
      effective_cpus is going to change. To make the code more succinct,
      it is suggested to remove use_parent_ecpus.
      Signed-off-by: Chen Ridong <chenridong@huawei.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
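The decision described in this message can be sketched as a small userspace model. This is Python, not kernel code: CPU masks are plain sets, the function names are hypothetical, and the inheritance rule in `new_effective_cpus` (restrict to the parent's effective CPUs, inherit them if nothing is left) is an assumption about cgroup v2 behaviour rather than something stated in the commit.

```python
def new_effective_cpus(cpus_allowed, parent_effective):
    """Model of a non-partition child's effective CPUs (assumed v2 rule)."""
    restricted = cpus_allowed & parent_effective
    # Assumption: with nothing left after restriction, the child inherits
    # the parent's effective CPUs.
    return restricted if restricted else set(parent_effective)


def needs_hier_update(child_effective, cpus_allowed, parent_effective):
    """True if the child's effective CPUs would change under the new parent mask."""
    return new_effective_cpus(cpus_allowed, parent_effective) != child_effective


# Parent's effective_cpus shrinks from {0, 1, 2, 3} to {0, 1, 2}.
parent_after = {0, 1, 2}
print(needs_hier_update({0, 1}, {0, 1}, parent_after))        # False
print(needs_hier_update({0, 1, 2, 3}, set(), parent_after))   # True
```

Under this model no separate tracking flag is needed: comparing the current and recomputed masks is enough to decide whether the child must be updated.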
    • cgroup/cpuset: remove fetch_xcpus · 9414f68d
      Chen Ridong authored
      Both fetch_xcpus and user_xcpus functions are used to retrieve the value
      of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the
      implicit value used as exclusive in a local partition. I cannot imagine
      a scenario where effective_xcpus is not empty when exclusive_cpus is
      empty. Therefore, I suggest removing the fetch_xcpus function.
      Signed-off-by: Chen Ridong <chenridong@huawei.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
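The fallback rule stated in this message (cpus_allowed is the implicit exclusive set when exclusive_cpus is empty) can be illustrated with a short Python model. This is a sketch only: the `Cpuset` class is a hypothetical stand-in for the kernel structure, sets stand in for cpumasks, and the Python `user_xcpus` merely mirrors the behaviour described above.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    """Hypothetical stand-in for the kernel's cpuset; sets model cpumasks."""
    cpus_allowed: set = field(default_factory=set)
    exclusive_cpus: set = field(default_factory=set)


def user_xcpus(cs: Cpuset) -> set:
    """Return exclusive_cpus if it is set, otherwise fall back to cpus_allowed."""
    return cs.exclusive_cpus if cs.exclusive_cpus else cs.cpus_allowed


implicit = Cpuset(cpus_allowed={2, 3})
explicit = Cpuset(cpus_allowed={0, 1, 2, 3}, exclusive_cpus={3})
print(sorted(user_xcpus(implicit)))  # [2, 3]
print(sorted(user_xcpus(explicit)))  # [3]
```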
    • cgroup/cpuset: Correct invalid remote partition prs · e55f45b4
      Chen Ridong authored
      When enabling a remote partition, I found that:
      
        # cd /sys/fs/cgroup/
        # mkdir test
        # mkdir test/test1
        # echo +cpuset > cgroup.subtree_control
        # echo +cpuset > test/cgroup.subtree_control
        # echo 3 > test/test1/cpuset.cpus
        # echo root > test/test1/cpuset.cpus.partition
        # cat test/test1/cpuset.cpus.partition
        root invalid (Parent is not a partition root)
      
      The parent of a remote partition does not have to be a partition root,
      so this message is misleading. The failure is actually due to the empty
      effective_xcpus. It would be better to report the message "invalid cpu
      list in cpuset.cpus.exclusive".
      Signed-off-by: Chen Ridong <chenridong@huawei.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 19 Aug, 2024 1 commit
  6. 09 Aug, 2024 1 commit
  7. 05 Aug, 2024 7 commits
    • selftest/cgroup: Add new test cases to test_cpuset_prs.sh · 92841d6e
      Waiman Long authored
      Add new test cases to test_cpuset_prs.sh to cover corner cases reported
      in previous fix commits.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Check for partition roots with overlapping CPUs · 99570300
      Waiman Long authored
      With the previous commit that eliminates the overlapping partition
      root corner cases in the hotplug code, the partition roots passed down
      to generate_sched_domains() should not have overlapping CPUs. Enable
      overlapping cpuset check for v2 and warn if that happens.
      
      This patch also has the benefit of increasing test coverage of the new
      Union-Find cpuset merging code to cgroup v2.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
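To make the idea of an overlap check backed by Union-Find merging concrete, here is a rough Python model. It is not the kernel's generate_sched_domains() code: the helper names (`find`, `merge_overlapping`, `has_overlap`) are hypothetical, CPU masks are non-empty Python sets, and the pairwise comparison is chosen for clarity rather than efficiency.

```python
def find(parent, i):
    """Return the representative of element i, halving paths on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_overlapping(cpu_sets):
    """Merge CPU sets that share at least one CPU (inputs must be non-empty)."""
    parent = list(range(len(cpu_sets)))
    for i in range(len(cpu_sets)):
        for j in range(i + 1, len(cpu_sets)):
            if cpu_sets[i] & cpu_sets[j]:
                parent[find(parent, i)] = find(parent, j)  # union the groups
    merged = {}
    for i, cpus in enumerate(cpu_sets):
        merged.setdefault(find(parent, i), set()).update(cpus)
    return sorted(merged.values(), key=min)


def has_overlap(cpu_sets):
    """True if any two sets overlap, i.e. merging reduced the number of groups."""
    return len(merge_overlapping(cpu_sets)) != len(cpu_sets)


print(merge_overlapping([{0, 1}, {2, 3}, {3, 4}]))  # [{0, 1}, {2, 3, 4}]
print(has_overlap([{0, 1}, {2, 3}]))                # False
```

In this model, partition roots with disjoint CPUs pass through unmerged, which corresponds to the situation the commit expects for v2 and warns about otherwise.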
    • Merge branch 'cgroup/for-6.11-fixes' into cgroup/for-6.12 · bc3c2751
      Tejun Heo authored
      cgroup/for-6.12 is about to receive updates that are dependent on changes
      from both for-6.11-fixes and for-6.12. Pull in for-6.11-fixes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Eliminate unnecessary sched domains rebuilds in hotplug · ff0ce721
      Waiman Long authored
      It was found that some hotplug operations may cause multiple
      rebuild_sched_domains_locked() calls. Some of those intermediate calls
      may use cpuset states not in the final correct form leading to incorrect
      sched domain setting.
      
      Fix this problem by using the existing force_rebuild flag to inhibit
      immediate rebuild_sched_domains_locked() calls if set and only doing
      one final call at the end. Also rename the force_rebuild flag to
      force_sd_rebuild to make its meaning more clear.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
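The deferral pattern described above (a flag that suppresses intermediate rebuilds, followed by a single final rebuild) can be sketched in Python as follows. The class and method names are hypothetical and the model only counts rebuilds; it does not reflect the actual kernel call paths.

```python
class SchedDomainModel:
    """Toy model of deferring sched domain rebuilds behind a flag."""

    def __init__(self):
        self.force_sd_rebuild = False
        self.rebuild_count = 0

    def rebuild_sched_domains(self):
        """Stand-in for the real rebuild; clears the pending flag."""
        self.force_sd_rebuild = False
        self.rebuild_count += 1

    def update_partition(self):
        """An intermediate step that would otherwise rebuild immediately."""
        if self.force_sd_rebuild:
            return  # a final rebuild is already pending, so skip this one
        self.rebuild_sched_domains()

    def handle_hotplug(self, steps):
        """Run several intermediate steps, then rebuild exactly once."""
        self.force_sd_rebuild = True
        for _ in range(steps):
            self.update_partition()
        if self.force_sd_rebuild:
            self.rebuild_sched_domains()


model = SchedDomainModel()
model.handle_hotplug(steps=3)
print(model.rebuild_count)  # 1
```

The point of the design is that the single rebuild sees only the final cpuset state, never an intermediate one.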
    • cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set · 311a1bdc
      Waiman Long authored
      Commit e2ffe502 ("cgroup/cpuset: Add cpuset.cpus.exclusive for
      v2") adds a user writable cpuset.cpus.exclusive file for setting
      exclusive CPUs to be used for the creation of partitions. Since then
      effective_xcpus depends on both the cpuset.cpus and cpuset.cpus.exclusive
      setting. If cpuset.cpus.exclusive is set, effective_xcpus will depend
      only on cpuset.cpus.exclusive.  When it is not set, effective_xcpus
      will be set according to the cpuset.cpus value when the cpuset becomes
      a valid partition root.
      
      When cpuset.cpus is being cleared by the user, effective_xcpus should
      only be cleared when cpuset.cpus.exclusive is not set. However, that
      is not currently the case.
      
        # cd /sys/fs/cgroup/
        # mkdir test
        # echo +cpuset > cgroup.subtree_control
        # cd test
        # echo 3 > cpuset.cpus.exclusive
        # cat cpuset.cpus.exclusive.effective
        3
        # echo > cpuset.cpus
        # cat cpuset.cpus.exclusive.effective // was cleared
      
      Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is
      not set.
      
      Fixes: e2ffe502 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
      Cc: stable@vger.kernel.org # v6.7+
      Reported-by: Chen Ridong <chenridong@huawei.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
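The intended behaviour after this fix can be modelled in a few lines of Python. This is an illustrative sketch, not the kernel change itself: the `Cpuset` class and `clear_cpus_allowed` function are hypothetical, and sets stand in for cpumasks.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    """Hypothetical stand-in for the kernel's cpuset; sets model cpumasks."""
    cpus_allowed: set = field(default_factory=set)
    exclusive_cpus: set = field(default_factory=set)
    effective_xcpus: set = field(default_factory=set)


def clear_cpus_allowed(cs: Cpuset) -> None:
    """Model of writing an empty value to cpuset.cpus."""
    cs.cpus_allowed = set()
    # Only drop effective_xcpus when it was derived from cpus_allowed,
    # that is, when cpuset.cpus.exclusive has not been set.
    if not cs.exclusive_cpus:
        cs.effective_xcpus = set()


with_exclusive = Cpuset(exclusive_cpus={3}, effective_xcpus={3})
clear_cpus_allowed(with_exclusive)
print(sorted(with_exclusive.effective_xcpus))  # [3]
```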
    • cgroup/cpuset: fix panic caused by partcmd_update · 959ab635
      Chen Ridong authored
      We found a bug as below:
      BUG: unable to handle page fault for address: 00000003
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 3 PID: 358 Comm: bash Tainted: G        W I        6.6.0-10893-g60d6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4
      RIP: 0010:partition_sched_domains_locked+0x483/0x600
      Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9
      RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202
      RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80
      RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000
      R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002
      R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0
      Call Trace:
       <TASK>
       ? show_regs+0x8c/0xa0
       ? __die_body+0x23/0xa0
       ? __die+0x3a/0x50
       ? page_fault_oops+0x1d2/0x5c0
       ? partition_sched_domains_locked+0x483/0x600
       ? search_module_extables+0x2a/0xb0
       ? search_exception_tables+0x67/0x90
       ? kernelmode_fixup_or_oops+0x144/0x1b0
       ? __bad_area_nosemaphore+0x211/0x360
       ? up_read+0x3b/0x50
       ? bad_area_nosemaphore+0x1a/0x30
       ? exc_page_fault+0x890/0xd90
       ? __lock_acquire.constprop.0+0x24f/0x8d0
       ? __lock_acquire.constprop.0+0x24f/0x8d0
       ? asm_exc_page_fault+0x26/0x30
       ? partition_sched_domains_locked+0x483/0x600
       ? partition_sched_domains_locked+0xf0/0x600
       rebuild_sched_domains_locked+0x806/0xdc0
       update_partition_sd_lb+0x118/0x130
       cpuset_write_resmask+0xffc/0x1420
       cgroup_file_write+0xb2/0x290
       kernfs_fop_write_iter+0x194/0x290
       new_sync_write+0xeb/0x160
       vfs_write+0x16f/0x1d0
       ksys_write+0x81/0x180
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x2f25/0x4630
       do_syscall_64+0x44/0xb0
       entry_SYSCALL_64_after_hwframe+0x78/0xe2
      RIP: 0033:0x7f44a553c887
      
      It can be reproduced with commands:

        # cd /sys/fs/cgroup/
        # mkdir test
        # cd test/
        # echo +cpuset > ../cgroup.subtree_control
        # echo root > cpuset.cpus.partition
        # cat /sys/fs/cgroup/cpuset.cpus.effective
        0-3
        # echo 0-3 > cpuset.cpus // taking away all cpus from root
      
      This issue is caused by the incorrect rebuilding of scheduling domains.
      In this scenario, test/cpuset.cpus.partition should be an invalid root
      and should not trigger the rebuilding of scheduling domains. When calling
      update_parent_effective_cpumask with partcmd_update, if newmask is not
      null, it should recheck whether newmask leaves any CPUs available
      for a parent/cs that has tasks.
      
      Fixes: 0c7f293e ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
      Cc: stable@vger.kernel.org # v6.7+
      Signed-off-by: Chen Ridong <chenridong@huawei.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/pids: Remove unreachable paths of pids_{can,cancel}_fork · 4980f712
      Xiu Jianfeng authored
      According to the implementation of cgroup_css_set_fork(), it will fail
      if cset cannot be found and the can_fork/cancel_fork methods will not
      be called in this case, which means that the argument 'cset' for these
      methods must not be NULL, so remove the unreachable paths in them.
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  8. 02 Aug, 2024 1 commit
  9. 31 Jul, 2024 1 commit
    • cgroup: Show # of subsystem CSSes in cgroup.stat · ab031252
      Waiman Long authored
      Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
      help manage different structures in various cgroup subsystems by being
      an embedded element inside a larger structure like cpuset or mem_cgroup.
      
      The /proc/cgroups file shows the number of cgroups for each of the
      subsystems.  With cgroup v1, the number of CSSes is the same as the
      number of cgroups.  That is not the case anymore with cgroup v2. The
      /proc/cgroups file cannot show the actual number of CSSes for the
      subsystems that are bound to cgroup v2.
      
      So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
      we can't tell by looking at /proc/cgroups which cgroup subsystems may
      be responsible.
      
      As cgroup v2 has deprecated the use of /proc/cgroups, the hierarchical
      cgroup.stat file is now being extended to show the number of live and
      dying CSSes associated with all the non-inhibited cgroup subsystems that
      have been bound to cgroup v2. The number includes CSSes in the current
      cgroup as well as in all the descendants underneath it.  This will help
      us pinpoint which subsystems are responsible for the increasing number
      of dying (nr_dying_descendants) cgroups.
      
      The CSSes dying counts are stored in the cgroup structure itself
      instead of inside the CSS as suggested by Johannes. This will allow
      us to accurately track dying counts of cgroup subsystems that have
      recently been disabled in a cgroup. It is now possible that a zero
      subsystem number is coupled with a non-zero dying subsystem number.
      
      The cgroup-v2.rst file is updated to discuss this new behavior.
      
      With this patch applied, sample output from the root cgroup.stat file
      is shown below.
      
      	nr_descendants 56
      	nr_subsys_cpuset 1
      	nr_subsys_cpu 43
      	nr_subsys_io 43
      	nr_subsys_memory 56
      	nr_subsys_perf_event 57
      	nr_subsys_hugetlb 1
      	nr_subsys_pids 56
      	nr_subsys_rdma 1
      	nr_subsys_misc 1
      	nr_dying_descendants 30
      	nr_dying_subsys_cpuset 0
      	nr_dying_subsys_cpu 0
      	nr_dying_subsys_io 0
      	nr_dying_subsys_memory 30
      	nr_dying_subsys_perf_event 0
      	nr_dying_subsys_hugetlb 0
      	nr_dying_subsys_pids 0
      	nr_dying_subsys_rdma 0
      	nr_dying_subsys_misc 0
      
      Another sample output from system.slice/cgroup.stat was:
      
      	nr_descendants 34
      	nr_subsys_cpuset 0
      	nr_subsys_cpu 32
      	nr_subsys_io 32
      	nr_subsys_memory 34
      	nr_subsys_perf_event 35
      	nr_subsys_hugetlb 0
      	nr_subsys_pids 34
      	nr_subsys_rdma 0
      	nr_subsys_misc 0
      	nr_dying_descendants 30
      	nr_dying_subsys_cpuset 0
      	nr_dying_subsys_cpu 0
      	nr_dying_subsys_io 0
      	nr_dying_subsys_memory 30
      	nr_dying_subsys_perf_event 0
      	nr_dying_subsys_hugetlb 0
      	nr_dying_subsys_pids 0
      	nr_dying_subsys_rdma 0
      	nr_dying_subsys_misc 0
      
      Note that the 'debug' controller wasn't used to provide this information
      because the controller is not recommended in production kernels, and many
      of them won't enable CONFIG_CGROUP_DEBUG by default.
      
      Similar information could be retrieved with debuggers like drgn but that's
      also not always available (e.g. lockdown) and the additional cost of runtime
      tracking here is deemed marginal.
      
      tj: Added Michal's paragraphs on why this is not added to the debug controller
          to the commit message.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
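As an illustration of how the new fields might be consumed, the Python sketch below parses cgroup.stat text and lists subsystems with a non-zero dying count. The field names follow the sample output in this message; the function names are hypothetical, and the embedded sample is a shortened copy of the root cgroup.stat output above rather than data read from a live system.

```python
def parse_cgroup_stat(text):
    """Parse 'key value' lines from cgroup.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def dying_subsystems(stats):
    """Return {subsystem: count} for every nr_dying_subsys_* entry above zero."""
    prefix = "nr_dying_subsys_"
    return {
        key[len(prefix):]: value
        for key, value in stats.items()
        if key.startswith(prefix) and value > 0
    }


SAMPLE = """\
nr_descendants 56
nr_subsys_cpuset 1
nr_subsys_memory 56
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_memory 30
"""

print(dying_subsystems(parse_cgroup_stat(SAMPLE)))  # {'memory': 30}
```

On a real system the text would come from reading a cgroup.stat file under the cgroup v2 mount point instead of the `SAMPLE` string.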
  10. 30 Jul, 2024 6 commits
  11. 28 Jul, 2024 3 commits