1. 18 Apr, 2023 33 commits
  2. 16 Apr, 2023 7 commits
    • Andrew Morton's avatar
    • Peter Xu's avatar
      Revert "userfaultfd: don't fail on unrecognized features" · 2ff559f3
      Peter Xu authored
      This is a proposal to revert commit 914eedcb.
      
      I found this when writing a simple UFFDIO_API test to be the first unit
      test in this set.  Two things breaks with the commit:
      
        - UFFDIO_API check was lost and missing.  According to man page, the
        kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa.  This
        check is needed if the api version will be extended in the future, or
        user app won't be able to identify which is a new kernel.
      
        - Feature flags checks were removed, which means UFFDIO_API with a
        feature that does not exist will also succeed.  According to the man
        page, we should (and it makes sense) to reject ioctl(UFFDIO_API) if
        unknown features passed in.
      
      Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20230412163922.327282-2-peterx@redhat.com
      Fixes: 914eedcb ("userfaultfd: don't fail on unrecognized features")
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2ff559f3
    • Baokun Li's avatar
      writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs · 1ba1199e
      Baokun Li authored
      KASAN report null-ptr-deref:
      ==================================================================
      BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
      Write of size 8 at addr 0000000000000000 by task sync/943
      CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
      Call Trace:
       <TASK>
       dump_stack_lvl+0x7f/0xc0
       print_report+0x2ba/0x340
       kasan_report+0xc4/0x120
       kasan_check_range+0x1b7/0x2e0
       __kasan_check_write+0x24/0x40
       bdi_split_work_to_wbs+0x5c5/0x7b0
       sync_inodes_sb+0x195/0x630
       sync_inodes_one_sb+0x3a/0x50
       iterate_supers+0x106/0x1b0
       ksys_sync+0x98/0x160
      [...]
      ==================================================================
      
      The race that causes the above issue is as follows:
      
                 cpu1                     cpu2
      -------------------------|-------------------------
      inode_switch_wbs
       INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
       queue_rcu_work(isw_wq, &isw->work)
       // queue_work async
        inode_switch_wbs_work_fn
         wb_put_many(old_wb, nr_switched)
          percpu_ref_put_many
           ref->data->release(ref)
           cgwb_release
            queue_work(cgwb_release_wq, &wb->release_work)
            // queue_work async
             &wb->release_work
             cgwb_release_workfn
                                  ksys_sync
                                   iterate_supers
                                    sync_inodes_one_sb
                                     sync_inodes_sb
                                      bdi_split_work_to_wbs
                                       kmalloc(sizeof(*work), GFP_ATOMIC)
                                       // alloc memory failed
              percpu_ref_exit
               ref->data = NULL
               kfree(data)
                                       wb_get(wb)
                                        percpu_ref_get(&wb->refcnt)
                                         percpu_ref_get_many(ref, 1)
                                          atomic_long_add(nr, &ref->data->count)
                                           atomic64_add(i, v)
                                           // trigger null-ptr-deref
      
      bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
      wbs.  If the allocation of new work fails, the on-stack fallback will be
      used and the reference count of the current wb is increased afterwards. 
      If cgroup writeback membership switches occur before getting the reference
      count and the current wb is released as old_wd, then calling wb_get() or
      wb_put() will trigger the null pointer dereference above.
      
      This issue was introduced in v4.3-rc7 (see fix tag1).  Both
      sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
      bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
      sync_inodes_sb(), originally commit 7fc5854f ("writeback: synchronize
      sync(2) against cgroup writeback membership switches") reduced the
      possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
      fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
      inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
      thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
      and the issue becomes easily reproducible again.
      
      To solve this problem, percpu_ref_exit() is called under RCU protection to
      avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). 
      Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
      and skip the current wb if wb_tryget() fails because the wb has already
      been shutdown.
      
      Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
      Fixes: b817525a ("writeback: bdi_writeback iteration must not skip dying ones")
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: yangerkun <yangerkun@huawei.com>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1ba1199e
    • Peng Zhang's avatar
      maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug · 1f5f12ec
      Peng Zhang authored
      In mas_alloc_nodes(), "node->node_count = 0" means to initialize the
      node_count field of the new node, but the node may not be a new node.  It
      may be a node that existed before and node_count has a value, setting it
      to 0 will cause a memory leak.  At this time, mas->alloc->total will be
      greater than the actual number of nodes in the linked list, which may
      cause many other errors.  For example, out-of-bounds access in
      mas_pop_node(), and mas_pop_node() may return addresses that should not be
      used.  Fix it by initializing node_count only for new nodes.
      
      Also, by the way, an if-else statement was removed to simplify the code.
      
      Link: https://lkml.kernel.org/r/20230411041005.26205-1-zhangpeng.00@bytedance.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: default avatarPeng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1f5f12ec
    • Steve Chou's avatar
      tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used · 92357568
      Steve Chou authored
      When using cull option with 'tg' flag, the fprintf is using pid instead
      of tgid. It should use tgid instead.
      
      Link: https://lkml.kernel.org/r/20230411034929.2071501-1-steve_chou@pesi.com.tw
      Fixes: 9c8a0a8e ("tools/vm/page_owner_sort.c: support for user-defined culling rules")
      Signed-off-by: default avatarSteve Chou <steve_chou@pesi.com.tw>
      Cc: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      92357568
    • Jonathan Toppins's avatar
      mailmap: update jtoppins' entry to reference correct email · d2c115ba
      Jonathan Toppins authored
      Link: https://lkml.kernel.org/r/d79bc6eaf65e68bd1c2a1e1510ab6291ce5926a6.1681162487.git.jtoppins@redhat.comSigned-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d2c115ba
    • Liam R. Howlett's avatar
      mm/mempolicy: fix use-after-free of VMA iterator · f4e9e0e6
      Liam R. Howlett authored
      set_mempolicy_home_node() iterates over a list of VMAs and calls
      mbind_range() on each VMA, which also iterates over the singular list of
      the VMA passed in and potentially splits the VMA.  Since the VMA iterator
      is not passed through, set_mempolicy_home_node() may now point to a stale
      node in the VMA tree.  This can result in a UAF as reported by syzbot.
      
      Avoid the stale maple tree node by passing the VMA iterator through to the
      underlying call to split_vma().
      
      mbind_range() is also overly complicated, since there are two calling
      functions and one already handles iterating over the VMAs.  Simplify
      mbind_range() to only handle merging and splitting of the VMAs.
      
      Align the new loop in do_mbind() and existing loop in
      set_mempolicy_home_node() to use the reduced mbind_range() function.  This
      allows for a single location of the range calculation and avoids
      constantly looking up the previous VMA (since this is a loop over the
      VMAs).
      
      Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
      Fixes: 66850be5 ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
        Link: https://lkml.kernel.org/r/20230410152205.2294819-1-Liam.Howlett@oracle.com
      Tested-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f4e9e0e6