1. 06 May, 2024 10 commits
  2. 25 Apr, 2024 11 commits
  3. 16 Apr, 2024 15 commits
    • nilfs2: fix OOB in nilfs_set_de_type · c4a7dc95
      Jeongjun Park authored
      The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is
      defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function,
      which uses this array, specifies the index to read from the array in the
      same way as "(mode & S_IFMT) >> S_SHIFT".
      
      static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode *inode)
      {
      	umode_t mode = inode->i_mode;
      
      	de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob
      }
      
      However, when the index is determined this way, an out-of-bounds (OOB)
      access occurs when the condition "(mode & S_IFMT) == S_IFMT" is satisfied,
      because the resulting index equals the array size and so points one slot
      past the last valid element.  Therefore, the nilfs_type_by_mode array
      should be resized to prevent the OOB access.
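
The index arithmetic above can be checked with a small userspace sketch. The EX_* names and helper functions are hypothetical, the constants are the Linux values of S_IFMT and S_SHIFT hard-coded for illustration, and the "+ 1" resize is only one plausible way to grow the array, not necessarily what the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Linux values, hard-coded here so the sketch does not depend on <sys/stat.h>. */
#define EX_S_IFMT  0170000u	/* mask for the file-type bits of a mode */
#define EX_S_SHIFT 12

#define EX_OLD_SIZE (EX_S_IFMT >> EX_S_SHIFT)		/* 15 slots, indices 0..14 */
#define EX_NEW_SIZE ((EX_S_IFMT >> EX_S_SHIFT) + 1)	/* 16 slots, indices 0..15 */

/* Same index computation as nilfs_set_de_type(). */
static unsigned int ex_type_index(unsigned int mode)
{
	return (mode & EX_S_IFMT) >> EX_S_SHIFT;
}

/* Returns 1 when the index for @mode fits in an array of @size entries. */
static int ex_index_fits(unsigned int mode, size_t size)
{
	return ex_type_index(mode) < size;
}

static void ex_index_demo(void)
{
	/* All file-type bits set: index 15, one past the old array. */
	assert(ex_type_index(0170000u) == 15);
	assert(!ex_index_fits(0170000u, EX_OLD_SIZE));
	assert(ex_index_fits(0170000u, EX_NEW_SIZE));
	/* A regular file (S_IFREG, 0100000) maps to index 8 either way. */
	assert(ex_index_fits(0100644u, EX_OLD_SIZE));
}
```

Only a mode with every file-type bit set reaches the last index, which is why the bug needs a corrupted or crafted inode to trigger.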
      
      Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com
      Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Signed-off-by: Jeongjun Park <aha310510@gmail.com>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: update Naoya Horiguchi's email address · 8247bf1d
      Naoya Horiguchi authored
      My old NEC address has been removed, so update MAINTAINERS and .mailmap to
      map it to my gmail address.
      
      Link: https://lkml.kernel.org/r/20240412181720.18452-1-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fork: defer linking file vma until vma is fully initialized · 35e35178
      Miaohe Lin authored
      Thorvald reported a WARNING [1].  The root cause is the race below:
      
       CPU 1					CPU 2
       fork					hugetlbfs_fallocate
        dup_mmap				 hugetlbfs_punch_hole
         i_mmap_lock_write(mapping);
         vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
         i_mmap_unlock_write(mapping);
         hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
      					 i_mmap_lock_write(mapping);
         					 hugetlb_vmdelete_list
      					  vma_interval_tree_foreach
      					   hugetlb_vma_trylock_write -- Vma_lock is cleared.
         tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
      					   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
      					 i_mmap_unlock_write(mapping);
      
      hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the
      i_mmap_rwsem lock, while the vma lock can be used at the same time.  Fix
      this by deferring the linking of the file vma until the vma is fully
      initialized.  Those vmas should be initialized before they can be used.
      
      Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reported-by: Thorvald Natvig <thorvald@google.com>
      Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peng Zhang <zhangpeng.00@bytedance.com>
      Cc: Tycho Andersen <tandersen@netflix.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: inline shmem_is_huge() for disabled transparent hugepages · 1f737846
      Sumanth Korikkar authored
      To minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y), the compiler might
      choose to make a regular (out-of-line) function call to shmem_is_huge()
      instead of inlining it.  When transparent hugepages are disabled
      (CONFIG_TRANSPARENT_HUGEPAGE=n), this can cause a compilation error.
      
      mm/shmem.c: In function `shmem_getattr':
      ./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
        383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
            |                           ^~~~~~~~~
      mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
       1148 |                 stat->blksize = HPAGE_PMD_SIZE;
      
      To prevent the possible error, always inline shmem_is_huge() when
      transparent hugepages are disabled.
      
      Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.com
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: defer enablement of static branch · 0b2cf0a4
      Oscar Salvador authored
      Kefeng Wang reported that he was seeing some memory leaks with kmemleak
      with page_owner enabled.
      
      The reason is that we enable the page_owner_inited static branch and only
      then link the stack_list struct to dummy_stack.  This leaves a race window
      between the two steps in which pages that are already being allocated can
      call add_stack_record_to_list(), allocating objects and linking them to
      stack_list, before init_page_owner() points stack_list at dummy_stack.
      The objects allocated during that window end up unreferenced and lost.
      
      Fix this by deferring the enablement of the branch until we have properly
      set up the list.
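
The ordering can be sketched in userspace C. This is a simplified model, not the kernel code: a C11 atomic flag stands in for the static branch, and every ex_* name is hypothetical. The list push is single-threaded here; real code would need its own synchronization.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ex_node {
	struct ex_node *next;
};

static struct ex_node ex_dummy;		/* stands in for dummy_stack */
static struct ex_node *ex_list;		/* stands in for stack_list */
static atomic_bool ex_enabled;		/* stands in for the static branch */

/* Allocation-side hook: records nothing until tracking is enabled. */
static int ex_record(struct ex_node *n)
{
	if (!atomic_load_explicit(&ex_enabled, memory_order_acquire))
		return 0;
	n->next = ex_list;
	ex_list = n;
	return 1;
}

/* Set the list up first, and only then let the hook run. */
static void ex_init(void)
{
	ex_list = &ex_dummy;
	atomic_store_explicit(&ex_enabled, 1, memory_order_release);
}

static void ex_init_demo(void)
{
	struct ex_node a = { NULL };

	assert(ex_record(&a) == 0);	/* before init: nothing is linked */
	ex_init();
	assert(ex_record(&a) == 1);	/* after init: linked in front of the dummy */
	assert(ex_list == &a && a.next == &ex_dummy);
}
```

With the two statements in ex_init() swapped, a node recorded between them would be dropped when the list head is overwritten, which is the leak described above.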
      
      Link: https://lkml.kernel.org/r/20240409131715.13632-1-osalvador@suse.de
      Fixes: 4bedfb31 ("mm,page_owner: maintain own list of stack_records structs")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Closes: https://lore.kernel.org/linux-mm/74b147b0-718d-4d50-be75-d6afc801cd24@huawei.com/
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Squashfs: check the inode number is not the invalid value of zero · 9253c54e
      Phillip Lougher authored
      Syzkaller has produced an out of bounds access in fill_meta_index().
      
      That out of bounds access is ultimately caused because the inode
      has an inode number with the invalid value of zero, which was not checked.
      
      This causes the out of bounds access through the following sequence of
      events:
      
      1. fill_meta_index() is called to allocate (via empty_meta_index())
         and fill a metadata index.  It however suffers a data read error
         and aborts, invalidating the newly returned empty metadata index.
         It does this by setting the inode number of the index to zero,
         which means unused (zero is not a valid inode number).
      
      2. When fill_meta_index() is subsequently called again on another
         read operation, locate_meta_index() returns the previous index
         because it matches the inode number of 0.  Because this index
         has been returned it is expected to have been filled, and because
         it hasn't been, an out of bounds access is performed.
      
      This patch adds a sanity check which checks that the inode number
      is not zero when the inode is created and returns -EINVAL if it is.
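
A minimal sketch of why an inode number of zero is dangerous when zero also marks an unused slot. The struct and the ex_* helpers are hypothetical simplifications, not the Squashfs data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified index slot: inode_number == 0 marks the slot as unused. */
struct ex_meta_slot {
	uint32_t inode_number;
	int filled;
};

/* Returns the first slot whose inode number matches, or NULL. */
static struct ex_meta_slot *ex_locate(struct ex_meta_slot *slots, size_t n,
				      uint32_t ino)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (slots[i].inode_number == ino)
			return &slots[i];
	return NULL;
}

/* The sanity check: reject zero before it can ever be used as a lookup key. */
static int ex_check_inode_number(uint32_t ino)
{
	return ino == 0 ? -EINVAL : 0;
}

static void ex_squashfs_demo(void)
{
	struct ex_meta_slot slots[2] = { { 0, 0 }, { 7, 1 } };

	/* Looking up inode 0 matches the invalidated, unfilled slot. */
	assert(ex_locate(slots, 2, 0) == &slots[0]);
	assert(ex_locate(slots, 2, 0)->filled == 0);
	/* Rejecting inode 0 up front rules that lookup out. */
	assert(ex_check_inode_number(0) == -EINVAL);
	assert(ex_check_inode_number(7) == 0);
}
```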
      
      [phillip@squashfs.org.uk: whitespace fix]
        Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk
      Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com>
      Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,swapops: update check in is_pfn_swap_entry for hwpoison entries · 07a57a33
      Oscar Salvador authored
      Tony reported that the Machine check recovery was broken in v6.9-rc1, as
      he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to
      DRAM.
      
      After some more digging and debugging on his side, he realized that this
      went back to v6.1, with the introduction of 'commit 0d206b5d
      ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'.  That
      commit, among other things, introduced swp_offset_pfn(), replacing
      hwpoison_entry_to_pfn() in its favour.
      
      The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
      is_pfn_swap_entry() was never updated to cover hwpoison entries, which
      means that we hit the VM_BUG_ON whenever swp_offset_pfn() is called for
      such entries on environments with CONFIG_DEBUG_VM set.  Fix this by
      updating the check to cover hwpoison entries as well, and update the
      comment while we are at it.
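
The shape of the fix can be illustrated with a simplified model. The enum and both predicates are hypothetical stand-ins. The kernel's actual is_pfn_swap_entry() works on swp_entry_t and covers more entry types than shown here.

```c
#include <assert.h>

/* Hypothetical, reduced set of swap entry kinds. */
enum ex_entry_type {
	EX_ENTRY_SWAP,		/* ordinary swap entry, offset is not a PFN */
	EX_ENTRY_MIGRATION,	/* offset encodes a PFN */
	EX_ENTRY_DEVICE_PRIVATE,/* offset encodes a PFN */
	EX_ENTRY_HWPOISON	/* offset encodes a PFN, but was not covered */
};

/* Check as it behaved before the fix: hwpoison entries are missing. */
static int ex_is_pfn_entry_old(enum ex_entry_type t)
{
	return t == EX_ENTRY_MIGRATION || t == EX_ENTRY_DEVICE_PRIVATE;
}

/* Check after the fix: hwpoison entries count as PFN entries too. */
static int ex_is_pfn_entry_new(enum ex_entry_type t)
{
	return ex_is_pfn_entry_old(t) || t == EX_ENTRY_HWPOISON;
}

static void ex_swapops_demo(void)
{
	/* The old check would trip an assertion guarding the PFN accessor. */
	assert(!ex_is_pfn_entry_old(EX_ENTRY_HWPOISON));
	assert(ex_is_pfn_entry_new(EX_ENTRY_HWPOISON));
	/* Ordinary swap entries stay excluded. */
	assert(!ex_is_pfn_entry_new(EX_ENTRY_SWAP));
}
```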
      
      Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de
      Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: Tony Luck <tony.luck@intel.com>
      Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
      Tested-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: <stable@vger.kernel.org>	[6.1.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled · 1983184c
      Miaohe Lin authored
      When I did a hard offline test with hugetlb pages, the deadlock below occurred:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.8.0-11409-gf6cef5f8 #1 Not tainted
      ------------------------------------------------------
      bash/46904 is trying to acquire lock:
      ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60
      
      but task is already holding lock:
      ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
             __mutex_lock+0x6c/0x770
             page_alloc_cpu_online+0x3c/0x70
             cpuhp_invoke_callback+0x397/0x5f0
             __cpuhp_invoke_callback_range+0x71/0xe0
             _cpu_up+0xeb/0x210
             cpu_up+0x91/0xe0
             cpuhp_bringup_mask+0x49/0xb0
             bringup_nonboot_cpus+0xb7/0xe0
             smp_init+0x25/0xa0
             kernel_init_freeable+0x15f/0x3e0
             kernel_init+0x15/0x1b0
             ret_from_fork+0x2f/0x50
             ret_from_fork_asm+0x1a/0x30
      
      -> #0 (cpu_hotplug_lock){++++}-{0:0}:
             __lock_acquire+0x1298/0x1cd0
             lock_acquire+0xc0/0x2b0
             cpus_read_lock+0x2a/0xc0
             static_key_slow_dec+0x16/0x60
             __hugetlb_vmemmap_restore_folio+0x1b9/0x200
             dissolve_free_huge_page+0x211/0x260
             __page_handle_poison+0x45/0xc0
             memory_failure+0x65e/0xc70
             hard_offline_page_store+0x55/0xa0
             kernfs_fop_write_iter+0x12c/0x1d0
             vfs_write+0x387/0x550
             ksys_write+0x64/0xe0
             do_syscall_64+0xca/0x1e0
             entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcp_batch_high_lock);
                                     lock(cpu_hotplug_lock);
                                     lock(pcp_batch_high_lock);
        rlock(cpu_hotplug_lock);
      
       *** DEADLOCK ***
      
      5 locks held by bash/46904:
       #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
       #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      stack backtrace:
      CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       check_noncircular+0x129/0x140
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7fc862314887
      Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
      RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
      RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00
      
      In short, the scenario below breaks the lock dependency chain:
      
       memory_failure
        __page_handle_poison
         zone_pcp_disable -- lock(pcp_batch_high_lock)
         dissolve_free_huge_page
          __hugetlb_vmemmap_restore_folio
           static_key_slow_dec
            cpus_read_lock -- rlock(cpu_hotplug_lock)
      
      Fix this by calling drain_all_pages() instead.
      
      This issue did not occur before commit a6b40850 ("mm: hugetlb: replace
      hugetlb_free_vmemmap_enabled with a static_key"), which introduced
      rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while
      lock(pcp_batch_high_lock) is already held in __page_handle_poison().
      
      [linmiaohe@huawei.com: extend comment per Oscar]
      [akpm@linux-foundation.org: reflow block comment]
      Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
      Fixes: a6b40850 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/userfaultfd: allow hugetlb change protection upon poison entry · c5977c95
      Peter Xu authored
      After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
      the POISON one or UFFD_WP one.
      
      Allow change protection to run on a poisoned marker just like the !hugetlb
      cases, ignoring the marker irrespective of the permission.
      
      Here the two bits are mutually exclusive.  For example, when installing a
      poisoned entry it must not be UFFD_WP already (by checking pte_none()
      before such an install).  It also means that if UFFD_WP is set there must
      be no POISON bit set.  This makes sense because UFFD_WP is a bit that
      reflects permission, and permissions do not apply if the pte is poisoned
      and destined to SIGBUS.
      
      So here we simply check whether the uffd_wp bit is set first, and do
      nothing otherwise.
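
A small sketch of the marker check, assuming a two-bit marker value. The EX_* bit values and helper names are illustrative only and are not taken from the kernel headers.

```c
#include <assert.h>

/* Illustrative marker bits; the two are expected to be mutually exclusive. */
#define EX_MARKER_UFFD_WP  (1u << 0)
#define EX_MARKER_POISONED (1u << 1)

/* A marker is well-formed when it does not carry both bits at once. */
static int ex_marker_is_valid(unsigned int marker)
{
	unsigned int both = EX_MARKER_UFFD_WP | EX_MARKER_POISONED;

	return (marker & both) != both;
}

/*
 * Change protection only has work to do when the uffd-wp bit is set.
 * Any other marker, including a poisoned one, is left untouched.
 */
static int ex_change_protection_acts(unsigned int marker)
{
	return (marker & EX_MARKER_UFFD_WP) != 0;
}

static void ex_marker_demo(void)
{
	assert(ex_change_protection_acts(EX_MARKER_UFFD_WP));
	assert(!ex_change_protection_acts(EX_MARKER_POISONED));
	assert(ex_marker_is_valid(EX_MARKER_POISONED));
	assert(!ex_marker_is_valid(EX_MARKER_UFFD_WP | EX_MARKER_POISONED));
}
```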
      
      Attach the Fixes to UFFDIO_POISON work, as before that it should not be
      possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap,
      so no chance of swapin errors).
      
      Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
      Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
      Fixes: fc71884a ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: syzbot+b07c8ac8eee3d4d8440f@syzkaller.appspotmail.com
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: fix printing of stack records · 74017458
      Oscar Salvador authored
      When seq_* code sees that its buffer overflowed, it re-allocates a bigger
      one and calls the seq_operations->start() callback again.  stack_start()
      naively thought that if it got called again, it meant that the old record
      had already been printed, so it returned the next object, but that is not
      true.
      
      The consequence is that every time stack_stop() -> stack_start() gets
      called because we needed a bigger buffer, stack_start() skips entries,
      and those are not printed.
      
      Fix it by not advancing to the next object in stack_start().
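
The skipped-record effect can be reproduced with a toy iterator. This is a hypothetical model of a start()/next() pair, where records are plain integers and a repeated start() call stands in for the buffer-overflow retry. It does not use the kernel seq_file API.

```c
#include <assert.h>

/* Toy iterator state: index of the record to show next. */
struct ex_iter {
	int cur;
};

/* Buggy start(): assumes a repeated call means "previous record was shown". */
static int ex_start_advancing(struct ex_iter *it, long pos)
{
	if (pos == 0)
		it->cur = 0;
	else
		it->cur += 1;
	return it->cur;
}

/* Fixed start(): a repeated call resumes at the same record. */
static int ex_start_stable(struct ex_iter *it, long pos)
{
	if (pos == 0)
		it->cur = 0;
	return it->cur;
}

/* next(): the only place that is supposed to advance. */
static int ex_next(struct ex_iter *it, long *pos)
{
	it->cur += 1;
	*pos += 1;
	return it->cur;
}

static void ex_seq_demo(void)
{
	struct ex_iter a = { 0 }, b = { 0 };
	long pos_a = 0, pos_b = 0;

	/* Show record 0, move to record 1, then retry with a bigger buffer. */
	assert(ex_start_advancing(&a, pos_a) == 0);
	assert(ex_next(&a, &pos_a) == 1);
	assert(ex_start_advancing(&a, pos_a) == 2);	/* record 1 is skipped */

	assert(ex_start_stable(&b, pos_b) == 0);
	assert(ex_next(&b, &pos_b) == 1);
	assert(ex_start_stable(&b, pos_b) == 1);	/* record 1 is retried */
}
```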
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-5-osalvador@suse.de
      Fixes: 765973a0 ("mm,page_owner: display all stacks and their count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: fix accounting of pages when migrating · 718b1f33
      Oscar Salvador authored
      Upon migration, newly allocated pages are given the handle of the old
      pages.  This is problematic because it means that for the stack which
      allocated the old page, we will be subtracting the old page + the new one
      when that page is freed, creating an accounting imbalance.
      
      There is an interest in keeping it that way, as otherwise the output will
      be biased towards migration stacks should those operations occur often,
      but that is not really helpful.
      
      The link from the new page to the old stack is being performed by calling
      __update_page_owner_handle() in __folio_copy_owner().  The only thing that
      is left is to link the migrate stack to the old page, so the old page will
      be subtracted from the migrate stack, avoiding by doing so any possible
      imbalance.
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-4-osalvador@suse.de
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: fix refcount imbalance · f5c12105
      Oscar Salvador authored
      Current code does not contemplate scenarios where an allocation and a free
      operation on the same pages do not handle the same number of pages at
      once.  To give an example, alloc_pages_exact(), where we allocate a page
      of an order high enough to satisfy the size request, but free the
      remainder right away.
      
      In the above example, we will increment the stack_record refcount only
      once, but we will decrease it the same number of times as number of unused
      pages we have to free.  This will lead to a warning because of refcount
      imbalance.
      
      Fix this by recording the number of base pages in the refcount field.
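
The imbalance is easy to see with plain arithmetic. The sketch below is a hypothetical model of the counter only: one higher-order allocation is followed by page-by-page frees, first of the unused remainder and later of the pages that were used.

```c
#include <assert.h>

/* Amount added or removed for one operation of the given order. */
static long ex_delta(unsigned int order, int per_base_page)
{
	return per_base_page ? (1L << order) : 1L;
}

/*
 * Final counter value after allocating 2^order pages at once, freeing the
 * unused remainder right away, and freeing the used pages one by one later.
 */
static long ex_exact_alloc_balance(unsigned int order, unsigned int used_pages,
				   int per_base_page)
{
	unsigned int total = 1u << order;
	unsigned int i;
	long count = 0;

	count += ex_delta(order, per_base_page);	/* single allocation */
	for (i = used_pages; i < total; i++)		/* unused remainder */
		count -= ex_delta(0, per_base_page);
	for (i = 0; i < used_pages; i++)		/* used pages, later */
		count -= ex_delta(0, per_base_page);
	return count;
}

static void ex_refcount_demo(void)
{
	/* Order 2 (4 pages), 3 used: counting operations gives 1 - 1 - 3. */
	assert(ex_exact_alloc_balance(2, 3, 0) == -3);
	/* Counting base pages gives 4 - 1 - 3, which balances. */
	assert(ex_exact_alloc_balance(2, 3, 1) == 0);
}
```

Counting base pages keeps the total at zero no matter how the frees are split up, since every free removes exactly the pages it covers.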
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-3-osalvador@suse.de
      Reported-by: syzbot+41bbfdb8d41003d12c0f@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/linux-mm/00000000000090e8ff0613eda0e5@google.com
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: update metadata for tail pages · ea4b5b33
      Oscar Salvador authored
      Patch series "page_owner: Fix refcount imbalance and print fixup", v4.
      
      This series consists of a refactoring/correctness of updating the metadata
      of tail pages, a couple of fixups for the refcounting part and a fixup for
      the stack_start() function.
      
      From this series on, instead of counting the stacks, we count the
      outstanding nr_base_pages each stack has, which gives us a much better
      memory overview.  The other fixup is for the migration part.
      
      A more detailed explanation can be found in the changelog of the
      respective patches.
      
      
      This patch (of 4):
      
      __set_page_owner_handle() and __reset_page_owner() update the metadata of
      all pages when the page is of a higher order, but we fail to do the same
      when the pages are migrated.  __folio_copy_owner() only updates the
      metadata of the head page, meaning that the information stored in the
      first page and the tail pages will not match.
      
      Strictly speaking that is not a big problem because 1) we do not print
      tail pages and 2) upon splitting all tail pages will inherit the metadata
      of the head page, but it is better to have all metadata in check should
      there be any problem, so it can ease debugging.
      
      For that purpose, a couple of helpers are created:
      __update_page_owner_handle(), which updates the metadata on allocation,
      and __update_page_owner_free_handle(), which does the same when the page
      is freed.
      
      __folio_copy_owner() will make use of both as it needs to entirely replace
      the page_owner metadata for the new page.
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20240404070702.2744-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE · c0205eaf
      Lokesh Gidra authored
      Commit d7a08838 ("mm: userfaultfd: fix unexpected change to src_folio
      when UFFDIO_MOVE fails") moved the src_folio->{mapping, index} changing to
      after clearing the page-table and ensuring that it's not pinned.  This
      avoids failure of swapout+migration and possibly memory corruption.
      
      However, the commit missed fixing it in the huge-page case.
      
      Link: https://lkml.kernel.org/r/20240404171726.2302435-1-lokeshgidra@google.com
      Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI")
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly · 631426ba
      David Hildenbrand authored
      Darrick reports that in some cases where pread() would fail with -EIO and
      mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
      MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
      
      While the madvise() call can be interrupted by a signal, this is not the
      desired behavior.  MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
      like page faults in that case: fail and not retry forever.
      
      A reproducer can be found at [1].
      
      The reason is that __get_user_pages(), as called by
      faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
      it will simply return 0 when VM_FAULT_RETRY happened, making
      madvise_populate()->faultin_vma_page_range() retry again and again, never
      setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
      
      __get_user_pages_locked() does what we want, but duplicating that logic in
      faultin_vma_page_range() feels wrong.
      
      So let's use __get_user_pages_locked() instead, that will detect
      VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
      return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
      from faultin_page() to __get_user_pages(), all the way to
      madvise_populate().
      
      But, there is an issue: __get_user_pages_locked() will end up re-taking
      the MM lock and then __get_user_pages() will do another VMA lookup.  In
      the meantime, the VMA layout could have changed and we'd fail with
      different error codes than we'd want to.
      
      As __get_user_pages() will currently do a new VMA lookup either way, let
      it do the VMA handling in a different way, controlled by a new
      FOLL_MADV_POPULATE flag, effectively moving these checks from
      madvise_populate() + faultin_page_range() in there.
      
      With this change, Darrick's reproducer properly fails with -EFAULT, as
      documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
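
The retry behavior can be modeled with a short loop. This is a hypothetical simulation, not kernel code: the fault handler is assumed to ask for a retry until it is told that a retry already happened, and a bounded attempt count with -EAGAIN stands in for "retrying forever".

```c
#include <assert.h>
#include <errno.h>

enum ex_fault {
	EX_FAULT_OK,
	EX_FAULT_RETRY,
	EX_FAULT_SIGBUS
};

/* Assumed handler for a page whose backing read always fails. */
static enum ex_fault ex_handle_fault(int tried)
{
	return tried ? EX_FAULT_SIGBUS : EX_FAULT_RETRY;
}

/*
 * Populate loop. With propagate_tried == 0 the "already tried" state is
 * never passed on, so every attempt looks like the first one.
 */
static int ex_populate(int propagate_tried, int max_attempts)
{
	int tried = 0;
	int attempt;

	for (attempt = 0; attempt < max_attempts; attempt++) {
		enum ex_fault ret = ex_handle_fault(tried);

		if (ret == EX_FAULT_OK)
			return 0;
		if (ret == EX_FAULT_SIGBUS)
			return -EFAULT;
		if (propagate_tried)
			tried = 1;	/* remember that a retry happened */
	}
	return -EAGAIN;	/* sketch-only stand-in for an endless retry */
}

static void ex_populate_demo(void)
{
	assert(ex_populate(0, 1000) == -EAGAIN);	/* never terminates on its own */
	assert(ex_populate(1, 1000) == -EFAULT);	/* fails on the second attempt */
}
```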
      
      [1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
      
      Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
      Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 14 Apr, 2024 4 commits
    • Linux 6.9-rc4 · 0bbac3fa
      Linus Torvalds authored
    • Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 72374d71
      Linus Torvalds authored
      Pull sysfs fix from Al Viro:
       "Get rid of lockdep false positives around sysfs/overlayfs
      
        syzbot has uncovered a class of lockdep false positives for setups
        with sysfs being one of the backing layers in overlayfs. The root
        cause is that of->mutex allocated when opening a sysfs file read-only
        (which overlayfs might do) is confused with of->mutex of a file opened
        writable (held in write to sysfs file, which overlayfs won't do).
      
        Assigning them separate lockdep classes fixes that bunch and it's
        obviously safe"
      
      * tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        kernfs: annotate different lockdep class for of->mutex of writable files
    • Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 27fd8085
      Linus Torvalds authored
      Pull misc x86 fixes from Ingo Molnar:
      
       - Follow up fixes for the BHI mitigations code
      
       - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
         expected
      
       - Work around an APIC emulation bug when the kernel is built with Clang
         and run as a SEV guest
      
       - Follow up x86 topology fixes
      
      * tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/cpu/amd: Move TOPOEXT enablement into the topology parser
        x86/cpu/amd: Make the NODEID_MSR union actually work
        x86/cpu/amd: Make the CPUID 0x80000008 parser correct
        x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
        x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
        x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
        x86/bugs: Fix BHI handling of RRSBA
        x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
        x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
        x86/bugs: Fix BHI documentation
        x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
        x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
        x86/bugs: Fix return type of spectre_bhi_state()
        x86/apic: Force native_apic_mem_read() to use the MOV instruction
    • Merge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c748fc3b
      Linus Torvalds authored
      Pull timer fixes from Ingo Molnar:
      
       - Address a (valid) W=1 build warning
      
       - Fix timer self-tests
      
       - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
         global variable
      
       - Address a !CONFIG_BUG build warning
      
      * tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        selftests: kselftest: Fix build failure with NOLIBC
        selftests: timers: Fix abs() warning in posix_timers test
        selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
        selftests: timers: Fix posix_timers ksft_print_msg() warning
        selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
        bug: Fix no-return-statement warning with !CONFIG_BUG
        timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
        selftests/timers/posix_timers: Reimplement check_timer_distribution()
        irqflags: Explicitly ignore lockdep_hrtimer_exit() argument