1. 11 Sep, 2022 9 commits
    • mm/migrate_device.c: add missing flush_cache_page() · a3589e1d
      Alistair Popple authored
      Currently we only call flush_cache_page() for the anon_exclusive case.
      However, in both cases we clear the PTE, so we should flush the cache.
      
      Link: https://lkml.kernel.org/r/5676f30436ab71d1a587ac73f835ed8bd2113ff5.1662078528.git-series.apopple@nvidia.com
      Fixes: 8c3328f1 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Karol Herbst <kherbst@redhat.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device.c: flush TLB while holding PTL · 60bae737
      Alistair Popple authored
      When clearing a PTE the TLB should be flushed whilst still holding the PTL
      to avoid a potential race with madvise/munmap/etc.  For example consider
      the following sequence:
      
        CPU0                          CPU1
        ----                          ----
      
        migrate_vma_collect_pmd()
        pte_unmap_unlock()
                                      madvise(MADV_DONTNEED)
                                      -> zap_pte_range()
                                      pte_offset_map_lock()
                                      [ PTE not present, TLB not flushed ]
                                      pte_unmap_unlock()
                                      [ page is still accessible via stale TLB ]
        flush_tlb_range()
      
      In this case the page may still be accessed via the stale TLB entry after
      madvise returns.  Fix this by flushing the TLB while holding the PTL.
      
      Fixes: 8c3328f1 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
      Link: https://lkml.kernel.org/r/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Karol Herbst <kherbst@redhat.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
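The ordering problem described in the commit above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: `struct toy_mm` and every `toy_*` function are hypothetical names invented here, the "lock" is a plain flag, and nothing below is kernel code. It shows that when the flush is deferred past the unlock, a later lock holder can find the PTE cleared while a stale translation is still valid; flushing before the unlock closes that window.

```c
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not a kernel API. */
struct toy_mm {
    bool ptl_held;    /* stands in for the page table lock (PTL) */
    bool pte_present; /* the page table entry */
    bool tlb_valid;   /* a cached translation for the same page */
};

static void toy_lock(struct toy_mm *mm)   { mm->ptl_held = true; }
static void toy_unlock(struct toy_mm *mm) { mm->ptl_held = false; }

/* Old ordering: the PTE is cleared under the lock, the flush comes later. */
static void toy_collect_flush_late(struct toy_mm *mm)
{
    toy_lock(mm);
    mm->pte_present = false;
    toy_unlock(mm);
    /* Window: another lock holder can run before toy_late_flush(). */
}

static void toy_late_flush(struct toy_mm *mm)
{
    mm->tlb_valid = false;
}

/* Fixed ordering: clear the PTE and flush before dropping the lock. */
static void toy_collect_flush_locked(struct toy_mm *mm)
{
    toy_lock(mm);
    mm->pte_present = false;
    mm->tlb_valid = false;
    toy_unlock(mm);
}

/* What a zap-like path observes once it holds the lock: true when the
 * PTE is gone but a stale translation could still reach the page. */
static bool toy_zap_sees_stale_tlb(struct toy_mm *mm)
{
    bool stale;

    toy_lock(mm);
    stale = !mm->pte_present && mm->tlb_valid;
    toy_unlock(mm);
    return stale;
}
```

With the late-flush ordering, calling `toy_zap_sees_stale_tlb()` between `toy_collect_flush_late()` and `toy_late_flush()` reports a stale entry; with `toy_collect_flush_locked()` it never does.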
    • x86/mm: disable instrumentations of mm/pgprot.c · 818c4fda
      Naohiro Aota authored
      Commit 4867fbbd ("x86/mm: move protection_map[] inside the platform")
      moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c.  As
      a result, those accesses are now instrumented by KASAN (and other
      instrumentations), which leads to a crash during boot.
      
      Disable the instrumentations for pgprot.c like commit 67bb8e99
      ("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and
      mm/tlb.c").
      
      Before this patch, my AMD machine has failed to boot with KASAN enabled
      since v6.0-rc1, without printing anything.  After the change, it boots
      successfully.
      
      Fixes: 4867fbbd ("x86/mm: move protection_map[] inside the platform")
      Link: https://lkml.kernel.org/r/20220824084726.2174758-1-naohiro.aota@wdc.com
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: fall back to vma_address() when ->notify_failure() fails · ac87ca0e
      Dan Williams authored
      In the case where a filesystem is polled to take over the memory failure
      and receives -EOPNOTSUPP it indicates that page->index and page->mapping
      are valid for reverse mapping the failure address.  Introduce
      FSDAX_INVALID_PGOFF to distinguish when add_to_kill() is being called from
      mf_dax_kill_procs() by a filesystem vs the typical memory_failure() path.
      
      Otherwise, vma_pgoff_address() is called with an invalid fsdax_pgoff which
      then trips this failing signature:
      
       kernel BUG at mm/memory-failure.c:319!
       invalid opcode: 0000 [#1] PREEMPT SMP PTI
       CPU: 13 PID: 1262 Comm: dax-pmd Tainted: G           OE    N 6.0.0-rc2+ #62
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       RIP: 0010:add_to_kill.cold+0x19d/0x209
       [..]
       Call Trace:
        <TASK>
        collect_procs.part.0+0x2c4/0x460
        memory_failure+0x71b/0xba0
        ? _printk+0x58/0x73
        do_madvise.part.0.cold+0xaf/0xc5
      
      Link: https://lkml.kernel.org/r/166153429427.2758201.14605968329933175594.stgit@dwillia2-xfh.jf.intel.com
      Fixes: c36e2024 ("mm: introduce mf_dax_kill_procs() for fsdax case")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: fix detection of memory_failure() handlers · 65d3440e
      Dan Williams authored
      Some pagemap types, like MEMORY_DEVICE_GENERIC (device-dax), do not even
      have pagemap ops, which results in crash signatures like this:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000010
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 8000000205073067 P4D 8000000205073067 PUD 2062b3067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP PTI
        CPU: 22 PID: 4535 Comm: device-dax Tainted: G           OE    N 6.0.0-rc2+ #59
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:memory_failure+0x667/0xba0
       [..]
        Call Trace:
         <TASK>
         ? _printk+0x58/0x73
         do_madvise.part.0.cold+0xaf/0xc5
      
      Check for ops before checking if the ops have a memory_failure()
      handler.
      
      Link: https://lkml.kernel.org/r/166153428781.2758201.1990616683438224741.stgit@dwillia2-xfh.jf.intel.com
      Fixes: 33a8f7f2 ("pagemap,pmem: introduce ->memory_failure()")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
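The "check for ops before checking the handler" rule from the commit above can be sketched in plain C. The types and names below (`toy_pagemap`, `toy_pagemap_ops`, `toy_has_memory_failure_handler`, `toy_handler`) are hypothetical stand-ins chosen for illustration, not the kernel's definitions; the point is only the short-circuit order, which avoids dereferencing a NULL ops pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a pagemap and its optional ops table. */
struct toy_pagemap_ops {
    int (*memory_failure)(unsigned long pfn);
};

struct toy_pagemap {
    const struct toy_pagemap_ops *ops; /* may be NULL for some pagemap types */
};

/* Example handler, used only to populate an ops table. */
static int toy_handler(unsigned long pfn)
{
    (void)pfn;
    return 0;
}

/* Test the ops pointer first; && short-circuits, so ops->memory_failure
 * is only read when ops is non-NULL. */
static bool toy_has_memory_failure_handler(const struct toy_pagemap *pgmap)
{
    return pgmap->ops != NULL && pgmap->ops->memory_failure != NULL;
}
```

A pagemap with no ops table and a pagemap whose ops table lacks the handler both report `false`, so callers can fall through to the generic path in either case.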
    • xfs: fix SB_BORN check in xfs_dax_notify_failure() · fd63612a
      Dan Williams authored
      The SB_BORN flag is stored in the vfs superblock, not xfs_sb.
      
      Link: https://lkml.kernel.org/r/166153428094.2758201.7936572520826540019.stgit@dwillia2-xfh.jf.intel.com
      Fixes: 6f643c57 ("xfs: implement ->notify_failure() for XFS")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • xfs: quiet notify_failure EOPNOTSUPP cases · b14d067e
      Dan Williams authored
      Patch series "mm, xfs, dax: Fixes for memory_failure() handling".
      
      I failed to run the memory error injection section of the ndctl test suite
      on linux-next prior to the merge window and as a result some bugs were
      missed.  While the new enabling targeted reflink-enabled XFS filesystems,
      the bugs cropped up in the surrounding cases of DAX error injection on
      ext4-fsdax and device-dax.
      
      One new assumption / clarification in this set is the notion that if a
      filesystem's ->notify_failure() handler returns -EOPNOTSUPP, then it must
      be the case that the fsdax usage of page->index and page->mapping are
      valid.  I am fairly certain this is true for xfs_dax_notify_failure(), but
      would appreciate another set of eyes.
      
      
      This patch (of 4):
      
      XFS always registers dax_holder_operations regardless of whether the
      filesystem is capable of handling the notifications.  The expectation is
      that if the notify_failure handler cannot run then there are no scenarios
      where it needs to run.  In other words the expected semantic is that
      page->index and page->mapping are valid for memory_failure() when the
      conditions that cause -EOPNOTSUPP in xfs_dax_notify_failure() are present.
      
      A fallback to the generic memory_failure() path is expected so do not warn
      when that happens.
      
      Link: https://lkml.kernel.org/r/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com
      Link: https://lkml.kernel.org/r/166153427440.2758201.6709480562966161512.stgit@dwillia2-xfh.jf.intel.com
      Fixes: 6f643c57 ("xfs: implement ->notify_failure() for XFS")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: fix race condition between build_all_zonelists and page allocation · 3d36424b
      Mel Gorman authored
      Patrick Daly reported the following problem:
      
      	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
      	[0] - ZONE_MOVABLE
      	[1] - ZONE_NORMAL
      	[2] - NULL
      
      	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
      	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
      	memory_offline operation removes the last page from ZONE_MOVABLE,
      	build_all_zonelists() & build_zonerefs_node() will update
      	node_zonelists as shown below. Only populated zones are added.
      
      	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
      	[0] - ZONE_NORMAL
      	[1] - NULL
      	[2] - NULL
      
      The race is simple -- page allocation could be in progress when a memory
      hot-remove operation triggers a zonelist rebuild that removes zones.  The
      allocation request will still have a valid ac->preferred_zoneref that is
      now pointing to NULL and triggers an OOM kill.
      
      This problem probably always existed but may be slightly easier to trigger
      due to 6aa303de ("mm, vmscan: only allocate and reclaim from zones
      with pages managed by the buddy allocator") which distinguishes between
      zones that are completely unpopulated versus zones that have valid pages
      not managed by the buddy allocator (e.g.  reserved, memblock, ballooning
      etc).  Memory hotplug had multiple stages with timing considerations
      around managed/present page updates, the zonelist rebuild and the zone
      span updates.  As David Hildenbrand puts it
      
      	memory offlining adjusts managed+present pages of the zone
      	essentially in one go. If after the adjustments, the zone is no
      	longer populated (present==0), we rebuild the zone lists.
      
      	Once that's done, we try shrinking the zone (start+spanned
      	pages) -- which results in zone_start_pfn == 0 if there are no
      	more pages. That happens *after* rebuilding the zonelists via
      	remove_pfn_range_from_zone().
      
      The only requirement to fix the race is that a page allocation request
      identifies when a zonelist rebuild has happened since the allocation
      request started and no page has yet been allocated.  Use a seqlock_t to
      track zonelist updates with a lockless read-side of the zonelist and
      protecting the rebuild and update of the counter with a spinlock.
      
      [akpm@linux-foundation.org: make zonelist_update_seq static]
      Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
      Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
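The sequence-counter idea in the commit above (a lockless read side that detects whether a rebuild happened since the request started) can be sketched in user space with C11 atomics. This is a minimal model, not the kernel's seqlock_t: `toy_zonelist_seq`, `toy_read_begin`, `toy_read_retry` and `toy_rebuild` are hypothetical names, writers are assumed to be serialised by a lock that is not modelled here, and only the counter protocol is shown, not the protected data.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Even value: zonelists stable. Odd value: a rebuild is in progress. */
static atomic_uint toy_zonelist_seq;

/* Read side: wait out any rebuild in progress and remember the
 * (even) sequence value seen at the start of the request. */
static unsigned toy_read_begin(void)
{
    unsigned seq;

    while ((seq = atomic_load(&toy_zonelist_seq)) & 1u)
        ; /* spin until the writer finishes */
    return seq;
}

/* Read side: true if a rebuild started or finished since toy_read_begin(),
 * meaning the caller should restart with fresh zonelists. */
static bool toy_read_retry(unsigned start)
{
    return atomic_load(&toy_zonelist_seq) != start;
}

/* Write side: bump the counter before and after the update. The caller is
 * assumed to hold an external lock that serialises writers. */
static void toy_rebuild(void (*update)(void))
{
    atomic_fetch_add(&toy_zonelist_seq, 1); /* now odd */
    if (update)
        update();
    atomic_fetch_add(&toy_zonelist_seq, 1); /* even again */
}
```

An allocation path in this model would call `toy_read_begin()` at the start and, if it fails to find a page, call `toy_read_retry()` before giving up; a `true` result means the failure may be an artifact of a concurrent rebuild rather than genuine memory exhaustion.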
    • ntfs: fix BUG_ON in ntfs_lookup_inode_by_name() · 1b513f61
      ChenXiaoSong authored
      Syzkaller reported BUG_ON as follows:
      
      ------------[ cut here ]------------
      kernel BUG at fs/ntfs/dir.c:86!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10
      Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c
      RSP: 0018:ffff888079607978 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000
      RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003
      RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7
      R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110
      R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001
      FS:  00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       load_system_files+0x1f7f/0x3620
       ntfs_fill_super+0xa01/0x1be0
       mount_bdev+0x36a/0x440
       ntfs_mount+0x3a/0x50
       legacy_get_tree+0xfb/0x210
       vfs_get_tree+0x8f/0x2f0
       do_new_mount+0x30a/0x760
       path_mount+0x4de/0x1880
       __x64_sys_mount+0x2b3/0x340
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f33b62ff9ea
      Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea
      RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0
      RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24
      R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      
      Fix this by adding a sanity check on the extended system files' directory
      inode to ensure that it is a directory, just as ntfs_extend_init() does
      when mounting ntfs3.
      
      Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
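The kind of sanity check described in the commit above (reject an inode that is not a directory before using it for a name lookup) can be sketched as follows. Everything here is an illustrative assumption: `struct toy_inode`, the `TOY_IF*` mode constants and `toy_extend_inode_ok()` are hypothetical and self-contained, and they are not the ntfs driver's types or helpers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical file-type bits, modelled on the usual Unix mode layout. */
#define TOY_IFMT  0170000u /* mask for the file-type bits */
#define TOY_IFDIR 0040000u /* directory */
#define TOY_IFREG 0100000u /* regular file */

struct toy_inode {
    unsigned int i_mode; /* file type and permission bits */
};

/* Validate before use: a corrupted image may hand back an inode of the
 * wrong type, so return an error instead of asserting later on. */
static bool toy_extend_inode_ok(const struct toy_inode *ino)
{
    return ino != NULL && (ino->i_mode & TOY_IFMT) == TOY_IFDIR;
}
```

A mount path in this model would fail the mount with an error when the check returns `false`, rather than passing the inode on to a lookup routine that assumes a directory.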
  2. 28 Aug, 2022 25 commits
  3. 27 Aug, 2022 6 commits