1. 11 Sep, 2022 2 commits
    • mm/page_alloc: fix race condition between build_all_zonelists and page allocation · 3d36424b
      Mel Gorman authored
      Patrick Daly reported the following problem:
      
      	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
      	[0] - ZONE_MOVABLE
      	[1] - ZONE_NORMAL
      	[2] - NULL
      
      	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
      	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
      	memory_offline operation removes the last page from ZONE_MOVABLE,
      	build_all_zonelists() & build_zonerefs_node() will update
      	node_zonelists as shown below. Only populated zones are added.
      
      	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
      	[0] - ZONE_NORMAL
      	[1] - NULL
      	[2] - NULL
      
      The race is simple -- page allocation could be in progress when a memory
      hot-remove operation triggers a zonelist rebuild that removes zones.  The
      allocation request still holds the ac->preferred_zoneref it saved before
      the rebuild, but that entry now points to NULL, which triggers an OOM kill.
      
      This problem probably always existed but may be slightly easier to trigger
      due to 6aa303de ("mm, vmscan: only allocate and reclaim from zones
      with pages managed by the buddy allocator") which distinguishes between
      zones that are completely unpopulated versus zones that have valid pages
      not managed by the buddy allocator (e.g.  reserved, memblock, ballooning
      etc).  Memory hotplug had multiple stages with timing considerations
      around managed/present page updates, the zonelist rebuild and the zone
      span updates.  As David Hildenbrand puts it
      
      	memory offlining adjusts managed+present pages of the zone
      	essentially in one go. If after the adjustments, the zone is no
      	longer populated (present==0), we rebuild the zone lists.
      
      	Once that's done, we try shrinking the zone (start+spanned
      	pages) -- which results in zone_start_pfn == 0 if there are no
      	more pages. That happens *after* rebuilding the zonelists via
      	remove_pfn_range_from_zone().
      
      The only requirement to fix the race is that a page allocation request
      identifies when a zonelist rebuild has happened since the allocation
      request started and no page has yet been allocated.  Use a seqlock_t to
      track zonelist updates: the read side of the zonelist stays lockless,
      while the rebuild and the counter update are protected by a spinlock.
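
      The sequence-counter idea described above can be sketched in userspace C.
      This is a minimal illustration under stated assumptions, not the kernel
      patch: it uses C11 atomics instead of seqlock_t, and update_seq, zonelist,
      read_begin(), read_retry(), zone_at() and rebuild_zonelist() are all
      hypothetical names chosen for the example. Writers are assumed to be
      serialized by the caller (the commit uses a spinlock for that).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ZONEREFS 3  /* hypothetical size, matching the 3-slot example */

/* Even value: zonelist is stable. Odd value: a rebuild is in progress. */
static atomic_uint update_seq;

/* Toy zonelist: zone names stand in for zonerefs; NULL marks an empty slot. */
static _Atomic(const char *) zonelist[MAX_ZONEREFS] = {
    "ZONE_MOVABLE", "ZONE_NORMAL", NULL
};

/* Reader: take a snapshot of the counter, waiting out any rebuild in flight. */
static unsigned read_begin(void)
{
    unsigned seq;

    while ((seq = atomic_load_explicit(&update_seq, memory_order_acquire)) & 1)
        ;  /* odd: writer active, spin until it finishes */
    return seq;
}

/* Reader: true if a rebuild started or completed since read_begin(). */
static bool read_retry(unsigned cookie)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&update_seq, memory_order_relaxed) != cookie;
}

/* Reader: lockless lookup of the entry at a previously saved offset. */
static const char *zone_at(size_t idx)
{
    if (idx >= MAX_ZONEREFS)
        return NULL;
    return atomic_load_explicit(&zonelist[idx], memory_order_relaxed);
}

/* Writer: the caller must hold a lock that serializes all rebuilds. */
static void rebuild_zonelist(const char *const *zones, size_t n)
{
    unsigned prev = atomic_fetch_add_explicit(&update_seq, 1,
                                              memory_order_relaxed);

    assert((prev & 1) == 0);  /* no other rebuild may be in flight */
    atomic_thread_fence(memory_order_release);

    /* Only populated zones are added; the remaining slots become NULL. */
    for (size_t i = 0; i < MAX_ZONEREFS; i++)
        atomic_store_explicit(&zonelist[i], i < n ? zones[i] : NULL,
                              memory_order_relaxed);

    /* Back to an even value: the new zonelist is published. */
    atomic_fetch_add_explicit(&update_seq, 1, memory_order_release);
}
```

      In this sketch an allocation path would call read_begin() before saving
      its preferred offset. If zone_at() later yields NULL and read_retry()
      reports a change, the request restarts with the rebuilt zonelist instead
      of treating the empty slot as a reason to declare OOM.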
      
      [akpm@linux-foundation.org: make zonelist_update_seq static]
      Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
      Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ntfs: fix BUG_ON in ntfs_lookup_inode_by_name() · 1b513f61
      ChenXiaoSong authored
      Syzkaller reported BUG_ON as follows:
      
      ------------[ cut here ]------------
      kernel BUG at fs/ntfs/dir.c:86!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10
      Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c
      RSP: 0018:ffff888079607978 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000
      RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003
      RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7
      R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110
      R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001
      FS:  00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       load_system_files+0x1f7f/0x3620
       ntfs_fill_super+0xa01/0x1be0
       mount_bdev+0x36a/0x440
       ntfs_mount+0x3a/0x50
       legacy_get_tree+0xfb/0x210
       vfs_get_tree+0x8f/0x2f0
       do_new_mount+0x30a/0x760
       path_mount+0x4de/0x1880
       __x64_sys_mount+0x2b3/0x340
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f33b62ff9ea
      Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea
      RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0
      RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24
      R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      
      Fix this by adding a sanity check on the extended system files' directory
      inode to ensure that it is a directory, just as ntfs_extend_init() does
      when mounting ntfs3.
      
      Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 28 Aug, 2022 25 commits
  3. 27 Aug, 2022 13 commits