1. 22 Apr, 2022 5 commits
    • mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Christophe Leroy authored
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by changing their
      architecture-specific versions of the hugetlb_get_unmapped_area()
      function.  However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
      
      For the time being, only ARM64 is affected by this change.
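
The effect of the two hooks can be modelled in userspace. The following is a minimal sketch, not the kernel's code: the window sizes are hypothetical stand-ins for a 48-bit default window inside a 52-bit address space, and the `highaddr_*` helpers are illustrative names modelled on how an architecture with opt-in "high" addresses is assumed to pick its limits from the mmap() hint.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes: a 48-bit default window inside a 52-bit address space. */
#define DEFAULT_MAP_WINDOW (UINT64_C(1) << 48)
#define TASK_SIZE          (UINT64_C(1) << 52)

/* Generic fallbacks as described in the commit message: the hint is ignored,
 * the end is TASK_SIZE and the base is returned unchanged. */
static inline uint64_t generic_mmap_end(uint64_t hint)
{
    (void)hint;
    return TASK_SIZE;
}

static inline uint64_t generic_mmap_base(uint64_t hint, uint64_t base)
{
    (void)hint;
    return base;
}

/* Assumed arch-style override: only a hint above the default window
 * unlocks the full address space. */
static inline uint64_t highaddr_mmap_end(uint64_t hint)
{
    return hint > DEFAULT_MAP_WINDOW ? TASK_SIZE : DEFAULT_MAP_WINDOW;
}

/* With a high hint, the base is shifted up by the size of the extra range. */
static inline uint64_t highaddr_mmap_base(uint64_t hint, uint64_t base)
{
    return hint > DEFAULT_MAP_WINDOW ? base + TASK_SIZE - DEFAULT_MAP_WINDOW
                                     : base;
}
```

In this model, a search bounded by `highaddr_mmap_end(hint)` never returns an address above the default window unless the caller asked for one, which is the behaviour the patch extends to the generic hugetlb path.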
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: mark uffd_wp regardless of VM_WRITE flag · 0e88904c
      Nadav Amit authored
      When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
      currently only marked as write-protected if the VMA has VM_WRITE flag
      set.  This seems incorrect or at least would be unexpected by the users.
      
      Consider the following sequence of operations that are being performed
      on a certain page:
      
      	mprotect(PROT_READ)
      	UFFDIO_COPY(UFFDIO_COPY_MODE_WP)
      	mprotect(PROT_READ|PROT_WRITE)
      
      At this point the user would expect to still get UFFD notification when
      the page is accessed for write, but the user would not get one, since
      the PTE was not marked as UFFD_WP during UFFDIO_COPY.
      
      Fix it by always marking PTEs as UFFD_WP regardless of the
      write-permission in the VMA flags.
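
The difference between the old and new behaviour can be sketched as a small userspace model. The flag bits and the `install_pte_old()` / `install_pte_new()` helpers are hypothetical and exist only for illustration; they are not the kernel's PTE encodings or functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model only. */
enum { VMA_WRITE = 1 };                  /* VMA currently allows writes   */
enum { PTE_WRITE = 1, PTE_UFFD_WP = 2 }; /* PTE writable / uffd-wp marker */

/* Old behaviour: the uffd-wp marker is applied only if the VMA is writable
 * at the time of the copy. */
static inline unsigned install_pte_old(unsigned vma_flags, bool wp_copy)
{
    unsigned pte = 0;

    if (vma_flags & VMA_WRITE) {
        if (wp_copy)
            pte |= PTE_UFFD_WP;
        else
            pte |= PTE_WRITE;
    }
    return pte;
}

/* Fixed behaviour: the marker is applied whenever write-protect mode was
 * requested, independent of the VMA's current write permission. */
static inline unsigned install_pte_new(unsigned vma_flags, bool wp_copy)
{
    unsigned pte = 0;

    if (vma_flags & VMA_WRITE)
        pte |= PTE_WRITE;
    if (wp_copy) {
        pte &= ~(unsigned)PTE_WRITE; /* a wp-marked PTE is never writable */
        pte |= PTE_UFFD_WP;
    }
    return pte;
}
```

With a read-only VMA (`vma_flags == 0`) and `wp_copy == true`, the old model returns a PTE without the marker, so a later mprotect(PROT_READ|PROT_WRITE) leaves writes unreported; the new model keeps the marker.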
      
      Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com
      Fixes: 292924b2 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Shakeel Butt authored
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flushes are batched with (nr_cpus *
      MEMCG_BATCH) stat updates, it seems that some workloads genuinely
      perform more stat updates than the batch value within a short amount of
      time.  Since the rstat flush can happen in performance-critical
      codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the stats
      and only if the periodic flusher is delayed by more than twice the
      amount of its normal time window then the kernel allows rstat flushing
      from the performance critical codepaths.
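
The rule described above can be sketched as follows. This is a simplified userspace model under stated assumptions: time is an abstract tick counter, and `struct flush_state`, `periodic_flush_done()` and `should_sync_flush()` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for the periodic flusher, in abstract ticks. */
struct flush_state {
    uint64_t period;        /* normal time window of the periodic flusher  */
    uint64_t next_deadline; /* a sync flush is allowed only after this time */
};

/* The periodic flusher calls this each time it completes a flush. The
 * deadline is pushed out to twice the normal window. */
static inline void periodic_flush_done(struct flush_state *s, uint64_t now)
{
    s->next_deadline = now + 2 * s->period;
}

/* A performance-critical path calls this before deciding to flush. It
 * returns true only if the periodic flusher is late by more than twice
 * its window. */
static inline bool should_sync_flush(const struct flush_state *s, uint64_t now)
{
    return now > s->next_deadline;
}
```

With a period of 2 ticks and a flush completed at tick 100, hot paths skip the flush through tick 104 and flush synchronously from tick 105 onward, so the stats they read are at most two windows old.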
      
      Now the question: what are the side-effects of this change? The worst
      that can happen is that the refault codepath sees lruvec stats that are
      4 seconds old, which may cause false (or missed) activations of the
      refaulted page and so under- or overestimate the workingset size.  That
      is not very concerning, as the kernel can already miss or do false
      activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch and we may need to come to them in future.  One is the
      writeback stats used by dirty throttling and second is the deactivation
      heuristic in the reclaim.  For now we are keeping an eye on them, and if
      there is a report of regression due to these codepaths, we will
      reevaluate then.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: skip huge_zero_page in memory_failure() · d173d541
      Xu Yu authored
      The kernel panics when injecting a memory failure for the global
      huge_zero_page with CONFIG_DEBUG_VM enabled, as follows.
      
        Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
        page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
        head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
        flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
        raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
        page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
        ------------[ cut here ]------------
        kernel BUG at mm/huge_memory.c:2499!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
        RIP: 0010:split_huge_page_to_list+0x66a/0x880
        Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
        RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
        RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
        RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
        R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
        R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
        FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         try_to_split_thp_page+0x3a/0x130
         memory_failure+0x128/0x800
         madvise_inject_error.cold+0x8b/0xa1
         __x64_sys_madvise+0x54/0x60
         do_syscall_64+0x35/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fc3754f8bf9
        Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
        RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
        RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
        RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
        R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
        R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
      
      This patch makes memory_failure() bail out explicitly for huge_zero_page
      before attempting the split, so the panic above no longer occurs.
      
      Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
      Fixes: 6a46079c ("HWPOISON: The high level memory error handler in the VM v7")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() · 405ce051
      Naoya Horiguchi authored
      There is a race condition between memory_failure_hugetlb() and hugetlb
      free/demotion, which causes setting PageHWPoison flag on the wrong page.
      One simple result is that the wrong processes can be killed, but another
      (more serious) one is that the actual error is left unhandled, so nothing
      prevents later access to it, which might lead to more serious results
      like consuming corrupted data.
      
      Think about the below race window:
      
        CPU 1                                   CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
                                                hugetlb page might be freed to
                                                buddy, or even changed to another
                                                compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      The current code first does rough prechecks and then reconfirms after
      taking a refcount, but this turned out to make the code overly
      complicated, so move the prechecks into a single hugetlb_lock range.
      
      A newly introduced function, try_memory_failure_hugetlb(), always takes
      hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
      memory_failure() is rare in principle, so should not be a big problem.
      
      Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
      Fixes: 761ad8d7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 20 Apr, 2022 4 commits
    • Merge tag 'xtensa-20220416' of https://github.com/jcmvbkbc/linux-xtensa · b2534357
      Linus Torvalds authored
      Pull xtensa fixes from Max Filippov:
      
       - fix patching CPU selection in patch_text
      
       - fix potential deadlock in ISS platform serial driver
      
       - fix potential register clobbering in coprocessor exception handler
      
      * tag 'xtensa-20220416' of https://github.com/jcmvbkbc/linux-xtensa:
        xtensa: fix a7 clobbering in coprocessor context load/store
        arch: xtensa: platforms: Fix deadlock in rs_close()
        xtensa: patch_text: Fixup last cpu should be master
    • Merge tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 10c5f102
      Linus Torvalds authored
      Pull erofs fixes from Gao Xiang:
       "One patch to fix a use-after-free race related to the on-stack
        z_erofs_decompressqueue, which happens very rarely but needs to be
        fixed properly soon.
      
        The other patch fixes some sysfs Sphinx warnings"
      
      * tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors
        erofs: fix use-after-free of on-stack io[]
    • Revert "fs/pipe: use kvcalloc to allocate a pipe_buffer array" · 906f9040
      Linus Torvalds authored
      This reverts commit 5a519c8f.
      
      It turns out that making the pipe almost arbitrarily large has some
      rather unexpected downsides.  The kernel test robot reports a kernel
      warning that is due to pipe->max_usage now growing to the point where
      the iter_file_splice_write() buffer allocation can no longer be
      satisfied as a slab allocation, and the
      
              int nbufs = pipe->max_usage;
              struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
                                              GFP_KERNEL);
      
      code sequence there will now always fail as a result.
      
      That code could be modified to use kvcalloc() too, but I feel very
      uncomfortable making those kinds of changes for a very niche use case
      that really should have other options than make these kinds of
      fundamental changes to pipe behavior.
      
      Maybe the CRIU process dumping should be multi-threaded, and use
      multiple pipes and multiple cores, rather than try to use one larger
      pipe to minimize splice() calls.

      Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/all/20220420073717.GD16310@xsang-OptiPlex-9020/
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: __memcpy_flushcache: fix wrong alignment if size > 2^32 · a6823e4e
      Mikulas Patocka authored
      The first "if" condition in __memcpy_flushcache is supposed to align the
      "dest" variable to 8 bytes and copy data up to this alignment.  However,
      this condition may misbehave if "size" is greater than 4GiB.
      
      The statement min_t(unsigned, size, ALIGN(dest, 8) - dest); casts both
      arguments to unsigned int and selects the smaller one.  However, the
      cast truncates high bits in "size" and it results in misbehavior.
      
      For example:
      
      	suppose that size == 0x100000001, dest == 0x200000002
      	min_t(unsigned, size, ALIGN(dest, 8) - dest) == min(0x1, 0x6) == 0x1;
      	...
      	dest += 0x1;
      
      so we copy just one byte and "dest" remains unaligned.
      
      This patch fixes the bug by replacing unsigned with size_t.
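
The truncation can be reproduced in userspace. This is a minimal sketch, not the kernel's code: `align8()`, `head_len_u32()` and `head_len_full()` are hypothetical helpers, and `uint64_t` stands in for the kernel's 64-bit `size_t` so the example behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of 8, as ALIGN(x, 8) does. */
static inline uint64_t align8(uint64_t x)
{
    return (x + 7) & ~UINT64_C(7);
}

/* Buggy variant: both operands are narrowed to 32 bits before the
 * comparison, so the high bits of "size" are lost. */
static inline uint64_t head_len_u32(uint64_t size, uint64_t dest)
{
    uint32_t a = (uint32_t)size;
    uint32_t b = (uint32_t)(align8(dest) - dest);

    return a < b ? a : b;
}

/* Fixed variant: the comparison is done at full width. */
static inline uint64_t head_len_full(uint64_t size, uint64_t dest)
{
    uint64_t b = align8(dest) - dest;

    return size < b ? size : b;
}
```

For size == 0x100000001 and dest == 0x200000002, the distance to the next 8-byte boundary is 6. The narrowed comparison picks 1, leaving dest at 0x200000003, while the full-width comparison picks 6 and lands on 0x200000008.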

      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 19 Apr, 2022 3 commits
    • vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP · 559089e0
      Song Liu authored
      Huge page backed vmalloc memory could benefit performance in many cases.
      However, some users of vmalloc may not be ready to handle huge pages for
      various reasons: hardware constraints, potential page splits, etc.
      VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt out of huge
      pages.  However, it is not easy to track down all the users that require
      the opt-out, as the allocations are passed through different stacks and
      may cause issues in different layers.
      
      To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag,
      VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages can ask
      for them specifically.
      
      Also, remove vmalloc_no_huge() and add opt-in helper vmalloc_huge().
      
      Fixes: fac54e2b ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
      Link: https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'spi-fix-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · b7f73403
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A few more fixes for SPI, plus one new PCI ID for another Intel
        chipset.
      
        All device specific stuff"
      
      * tag 'spi-fix-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller
        spi: cadence-quadspi: fix incorrect supports_op() return value
        spi: intel: Add support for Raptor Lake-S SPI serial flash
        spi: spi-mtk-nor: initialize spi controller after resume
    • fs: fix acl translation · 705191b0
      Christian Brauner authored
      Last cycle we extended the idmapped mounts infrastructure to support
      idmapped mounts of idmapped filesystems (no such filesystem exists yet).
      Since then, the meaning of an idmapped mount is a mount whose idmapping
      is different from the filesystem's idmapping.
      
      While doing that work we failed to adapt the acl translation helpers.
      They still assume that checking for the identity mapping is enough.  But
      they need to use the no_idmapping() helper instead.
      
      Note, POSIX ACLs are always translated right at the userspace-kernel
      boundary using the caller's current idmapping and the initial idmapping.
      The order depends on whether we're coming from or going to userspace.
      The filesystem's idmapping doesn't matter at the border.
      
      Consequently, if a non-idmapped mount is passed we need to make sure to
      always pass the initial idmapping as the mount's idmapping and not the
      filesystem idmapping.  Because the filesystem idmapping is irrelevant
      here, passing it would yield invalid ids and prevent setting acls for
      filesystems that are mountable in a userns and support posix acls (tmpfs
      and fuse).
      
      I verified the regression reported in [1] and verified that this patch
      fixes it.  A regression test will be added to xfstests in parallel.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
      Fixes: bd303368 ("fs: support mapped mounts of mapped filesystems")
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org> # 5.17
      Cc: <regressions@lists.linux.dev>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 17 Apr, 2022 10 commits
  5. 16 Apr, 2022 7 commits
    • Merge tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 70a0cec8
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "There are a number of SoC bugfixes that came in since the merge
        window, and more of them are already pending.
      
        This batch includes:
      
         - A boot time regression fix for davinci that triggered on
           multi_v5_defconfig when booting any platform
      
         - Defconfig updates to address removed features, changed symbol names
           or dependencies, for gemini, ux500, and pxa
      
         - Email address changes for Krzysztof Kozlowski
      
         - Build warning fixes for ep93xx and iop32x
      
         - Devicetree warning fixes across many platforms
      
         - Minor bugfixes for the reset controller, memory controller and SCMI
           firmware subsystems plus the versatile-express board"
      
      * tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (34 commits)
        ARM: config: Update Gemini defconfig
        arm64: dts: qcom/sdm845-shift-axolotl: Fix boolean properties with values
        ARM: dts: align SPI NOR node name with dtschema
        ARM: dts: Fix more boolean properties with values
        arm/arm64: dts: qcom: Fix boolean properties with values
        arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
        arm: dts: imx: Fix boolean properties with values
        arm64: dts: tegra: Fix boolean properties with values
        arm: dts: at91: Fix boolean properties with values
        arm: configs: imote2: Drop defconfig as board support dropped.
        ep93xx: clock: Don't use plain integer as NULL pointer
        ep93xx: clock: Fix UAF in ep93xx_clk_register_gate()
        ARM: vexpress/spc: Fix all the kernel-doc build warnings
        ARM: vexpress/spc: Fix kernel-doc build warning for ve_spc_cpu_in_wfi
        ARM: config: u8500: Re-enable AB8500 battery charging
        ARM: config: u8500: Add some common hardware
        memory: fsl_ifc: populate child nodes of buses and mfd devices
        ARM: config: Refresh U8500 defconfig
        firmware: arm_scmi: Fix sparse warnings in OPTEE transport driver
        firmware: arm_scmi: Replace zero-length array with flexible-array member
        ...
    • Merge tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random · 92edbe32
      Linus Torvalds authored
      Pull random number generator fixes from Jason Donenfeld:
      
       - Per your suggestion, random reads now won't fail if there's a page
         fault after some non-zero amount of data has been read, which makes
         the behavior consistent with all other reads in the kernel.
      
       - Rather than an inconsistent mix of random_get_entropy() returning an
         unsigned long or a cycles_t, now it just returns an unsigned long.
      
       - A memcpy() was replaced with a memmove(), because the addresses are
         sometimes overlapping. In practice the destination is always before
         the source, so not really an issue, but better to be correct than
         not.
      
      * tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
        random: use memmove instead of memcpy for remaining 32 bytes
        random: make random_get_entropy() return an unsigned long
        random: allow partial reads if later user copies fail
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 90ea17a9
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "13 fixes, all in drivers.
      
        The most extensive changes are in the iscsi series (affecting drivers
        qedi, cxgbi and bnx2i), the next most is scsi_debug, but that's just a
        simple revert and then minor updates to pm80xx"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: iscsi: MAINTAINERS: Add Mike Christie as co-maintainer
        scsi: qedi: Fix failed disconnect handling
        scsi: iscsi: Fix NOP handling during conn recovery
        scsi: iscsi: Merge suspend fields
        scsi: iscsi: Fix unbound endpoint error handling
        scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
        scsi: iscsi: Fix endpoint reuse regression
        scsi: iscsi: Release endpoint ID when its freed
        scsi: iscsi: Fix offload conn cleanup when iscsid restarts
        scsi: iscsi: Move iscsi_ep_disconnect()
        scsi: pm80xx: Enable upper inbound, outbound queues
        scsi: pm80xx: Mask and unmask upper interrupt vectors 32-63
        Revert "scsi: scsi_debug: Address races following module load"
    • Merge tag 'intel-gpio-v5.18-2' of... · 0ebb4fbe
      Bartosz Golaszewski authored
      Merge tag 'intel-gpio-v5.18-2' of gitolite.kernel.org:pub/scm/linux/kernel/git/andy/linux-gpio-intel into gpio/for-current
      
      intel-gpio for v5.18-2
      
      * Couple of fixes related to handling unsigned value of the pin from ACPI
      
      gpiolib:
       -  acpi: Convert type for pin to be unsigned
       -  acpi: use correct format characters
    • Merge tag 'dma-mapping-5.18-2' of git://git.infradead.org/users/hch/dma-mapping · b0086839
      Linus Torvalds authored
      Pull dma-mapping fix from Christoph Hellwig:
      
       - avoid a double memory copy for swiotlb (Chao Gao)
      
      * tag 'dma-mapping-5.18-2' of git://git.infradead.org/users/hch/dma-mapping:
        dma-direct: avoid redundant memory sync for swiotlb
    • random: use memmove instead of memcpy for remaining 32 bytes · 35a33ff3
      Jason A. Donenfeld authored
      In order to immediately overwrite the old key on the stack, before
      servicing a userspace request for bytes, we use the remaining 32 bytes
      of block 0 as the key. This means moving indices 8,9,a,b,c,d,e,f ->
      4,5,6,7,8,9,a,b. Since 4 < 8, for the kernel implementations of
      memcpy(), this doesn't actually appear to be a problem in practice. But
      relying on that characteristic seems a bit brittle. So let's change that
      to a proper memmove(), which is the by-the-books way of handling
      overlapping memory copies.
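
The overlapping move described above can be sketched in userspace. This is an illustrative model only: `shift_key_words()` is a hypothetical helper, and the block is assumed to be sixteen 32-bit words so that indices 8..f cover 32 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Move words 8..15 of a 16-word block down to words 4..11 (32 bytes).
 * The source range [8, 16) and destination range [4, 12) overlap in words
 * 8..11. memcpy() has undefined behaviour for overlapping ranges in C, so
 * memmove() is the well-defined choice here. */
static inline void shift_key_words(uint32_t block[16])
{
    memmove(&block[4], &block[8], 8 * sizeof(uint32_t));
}
```

After the call on a block holding the values 0..15, words 4..11 hold 8..15, while words 0..3 and 12..15 are untouched.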

      Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • xtensa: fix a7 clobbering in coprocessor context load/store · 839769c3
      Max Filippov authored
      Fast coprocessor exception handler saves a3..a6, but coprocessor context
      load/store code uses a4..a7 as temporaries, potentially clobbering a7.
      'Potentially' because coprocessor state load/store macros may not use
      all four temporary registers (and neither FPU nor HiFi macros do).
      Use a3..a6 as intended.
      
      Cc: stable@vger.kernel.org
      Fixes: c658eac6 ("[XTENSA] Add support for configurable registers and coprocessors")
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
  6. 15 Apr, 2022 11 commits
    • Merge branch 'akpm' (patches from Andrew) · 59250f8a
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "14 patches.
      
        Subsystems affected by this patch series: MAINTAINERS, binfmt, and
        mm (tmpfs, secretmem, kasan, kfence, pagealloc, zram, compaction,
        hugetlb, vmalloc, and kmemleak)"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
        mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
        revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
        revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
        hugetlb: do not demote poisoned hugetlb pages
        mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
        mm: fix unexpected zeroed page mapping with zram swap
        mm, page_alloc: fix build_zonerefs_node()
        mm, kfence: support kmem_dump_obj() for KFENCE objects
        kasan: fix hw tags enablement when KUNIT tests are disabled
        irq_work: use kasan_record_aux_stack_noalloc() record callstack
        mm/secretmem: fix panic when growing a memfd_secret
        tmpfs: fix regressions from wider use of ZERO_PAGE
        MAINTAINERS: Broadcom internal lists aren't maintainers
    • Merge tag 'for-5.18/dm-fixes-2' of... · ce673f63
      Linus Torvalds authored
      Merge tag 'for-5.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fixes from Mike Snitzer:
      
       - Fix memory corruption in DM integrity target when tag_size is less
         than digest size.
      
       - Fix DM multipath's historical-service-time path selector to not use
         sched_clock() and ktime_get_ns(); only use ktime_get_ns().
      
       - Fix dm_io->orig_bio NULL pointer dereference in dm_zone_map_bio() due
         to 5.18 changes that overlooked DM zone's use of ->orig_bio
      
       - Fix for regression that broke the use of dm_accept_partial_bio() for
         "abnormal" IO (e.g. WRITE ZEROES) that does not need duplicate bios
      
       - Fix DM's issuing of empty flush bio so that its size is 0.
      
      * tag 'for-5.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm: fix bio length of empty flush
        dm: allow dm_accept_partial_bio() for dm_io without duplicate bios
        dm zone: fix NULL pointer dereference in dm_zone_map_bio
        dm mpath: only use ktime_get_ns() in historical selector
        dm integrity: fix memory corruption when tag_size is less than digest size
      ce673f63
    • Patrick Wang's avatar
      mm: kmemleak: take a full lowmem check in kmemleak_*_phys() · 23c2d497
      Patrick Wang authored
      The kmemleak_*_phys() APIs do not check the address against lowmem's
      min boundary, while the caller may pass an address below lowmem, which
      will trigger an oops:
      
        # echo scan > /sys/kernel/debug/kmemleak
        Unable to handle kernel paging request at virtual address ff5fffffffe00000
        Oops [#1]
        Modules linked in:
        CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33
        Hardware name: riscv-virtio,qemu (DT)
        epc : scan_block+0x74/0x15c
         ra : scan_block+0x72/0x15c
        epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30
         gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200
         t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90
         s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000
         a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001
         a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005
         s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90
         s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0
         s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000
         s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f
         t5 : 0000000000000001 t6 : 0000000000000000
        status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d
          scan_gray_list+0x12e/0x1a6
          kmemleak_scan+0x2aa/0x57e
          kmemleak_write+0x32a/0x40c
          full_proxy_write+0x56/0x82
          vfs_write+0xa6/0x2a6
          ksys_write+0x6c/0xe2
          sys_write+0x22/0x2a
          ret_from_syscall+0x0/0x2
      
      The callers may not quite know the actual address they pass (e.g. from
      devicetree).  So the kmemleak_*_phys() APIs should guarantee that the
      address they finally use is in the lowmem range; check the address
      against lowmem's min boundary.
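
      A minimal userspace sketch of the full range check described above,
      assuming a 4 KiB page size. The struct and the sketch_* names are
      hypothetical stand-ins for the kernel's min_low_pfn and max_low_pfn
      bounds; this is not the patched kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's PAGE_SHIFT (4 KiB pages assumed). */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical container for the lowmem bounds, expressed as page frame numbers. */
struct lowmem_range {
    uint64_t min_low_pfn; /* first page frame number in lowmem */
    uint64_t max_low_pfn; /* one past the last lowmem page frame number */
};

/* Convert a physical address to its page frame number. */
uint64_t sketch_phys_to_pfn(uint64_t phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

/* True only when phys lies inside [min_low_pfn, max_low_pfn). */
bool sketch_phys_in_lowmem(const struct lowmem_range *r, uint64_t phys)
{
    uint64_t pfn;

    assert(r != NULL);
    pfn = sketch_phys_to_pfn(phys);
    /* Check both boundaries, not only the upper one. */
    return pfn >= r->min_low_pfn && pfn < r->max_low_pfn;
}
```

      A caller would skip the kmemleak operation whenever the helper returns
      false, so an address below lowmem is never dereferenced by a scan.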
      
      Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com
      Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23c2d497
    • Omar Sandoval's avatar
      mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore · c12cd77c
      Omar Sandoval authored
      Commit 3ee48b6a ("mm, x86: Saving vmcore with non-lazy freeing of
      vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
      lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
      purge the vmap areas instead of doing it lazily.
      
      Commit 690467c8 ("mm/vmalloc: Move draining areas out of caller
      context") moved the purging from the vunmap() caller to a worker thread.
      Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
      (possibly forever).  For example, consider the following scenario:
      
       1. Thread reads from /proc/vmcore. This eventually calls
          __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
          vmap_lazy_nr to lazy_max_pages() + 1.
      
       2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
          pages (one page plus the guard page) to the purge list and
          vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
          drain_vmap_work is scheduled.
      
       3. Thread returns from the kernel and is scheduled out.
      
       4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
          frees the 2 pages on the purge list. vmap_lazy_nr is now
          lazy_max_pages() + 1.
      
       5. This is still over the threshold, so it tries to purge areas again,
          but doesn't find anything.
      
       6. Repeat 5.
      
      If the system is running with only one CPU (which is typical for kdump)
      and preemption is disabled, then this will never make forward progress:
      there aren't any more pages to purge, so it hangs.  If there is more
      than one CPU or preemption is enabled, then the worker thread will spin
      forever in the background.  (Note that if there were already pages to be
      purged at the time that set_iounmap_nonlazy() was called, this bug is
      avoided.)
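
      The scenario above can be modelled with a few counters. This is a
      simplified userspace sketch, not the kernel implementation: the struct,
      the model_* names and the max_passes bound are all hypothetical and
      exist only so that the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the counters described above; not kernel code. */
struct vmap_model {
    unsigned long lazy_max_pages; /* threshold, as returned by lazy_max_pages() */
    unsigned long vmap_lazy_nr;   /* pages accounted as lazily freed */
    unsigned long purge_list;     /* pages actually waiting on the purge list */
};

/* Step 1: bump the counter past the threshold without queuing any pages. */
void model_set_iounmap_nonlazy(struct vmap_model *m)
{
    m->vmap_lazy_nr = m->lazy_max_pages + 1;
}

/* Step 2: freeing an area adds its pages to both the list and the counter. */
void model_free_area(struct vmap_model *m, unsigned long pages)
{
    m->purge_list += pages;
    m->vmap_lazy_nr += pages;
}

/* One purge pass: frees whatever is on the list and returns the pages freed. */
unsigned long model_purge(struct vmap_model *m)
{
    unsigned long freed = m->purge_list;

    assert(m->vmap_lazy_nr >= freed);
    m->vmap_lazy_nr -= freed;
    m->purge_list = 0;
    return freed;
}

/*
 * Steps 4-6: the worker keeps purging while the counter is over the
 * threshold. Returns true if it would still be spinning after max_passes.
 */
bool model_drain_work_spins(struct vmap_model *m, unsigned int max_passes)
{
    unsigned int pass;

    for (pass = 0; pass < max_passes; pass++) {
        if (m->vmap_lazy_nr <= m->lazy_max_pages)
            return false; /* below the threshold: the worker can stop */
        model_purge(m);
    }
    return m->vmap_lazy_nr > m->lazy_max_pages;
}
```

      In this model, calling model_set_iounmap_nonlazy() and then
      model_free_area() with 2 pages leaves the counter stuck one above the
      threshold after the first purge, so every later pass finds nothing to
      free.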
      
      This can be reproduced with anything that reads from /proc/vmcore
      multiple times.  E.g., vmcore-dmesg /proc/vmcore.
      
      It turns out that improvements to vmap() over the years have obsoleted
      the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
      of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
      The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
      5.18-rc1 with set_iounmap_nonlazy() removed entirely:
      
          |5.17  |5.18+fix|5.18+removal
        4k|40.86s|  40.09s|      26.73s
        1M|24.47s|  23.98s|      21.84s
      
      The removal was the fastest (by a wide margin with 4k reads).  This
      patch removes set_iounmap_nonlazy().
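
      For reference, the relative run-time reduction implied by the table can
      be computed as below. The helper name is hypothetical and the snippet
      only restates arithmetic on the figures already reported above.

```c
#include <assert.h>

/* Fractional reduction in run time going from `before` to `after` seconds. */
double sketch_time_reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}
```

      Applied to the 5.17 and 5.18+removal columns, this gives roughly a 35%
      reduction for 4k reads (40.86s to 26.73s) and roughly 11% for 1M reads
      (24.47s to 21.84s).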
      
      Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
      Fixes: 690467c8 ("mm/vmalloc: Move draining areas out of caller context")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c12cd77c
    • Andrew Morton's avatar
      revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" · aeb79237
      Andrew Morton authored
      Despite Mike's attempted fix (925346c1), regression reports
      continue:
      
        https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
        https://bugzilla.kernel.org/show_bug.cgi?id=215720
        https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info
      
      So revert this patch.
      
      Fixes: 9630f0d6 ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aeb79237
    • Andrew Morton's avatar
      revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" · 354e923d
      Andrew Morton authored
      Commit 925346c1 ("fs/binfmt_elf: fix PT_LOAD p_align values for
      loaders") was an attempt to fix regressions due to 9630f0d6
      ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").
      
      But regressions continue to be reported:
      
        https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
        https://bugzilla.kernel.org/show_bug.cgi?id=215720
        https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info
      
      This patch reverts the fix, so the original can also be reverted.
      
      Fixes: 925346c1 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      354e923d
    • Mike Kravetz's avatar
      hugetlb: do not demote poisoned hugetlb pages · 5a317412
      Mike Kravetz authored
      It is possible for poisoned hugetlb pages to reside on the free lists.
      The huge page allocation routines which dequeue entries from the free
      lists make a point of avoiding poisoned pages.  There is no such check
      and avoidance in the demote code path.
      
      If a hugetlb page is on a free list, poison will only be set in the
      head page rather than the page with the actual error.  If such a page
      is demoted, then the poison flag may follow the wrong page.  A page
      without an error could have poison set, and a page with an error might
      not have the flag set.
      
      Check for poison before attempting to demote a hugetlb page.  Also,
      return -EBUSY to the caller if only poisoned pages are on the free list.
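
      A minimal userspace sketch of the check described above. The
      sketch_hpage struct and function name are hypothetical stand-ins, the
      free list is modelled as a plain array, and returning -EBUSY for an
      empty list is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a free hugetlb page; not the kernel's struct page. */
struct sketch_hpage {
    bool hwpoison; /* models the poison flag on the head page */
    bool demoted;  /* set once the sketch "demotes" the page */
};

/*
 * Demote the first non-poisoned page on the free list.
 * Returns 0 on success, or -EBUSY if no page could be demoted because
 * every free page is poisoned (or, by assumption, the list is empty).
 */
int sketch_demote_pool_huge_page(struct sketch_hpage *free_list, size_t n)
{
    size_t i;

    assert(free_list != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (free_list[i].hwpoison)
            continue; /* never demote a poisoned page */
        free_list[i].demoted = true;
        return 0;
    }
    return -EBUSY;
}
```

      Skipping poisoned entries mirrors what the text says the allocation
      routines already do when dequeuing from the free lists.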
      
      Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
      Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a317412
    • Charan Teja Kalla's avatar
      mm: compaction: fix compiler warning when CONFIG_COMPACTION=n · 31ca72fa
      Charan Teja Kalla authored
      The below warning is reported when CONFIG_COMPACTION=n:
      
         mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' defined but not used [-Wunused-const-variable=]
            56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
               |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under the
      CONFIG_COMPACTION #ifdef.
      
      Also since this is just a 'static const int' type, use #define for it.
      
      Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31ca72fa
    • Minchan Kim's avatar
      mm: fix unexpected zeroed page mapping with zram swap · e914d8f0
      Minchan Kim authored
      When two processes share an address space via CLONE_VM, a user process
      can be corrupted by unexpectedly seeing a zeroed page.
      
            CPU A                        CPU B
      
        do_swap_page                do_swap_page
        SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
        swap_readpage valid data
          swap_slot_free_notify
            delete zram entry
                                    swap_readpage zeroed(invalid) data
                                    pte_lock
                                    map the *zero data* to userspace
                                    pte_unlock
        pte_lock
        if (!pte_same)
          goto out_nomap;
        pte_unlock
        return and next refault will
        read zeroed data
      
      The swap_slot_free_notify is bogus for the CLONE_VM case since the
      refcount of the swap slot is not increased at copy_mm, so it cannot
      tell whether it is safe to discard data from the backing device.  In
      that case, the only lock it can rely on to synchronize swap slot
      freeing is the page table lock.  Thus, this patch gets rid of the
      swap_slot_free_notify function.  With this patch, CPU A will see
      correct data.
      
            CPU A                        CPU B
      
        do_swap_page                do_swap_page
        SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
                                    swap_readpage original data
                                    pte_lock
                                    map the original data
                                    swap_free
                                      swap_range_free
                                        bd_disk->fops->swap_slot_free_notify
        swap_readpage read zeroed data
                                    pte_unlock
        pte_lock
        if (!pte_same)
          goto out_nomap;
        pte_unlock
        return
        on next refault will see mapped data by CPU B
      
      The concern with this patch is that it could increase memory
      consumption, since memory may be held both in compressed form in zram
      and in uncompressed form in the address space.  However, most zram use
      cases have no readahead, and do_swap_page is followed by swap_free, so
      the compressed form will be freed from zram quickly.
      
      Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
      Fixes: 0bcac06f ("mm, swap: skip swapcache for swapin of synchronous device")
      Reported-by: Ivan Babrou <ivan@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e914d8f0
    • Juergen Gross's avatar
      mm, page_alloc: fix build_zonerefs_node() · e553f62f
      Juergen Gross authored
      Since commit 6aa303de ("mm, vmscan: only allocate and reclaim from
      zones with pages managed by the buddy allocator") only zones with free
      memory are included in a built zonelist.  This is problematic when e.g.
      all memory of a zone has been ballooned out when zonelists are being
      rebuilt.
      
      The decision whether to rebuild the zonelists when onlining new memory
      is based on populated_zone() returning 0 for the zone the memory will
      be added to.  The new zone is added to the zonelists only if it has
      free memory pages (managed_zone() returns a non-zero value) after the
      memory has been onlined.  This implies that onlining memory will always
      free the added pages to the allocator immediately, but this is not true
      in all cases: when e.g. running as a Xen guest, the onlined new memory
      will be added only to the ballooned memory list, and it will be freed
      only when the guest is being ballooned up afterwards.
      
      Another problem with using managed_zone() to decide whether a zone is
      added to the zonelists is that a zone with all memory used will in fact
      be removed from all zonelists in case the zonelists happen to be
      rebuilt.
      
      Use populated_zone() when building a zonelist as it has been done before
      that commit.
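
      The difference between the two predicates can be illustrated with a
      simplified userspace sketch. The sketch_zone struct and all sketch_*
      names are hypothetical, and the real zonelist construction is more
      involved than this linear filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone; field names mirror the concepts in the text. */
struct sketch_zone {
    unsigned long present_pages; /* physical pages that exist in the zone */
    unsigned long managed_pages; /* pages currently managed by the buddy allocator */
};

/* A zone is populated if it has any present pages. */
bool sketch_populated_zone(const struct sketch_zone *z)
{
    return z->present_pages != 0;
}

/* A zone is managed only if the buddy allocator currently manages pages in it. */
bool sketch_managed_zone(const struct sketch_zone *z)
{
    return z->managed_pages != 0;
}

/* Collect the indices of zones accepted by `keep` into out[]; returns the count. */
size_t sketch_build_zonerefs(const struct sketch_zone *zones, size_t n,
                             bool (*keep)(const struct sketch_zone *),
                             size_t *out)
{
    size_t i, nr = 0;

    assert(zones != NULL && keep != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (keep(&zones[i]))
            out[nr++] = i;
    }
    return nr;
}
```

      A zone whose memory is entirely ballooned out has present pages but no
      managed pages, so filtering with sketch_managed_zone() drops it while
      sketch_populated_zone() keeps it.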
      
      There was a report that QubesOS (based on Xen) is hitting this problem.
      Xen has switched to use the zone device functionality in kernel 5.9 and
      QubesOS wants to use memory hotplugging for guests in order to be able
      to start a guest with minimal memory and expand it as needed.  This was
      the report leading to the patch.
      
      Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
      Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e553f62f
    • Marco Elver's avatar
      mm, kfence: support kmem_dump_obj() for KFENCE objects · 2dfe63e6
      Marco Elver authored
      Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
      producing garbage data due to the object not actually being maintained
      by SLAB or SLUB.
      
      Fix this by implementing __kfence_obj_info() that copies relevant
      information to struct kmem_obj_info when the object was allocated by
      KFENCE; this is called by a common kmem_obj_info(), which also calls the
      slab/slub/slob specific variant now called __kmem_obj_info().
      
      For completeness, kmem_dump_obj() now displays if the object was
      allocated by KFENCE.
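
      The dispatch described above can be sketched as follows. All types and
      sketch_* names are hypothetical simplifications, and trying the KFENCE
      variant first before falling back to the slab variant is an assumption
      of this sketch rather than something the text states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator tag; the real struct kmem_obj_info carries more fields. */
enum sketch_allocator { SKETCH_ALLOC_SLAB, SKETCH_ALLOC_KFENCE };

struct sketch_obj_info {
    enum sketch_allocator allocator; /* which allocator owns the object */
    size_t size;                     /* object size in bytes */
};

/* Hypothetical object descriptor used only to drive the sketch. */
struct sketch_object {
    bool from_kfence;
    size_t size;
};

/* Fills *info and returns true only for KFENCE-allocated objects. */
bool sketch_kfence_obj_info(struct sketch_obj_info *info,
                            const struct sketch_object *obj)
{
    if (!obj->from_kfence)
        return false;
    info->allocator = SKETCH_ALLOC_KFENCE;
    info->size = obj->size;
    return true;
}

/* Allocator-specific variant for ordinary slab objects. */
void sketch_slab_obj_info(struct sketch_obj_info *info,
                          const struct sketch_object *obj)
{
    info->allocator = SKETCH_ALLOC_SLAB;
    info->size = obj->size;
}

/* Common entry point: try KFENCE first, then fall back to the slab variant. */
void sketch_kmem_obj_info(struct sketch_obj_info *info,
                          const struct sketch_object *obj)
{
    assert(info != NULL && obj != NULL);
    if (sketch_kfence_obj_info(info, obj))
        return;
    sketch_slab_obj_info(info, obj);
}
```

      Recording the owning allocator in the result is what lets a dump
      routine report that an object came from KFENCE.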
      
      Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
      Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
      Fixes: b89fb5ef ("mm, kfence: insert KFENCE hooks for SLUB")
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>	[slab]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2dfe63e6