1. 27 Jul, 2024 4 commits
    • Merge tag 'mm-hotfixes-stable-2024-07-26-14-33' of... · 7b0acd91
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull misc hotfixes from Andrew Morton:
       "11 hotfixes, 7 of which are cc:stable.  7 are MM, 4 are other"
      
      * tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        nilfs2: handle inconsistent state in nilfs_btnode_create_block()
        selftests/mm: skip test for non-LPA2 and non-LVA systems
        mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()
        mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node
        alloc_tag: outline and export free_reserved_page()
        decompress_bunzip2: fix rare decompression failure
        mm/huge_memory: avoid PMD-size page cache if needed
        mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines
        mm: fix old/young bit handling in the faulting path
        dt-bindings: arm: update James Clark's email address
        MAINTAINERS: mailmap: update James Clark's email address
    • Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5256184b
      Linus Torvalds authored
      Pull timer migration updates from Thomas Gleixner:
       "Fixes and minor updates for the timer migration code:
      
         - Stop testing the group->parent pointer as it is not guaranteed to
           be stable over a chain of operations by design.
      
           This includes a warning which would be nice to have but it produces
           false positives due to the racy nature of the check.
      
         - Plug a race between CPUs going in and out of idle and a CPU hotplug
           operation. The latter can create and connect a new hierarchy level
           which is missed in the concurrent updates of CPUs which go into
           idle. As a result the events of such a CPU might not be processed
           and timers go stale.
      
           Cure it by splitting the hotplug operation into a prepare and
           online callback. The prepare callback is guaranteed to run on an
           online and therefore active CPU. This CPU updates the hierarchy and
           being online ensures that there is always at least one migrator
           active which handles the modified hierarchy correctly when going
           idle. The online callback which runs on the incoming CPU then just
           marks the CPU active and brings it into operation.
      
         - Improve tracing and polish the code further so it is more obvious
           what's going on"
      
      * tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timers/migration: Fix grammar in comment
        timers/migration: Spare write when nothing changed
        timers/migration: Rename childmask by groupmask to make naming more obvious
        timers/migration: Read childmask and parent pointer in a single place
        timers/migration: Use a single struct for hierarchy walk data
        timers/migration: Improve tracing
        timers/migration: Move hierarchy setup into cpuhotplug prepare callback
        timers/migration: Do not rely always on group->parent
    • Merge tag 'riscv-for-linus-6.11-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · c9f33436
      Linus Torvalds authored
      Pull more RISC-V updates from Palmer Dabbelt:
      
       - Support for NUMA (via SRAT and SLIT), console output (via SPCR), and
         cache info (via PPTT) on ACPI-based systems.
      
       - The trap entry/exit code no longer breaks the return address stack
         predictor on many systems, which results in an improvement to trap
         latency.
      
       - Support for HAVE_ARCH_STACKLEAK.
      
       - The sv39 linear map has been extended to support 128GiB mappings.
      
       - The frequency of the mtime CSR is now visible via hwprobe.
      
      * tag 'riscv-for-linus-6.11-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (21 commits)
        RISC-V: Provide the frequency of time CSR via hwprobe
        riscv: Extend sv39 linear mapping max size to 128G
        riscv: enable HAVE_ARCH_STACKLEAK
        riscv: signal: Remove unlikely() from WARN_ON() condition
        riscv: Improve exception and system call latency
        RISC-V: Select ACPI PPTT drivers
        riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT
        riscv: cacheinfo: remove the useless input parameter (node) of ci_leaf_init()
        RISC-V: ACPI: Enable SPCR table for console output on RISC-V
        riscv: boot: remove duplicated targets line
        trace: riscv: Remove deprecated kprobe on ftrace support
        riscv: cpufeature: Extract common elements from extension checking
        riscv: Introduce vendor variants of extension helpers
        riscv: Add vendor extensions to /proc/cpuinfo
        riscv: Extend cpufeature.c to detect vendor extensions
        RISC-V: run savedefconfig for defconfig
        RISC-V: hwprobe: sort EXT_KEY()s in hwprobe_isa_ext0() alphabetically
        ACPI: NUMA: replace pr_info with pr_debug in arch_acpi_numa_init
        ACPI: NUMA: change the ACPI_NUMA to a hidden option
        ACPI: NUMA: Add handler for SRAT RINTC affinity structure
        ...
    • Merge tag 'for-linus-6.11-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · c17f1224
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
       "Two fixes for issues introduced in this merge window:
      
         - fix enhanced debugging in the Xen multicall handling
      
         - two patches fixing a boot failure when running as dom0 in PVH mode"
      
      * tag 'for-linus-6.11-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        x86/xen: fix memblock_reserve() usage on PVH
        x86/xen: move xen_reserve_extra_memory()
        xen: fix multicall debug data referencing
  2. 26 Jul, 2024 35 commits
    • minmax: avoid overly complicated constant expressions in VM code · 3a7e02c0
      Linus Torvalds authored
      The minmax infrastructure is overkill for simple constants, and can
      cause huge expansions because those simple constants are then used by
      other things.
      
      For example, 'pageblock_order' is a core VM constant, but because it was
      implemented using 'min_t()' and all the type-checking that involves, it
      actually expanded to something like 2.5kB of preprocessor noise.
      
      And when that simple constant was then used inside other expansions:
      
        #define pageblock_nr_pages      (1UL << pageblock_order)
        #define pageblock_start_pfn(pfn)  ALIGN_DOWN((pfn), pageblock_nr_pages)
      
      and we then use that inside a 'max()' macro:
      
      	case ISOLATE_SUCCESS:
      		update_cached = false;
      		last_migrated_pfn = max(cc->zone->zone_start_pfn,
      			pageblock_start_pfn(cc->migrate_pfn - 1));
      
      the end result was that one statement expanded to 253kB in size.
      
      There are probably other cases of this, but this one case certainly
      stood out.
      
      I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple
      constant with specific type" use.  These macros skip the type checking,
      and as such should be used very sparingly, and only for obvious cases
      that have active issues like this.
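      As a rough illustration, the sketch below shows what a check-free
      MIN_T()/MAX_T() pair might look like.  The macro bodies are an
      assumption (a plain cast-and-compare, not the kernel's exact
      definitions), and the SKETCH_* order values are hypothetical stand-ins
      for a 4KiB-page configuration, loosely modeled on pageblock_order.

```c
/*
 * Simplified sketch (an assumption, not the kernel's definitions):
 * cast both operands to the requested type and compare, with none of
 * min()/max()'s signedness checks.  Each argument is expanded twice,
 * so only use these with side-effect-free constants.
 */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))
#define MAX_T(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical constants standing in for a 4KiB-page configuration. */
#define SKETCH_PMD_ORDER      9
#define SKETCH_MAX_PAGE_ORDER 10

/* A small constant expression, loosely modeled on pageblock_order. */
#define sketch_pageblock_order \
	MIN_T(unsigned int, SKETCH_PMD_ORDER, SKETCH_MAX_PAGE_ORDER)
#define sketch_pageblock_nr_pages (1UL << sketch_pageblock_order)

/* Round a pfn down to the start of its (power-of-two sized) pageblock. */
unsigned long sketch_pageblock_start_pfn(unsigned long pfn)
{
	return pfn & ~(sketch_pageblock_nr_pages - 1);
}
```

      Because nothing but a cast and a comparison is emitted, the expansion
      stays small even when the result is reused inside other macros.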
      Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Link: https://lore.kernel.org/all/36aa2cad-1db1-4abf-8dd2-fb20484aabc3@lucifer.local/
      Cc: David Laight <David.Laight@aculab.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • minmax: avoid overly complex min()/max() macro arguments in xen · e8432ac8
      Linus Torvalds authored
      We have some very fancy min/max macros that have tons of sanity checking
      to warn about mixed signedness etc.
      
      These are all things that a sane compiler should warn about, but there
      are no sane compiler interfaces for this, and '-Wsign-compare' is
      broken [1] and not useful.
      
      So then we compensate (some would say over-compensate) by doing the
      checks manually with some truly horrid macro games.
      
      And no, we can't just use __builtin_types_compatible_p(), because the
      whole question of "does it make sense to compare these two values" is a
      lot more complicated than that.
      
      For example, it makes a ton of sense to compare unsigned values with
      simple constants like "5", even if that is indeed a signed type.  So we
      have these very strange macros to try to make sensible type checking
      decisions on the arguments to 'min()' and 'max()'.
      
      But that can cause enormous code expansion if the min()/max() macros are
      used with complicated expressions, and particularly if you nest these
      things so that you get the first big expansion then expanded again.
      
      The xen setup.c file ended up ballooning to over 50MB of preprocessed
      noise that takes 15s to compile (obviously depending on the build host),
      largely due to one single line.
      
      So let's split that one single line to just be simpler.  I think it ends
      up being more legible to humans too at the same time.  Now that single
      file compiles in under a second.
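      The effect can be sketched with a deliberately naive min() macro
      (hypothetical, and far simpler than the kernel's): every nesting level
      copies its arguments into the expansion twice, so a three-level nest
      grows much faster than three separate one-level statements.  The names
      SKETCH_MIN, NESTED and the p/q/r/s operands are illustrative only; this
      is not the xen setup.c code.

```c
#include <string.h>

/* Deliberately naive min(): each argument appears twice in the expansion. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Stringify after macro expansion, to measure the expanded text. */
#define SKETCH_STR(x)  #x
#define SKETCH_XSTR(x) SKETCH_STR(x)

/* Hypothetical three-level nest, all in one "clever" line. */
#define NESTED SKETCH_MIN(SKETCH_MIN(SKETCH_MIN(p, q), r), s)

/* Nested form: one statement, large expansion. */
int min4_nested(int p, int q, int r, int s)
{
	return NESTED;
}

/* Split form: same result, each statement expands only one level. */
int min4_split(int p, int q, int r, int s)
{
	int t = SKETCH_MIN(p, q);

	t = SKETCH_MIN(t, r);
	return SKETCH_MIN(t, s);
}

/* Length of the preprocessed text of the nested form. */
size_t nested_len(void)
{
	return strlen(SKETCH_XSTR(NESTED));
}

/* Length of the preprocessed text of a single one-level min. */
size_t single_len(void)
{
	return strlen(SKETCH_XSTR(SKETCH_MIN(p, q)));
}
```

      The kernel's min()/max() add far more text per level for their type
      checks, so the same growth is much steeper there.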
      Reported-and-reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Link: https://lore.kernel.org/all/c83c17bb-be75-4c67-979d-54eee38774c6@lucifer.local/
      Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1]
      Cc: David Laight <David.Laight@aculab.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nilfs2: handle inconsistent state in nilfs_btnode_create_block() · 4811f7af
      Ryusuke Konishi authored
      Syzbot reported that a buffer state inconsistency was detected in
      nilfs_btnode_create_block(), triggering a kernel bug.
      
      It is not appropriate to treat this inconsistency as a bug; it can occur
      if the argument block address (the buffer index of the newly created
      block) is a virtual block number and has been reallocated due to
      corruption of the bitmap used to manage its allocation state.
      
      So, modify nilfs_btnode_create_block() and its callers to treat it as a
      possible filesystem error, rather than triggering a kernel bug.
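      A minimal userspace sketch of the pattern follows.  The buffer
      structure, the particular state bits checked and the -EIO return value
      are assumptions for illustration, not the actual nilfs2 code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state bits of a buffer head. */
struct sketch_buffer {
	bool mapped;
	bool uptodate;
	bool dirty;
};

/*
 * A newly created b-tree node buffer is expected to be unused.  If it is
 * not, the block address was probably reused because of a corrupted
 * allocation bitmap, so return an error for the caller to propagate
 * instead of stopping the kernel with BUG_ON().
 */
int sketch_btnode_create_block(struct sketch_buffer *bh)
{
	if (bh->mapped || bh->uptodate || bh->dirty)
		return -EIO;

	/* Normal path: claim the buffer for the new node. */
	bh->mapped = true;
	bh->uptodate = true;
	return 0;
}
```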
      
      Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com
      Fixes: a60be987 ("nilfs2: B-tree node cache")
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: skip test for non-LPA2 and non-LVA systems · f556acc2
      Dev Jain authored
      After my improvement of the test in e4a4ba41 ("selftests/mm:
      va_high_addr_switch: dynamically initialize testcases to enable LPA2
      testing"), the test began to fail with 4k and 16k pages on non-LPA2
      systems.  To reduce noise in the CI systems, skip the test when the
      higher address space is not implemented.
      
      Link: https://lkml.kernel.org/r/20240718052504.356517-1-dev.jain@arm.com
      Fixes: e4a4ba41 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing")
      Signed-off-by: Dev Jain <dev.jain@arm.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mark Brown <broonie@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist() · 66eca102
      Li Zhijian authored
      It's expected that no page should be left in pcp_list after calling
      zone_pcp_disable() in offline_pages().  Previously, offline_pages() was
      observed to get stuck [1] because some pages remained in pcp_list.
      
      Cause:
      There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
      involving the pcp->count variable. See the scenario below:
      
               CPU0                              CPU1
          ----------------                    ---------------
                                            spin_lock(&pcp->lock);
                                            __rmqueue_pcplist() {
      zone_pcp_disable() {
                                              /* list is empty */
                                              if (list_empty(list)) {
                                                /* add pages to pcp_list */
                                                alloced = rmqueue_bulk()
        mutex_lock(&pcp_batch_high_lock)
        ...
        __drain_all_pages() {
          drain_pages_zone() {
            /* read pcp->count, it's 0 here */
            count = READ_ONCE(pcp->count)
            /* 0 means nothing to drain */
                                                /* update pcp->count */
                                                pcp->count += alloced << order;
            ...
                                            ...
                                            spin_unlock(&pcp->lock);
      
      In this case, even after zone_pcp_disable() has been called, some pages
      remain in pcp_list.  Because these pages are neither movable nor
      isolated, offline_pages() gets stuck.
      
      Solution:
      Expand the scope of the pcp->lock to also protect pcp->count in
      drain_pages_zone(), to ensure no pages are left in the pcp list after
      zone_pcp_disable().
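      The shape of the fix can be modeled in userspace as below.  This is a
      hypothetical sketch: a pthread mutex stands in for pcp->lock, a plain
      counter stands in for the pcp list, and the batch size is made up.  The
      point is that the count is read only while holding the same lock that
      the refill path holds when it updates the count.

```c
#include <pthread.h>

/* Hypothetical model: a lock plus a page count standing in for the pcp list. */
struct sketch_pcp {
	pthread_mutex_t lock;
	int count;
};

/* Refill path (cf. __rmqueue_pcplist()): updates count under the lock. */
void sketch_refill(struct sketch_pcp *pcp, int alloced)
{
	pthread_mutex_lock(&pcp->lock);
	pcp->count += alloced;
	pthread_mutex_unlock(&pcp->lock);
}

/*
 * Drain path (cf. drain_pages_zone()) after the fix: count is read only
 * while holding the lock, so a concurrent refill cannot update it between
 * the read and the drain.  Drains in batches; returns pages drained.
 */
int sketch_drain(struct sketch_pcp *pcp, int batch)
{
	int count, drained = 0;

	do {
		pthread_mutex_lock(&pcp->lock);
		count = pcp->count;
		if (count) {
			int to_drain = count < batch ? count : batch;

			pcp->count -= to_drain;
			drained += to_drain;
			count -= to_drain;
		}
		pthread_mutex_unlock(&pcp->lock);
	} while (count);

	return drained;
}
```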
      
      [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
      
      Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com
      Fixes: 4b23a68f ("mm/page_alloc: protect PCP lists with a spinlock")
      Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
      Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node · f59adcf5
      Roman Gushchin authored
      Oliver Sang reported a performance regression caused by commit
      98c9daf5 ("mm: memcg: guard memcg1-specific members of struct
      mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node
      structure under the CONFIG_MEMCG_V1 config option.  Apparently it causes
      false cache sharing between the lruvec and lru_zone_size members of the
      structure.  Fix it by adding explicit padding after the lruvec member.
      
      Even though the padding is not required with CONFIG_MEMCG_V1 set, it seems
      like the introduced memory overhead is not significant enough to warrant
      another divergence in the mem_cgroup_per_node layout, so the padding is
      added unconditionally.
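      The layout effect can be sketched with hypothetical stand-in
      structures, assuming a 64-byte cache line and a structure that starts
      on a cache-line boundary.  C11 _Alignas is used here as a stand-in for
      whatever padding mechanism the actual patch uses.

```c
#include <stddef.h>

#define SKETCH_CACHELINE 64	/* assumed cache-line size in bytes */

/* Hypothetical stand-in for struct lruvec: frequently written fields. */
struct sketch_lruvec {
	unsigned long lists[5];
	unsigned long flags;
};

/* Without padding, lru_zone_size may share a cache line with lruvec. */
struct sketch_per_node_unpadded {
	struct sketch_lruvec lruvec;
	unsigned long lru_zone_size[4];
};

/* With padding: the next member starts on a new cache line. */
struct sketch_per_node_padded {
	struct sketch_lruvec lruvec;
	_Alignas(SKETCH_CACHELINE) unsigned long lru_zone_size[4];
};

/* Nonzero if two byte offsets (from a cache-line-aligned base) share a line. */
int same_cacheline(size_t a, size_t b)
{
	return a / SKETCH_CACHELINE == b / SKETCH_CACHELINE;
}
```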
      
      Link: https://lkml.kernel.org/r/20240723171244.747521-1-roman.gushchin@linux.dev
      Fixes: 98c9daf5 ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node")
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/oe-lkp/202407121335.31a10cb6-oliver.sang@intel.com
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • alloc_tag: outline and export free_reserved_page() · b3bebe44
      Suren Baghdasaryan authored
      Outline and export free_reserved_page() because modules use it and it,
      in turn, uses page_ext_{get|put}, which should not be exported.  The
      same result could be obtained by outlining {get|put}_page_tag_ref(), but
      that would have a higher performance impact, as these functions are used
      in more performance-critical paths.
      
      Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com
      Fixes: dcfe378c ("lib: introduce support for page allocation tagging")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Sourav Panda <souravpanda@google.com>
      Cc: <stable@vger.kernel.org>	[6.10]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • decompress_bunzip2: fix rare decompression failure · bf6acd5d
      Ross Lagerwall authored
      The decompression code parses a Huffman tree and counts the number of
      symbols for a given bit length.  In rare cases, there may be >= 256
      symbols with a given bit length, causing the unsigned char counter to
      overflow.  This causes a decompression failure later, when the code
      tries and fails to find the bit length for a given symbol.
      
      Since the maximum number of symbols is 258, use unsigned short instead.
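      A small sketch of the overflow, assuming (for illustration) a table of
      258 symbols that all share the same bit length: an 8-bit counter wraps
      past 255, while an unsigned short does not.  The function names are
      hypothetical.

```c
#include <stddef.h>

/* Buggy variant: an 8-bit counter wraps back to 0 after 255. */
unsigned char count_len_u8(const unsigned char *lengths, size_t n,
			   unsigned char len)
{
	unsigned char count = 0;

	for (size_t i = 0; i < n; i++)
		if (lengths[i] == len)
			count++;
	return count;
}

/* Fixed variant: unsigned short comfortably holds up to 258 symbols. */
unsigned short count_len_u16(const unsigned char *lengths, size_t n,
			     unsigned char len)
{
	unsigned short count = 0;

	for (size_t i = 0; i < n; i++)
		if (lengths[i] == len)
			count++;
	return count;
}
```

      With 258 matching symbols, the 8-bit counter ends at 2 instead of 258,
      which is the kind of wrong count that later makes the symbol lookup
      fail.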
      
      Link: https://lkml.kernel.org/r/20240717162016.1514077-1-ross.lagerwall@citrix.com
      Fixes: bc22c17e ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression")
      Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      Cc: Alain Knaff <alain@knaff.lu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/huge_memory: avoid PMD-size page cache if needed · d659b715
      Gavin Shan authored
      The xarray can't support arbitrary page cache sizes.  The largest
      supported page cache size is defined as MAX_PAGECACHE_ORDER by commit
      099d9064 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray").
      However, a 512MB page cache is possible in the huge memory collapsing
      path on an ARM64 system whose base page size is 64KB.  A 512MB page cache
      breaks this limitation, and a warning is raised when the xarray entry is
      split, as shown in the following example.
      
      [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
      KernelPageSize:       64 kB
      [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
         :
      int main(int argc, char **argv)
      {
      	const char *filename = TEST_XFS_FILENAME;
      	int fd = 0;
      	void *buf = (void *)-1, *p;
      	int pgsize = getpagesize();
      	int ret = 0;
      
      	if (pgsize != 0x10000) {
      		fprintf(stdout, "System with 64KB base page size is required!\n");
      		return -EPERM;
      	}
      
      	system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb");
      	system("echo 1 > /proc/sys/vm/drop_caches");
      
      	/* Open the xfs file */
      	fd = open(filename, O_RDONLY);
      	assert(fd > 0);
      
      	/* Create VMA */
      	buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
      	assert(buf != (void *)-1);
      	fprintf(stdout, "mapped buffer at 0x%p\n", buf);
      
      	/* Populate VMA */
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE);
      	assert(ret == 0);
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ);
      	assert(ret == 0);
      
      	/* Collapse VMA */
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
      	assert(ret == 0);
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE);
      	if (ret) {
      		fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno);
      		goto out;
      	}
      
      	/* Split xarray entry. Write permission is needed */
      	munmap(buf, TEST_MEM_SIZE);
      	buf = (void *)-1;
      	close(fd);
      	fd = open(filename, O_RDWR);
      	assert(fd > 0);
      	fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
       		  TEST_MEM_SIZE - pgsize, pgsize);
      out:
      	if (buf != (void *)-1)
      		munmap(buf, TEST_MEM_SIZE);
      	if (fd > 0)
      		close(fd);
      
      	return ret;
      }
      
      [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test
      [root@dhcp-10-26-1-207 ~]# /tmp/test
       ------------[ cut here ]------------
       WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
       Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib    \
       nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct      \
       nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4      \
       ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse   \
       xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net  \
       sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio
       CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9
       Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
       pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
       pc : xas_split_alloc+0xf8/0x128
       lr : split_huge_page_to_list_to_order+0x1c4/0x780
       sp : ffff8000ac32f660
       x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0
       x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d
       x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000
       x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000
       x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000
       x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
       x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c
       x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8
       x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40
       x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
       Call trace:
        xas_split_alloc+0xf8/0x128
        split_huge_page_to_list_to_order+0x1c4/0x780
        truncate_inode_partial_folio+0xdc/0x160
        truncate_inode_pages_range+0x1b4/0x4a8
        truncate_pagecache_range+0x84/0xa0
        xfs_flush_unmap_range+0x70/0x90 [xfs]
        xfs_file_fallocate+0xfc/0x4d8 [xfs]
        vfs_fallocate+0x124/0x2f0
        ksys_fallocate+0x4c/0xa0
        __arm64_sys_fallocate+0x24/0x38
        invoke_syscall.constprop.0+0x7c/0xd8
        do_el0_svc+0xb4/0xd0
        el0_svc+0x44/0x1d8
        el0t_64_sync_handler+0x134/0x150
        el0t_64_sync+0x17c/0x180
      
      Fix it by correcting the supported page cache orders, with different
      sets for DAX and other files.  With this correction, a 512MB page cache
      is disallowed for all non-DAX files on ARM64 systems whose base page
      size is 64KB.  After this patch is applied, the test program fails with
      -EINVAL, returned from __thp_vma_allowable_orders() and the madvise()
      system call that collapses the page cache.
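      The order arithmetic can be sketched as follows.  The xarray limit
      (2 * 6 - 1 = 11, from an assumed 6-bit chunk shift), capping file orders
      at the smaller of that limit and the PMD order, and letting DAX use the
      PMD order are all assumptions for illustration, not the patch's exact
      definitions.  With 8-byte page table entries, a 64KB base page gives a
      PMD order of 13 (512MB), which exceeds the limit.

```c
#include <stdbool.h>

/* Assumed: a 6-bit xarray chunk shift caps splittable orders at 2*6-1. */
#define SKETCH_XA_CHUNK_SHIFT 6
#define SKETCH_XAS_MAX_ORDER  (2 * SKETCH_XA_CHUNK_SHIFT - 1)

/* PMD order for a base page shift, assuming 8-byte page table entries. */
unsigned int sketch_pmd_order(unsigned int page_shift)
{
	/* One page-table page holds 2^(page_shift - 3) entries. */
	return page_shift - 3;
}

/* Orders allowed for regular (non-DAX) files: 1..min(PMD order, xarray cap). */
unsigned long sketch_file_orders(unsigned int page_shift)
{
	unsigned int max = sketch_pmd_order(page_shift);

	if (max > SKETCH_XAS_MAX_ORDER)
		max = SKETCH_XAS_MAX_ORDER;
	/* Bits 1..max set; bit 0 (the base page) cleared. */
	return ((1UL << (max + 1)) - 1) & ~1UL;
}

/* Assumed for illustration: DAX files may use the PMD order. */
unsigned long sketch_dax_orders(unsigned int page_shift)
{
	return 1UL << sketch_pmd_order(page_shift);
}

/* True if `order` is in the allowed set for this file type and page size. */
bool sketch_order_allowed(unsigned int order, bool dax, unsigned int page_shift)
{
	unsigned long mask = dax ? sketch_dax_orders(page_shift)
				 : sketch_file_orders(page_shift);

	return (mask >> order) & 1UL;
}
```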
      
      Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com
      Fixes: 6b24ca4a ("mm: Use multi-index entries in the page cache")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Zi Yan <ziy@nvidia.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines · d9592025
      Yang Shi authored
      Yves-Alexis Perez reported that commit 4ef9ad19 ("mm: huge_memory: don't
      force huge page alignment on 32 bit") didn't work for x86_32 [1].  This
      is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT.
      
      !CONFIG_64BIT should cover all 32 bit machines.
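      The difference can be sketched with preprocessor conditionals standing
      in for the kernel's config checks (an assumption; the real code may use
      IS_ENABLED()).  The simulated config defines only CONFIG_X86_32, as an
      x86_32 build would, and neither CONFIG_32BIT nor CONFIG_64BIT.

```c
/*
 * Simulated x86_32 configuration (hypothetical): only CONFIG_X86_32 is
 * defined; CONFIG_32BIT and CONFIG_64BIT are both absent.
 */
#define CONFIG_X86_32 1

/* Old test: relax alignment only if CONFIG_32BIT is set. */
int old_relaxes_alignment(void)
{
#ifdef CONFIG_32BIT
	return 1;
#else
	return 0;	/* x86_32 ends up here and still forces alignment */
#endif
}

/* New test: relax alignment whenever CONFIG_64BIT is not set. */
int new_relaxes_alignment(void)
{
#ifndef CONFIG_64BIT
	return 1;	/* covers x86_32 and every other 32-bit config */
#else
	return 0;
#endif
}
```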
      
      [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com
      Fixes: 4ef9ad19 ("mm: huge_memory: don't force huge page alignment on 32 bit")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: Yves-Alexis Perez <corsac@debian.org>
      Tested-by: Yves-Alexis Perez <corsac@debian.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Salvatore Bonaccorso <carnil@debian.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>	[6.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: fix old/young bit handling in the faulting path · 4cd7ba16
      Ram Tummala authored
      Commit 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()")
      replaced do_set_pte() with set_pte_range() and that introduced a
      regression in the following faulting path of non-anonymous vmas which
      caused the PTE for the faulting address to be marked as old instead of
      young.
      
      handle_pte_fault()
        do_pte_missing()
          do_fault()
            do_read_fault() || do_cow_fault() || do_shared_fault()
              finish_fault()
                set_pte_range()
      
      The polarity of the prefault calculation is incorrect.  This leads to prefault
      being incorrectly set for the faulting address.  The following check will
      incorrectly mark the PTE old rather than young.  On some architectures
      this will cause a double fault to mark it young when the access is
      retried.
      
          if (prefault && arch_wants_old_prefaulted_pte())
              entry = pte_mkold(entry);
      
      On a subsequent fault on the same address, the faulting path will see a
      non-NULL vmf->pte and, instead of reaching the do_pte_missing() path,
      the PTE will then be correctly marked young in handle_pte_fault() itself.
      
      Because of this bug, unnecessary double faults degrade performance in
      the fault handling path.
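      The polarity issue can be sketched as below.  The helper names, the
      4KiB page size and the exact form of the prefault test are assumptions;
      the sketch only models the idea that the PTE covering the faulting
      address should not be treated as prefaulted.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size */

/* Unsigned range check in the style of the kernel's in_range(). */
static bool sketch_in_range(unsigned long val, unsigned long start,
			    unsigned long len)
{
	return val - start < len;
}

/* Buggy polarity: the range containing the fault counts as a prefault. */
bool prefault_buggy(unsigned long fault_addr, unsigned long addr,
		    unsigned int nr)
{
	return sketch_in_range(fault_addr, addr, nr * SKETCH_PAGE_SIZE);
}

/* Fixed polarity: only a range that excludes the fault is a prefault. */
bool prefault_fixed(unsigned long fault_addr, unsigned long addr,
		    unsigned int nr)
{
	return !sketch_in_range(fault_addr, addr, nr * SKETCH_PAGE_SIZE);
}

/* Models: if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); */
bool pte_would_be_old(bool prefault, bool arch_wants_old_prefaulted)
{
	return prefault && arch_wants_old_prefaulted;
}
```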
      
      Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
      Fixes: 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()")
      Signed-off-by: Ram Tummala <rtummala@nvidia.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dt-bindings: arm: update James Clark's email address · 34e526f6
      James Clark authored
      My new address is james.clark@linaro.org
      
      Link: https://lkml.kernel.org/r/20240709102512.31212-3-james.clark@linaro.org
      Signed-off-by: James Clark <james.clark@linaro.org>
      Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
      Cc: Conor Dooley <conor+dt@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geliang Tang <geliang@kernel.org>
      Cc: Hao Zhang <quic_hazha@quicinc.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
      Cc: Mao Jinlong <quic_jinlmao@quicinc.com>
      Cc: Matthieu Baerts <matttbe@kernel.org>
      Cc: Matt Ranostay <matt@ranostay.sg>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: Oleksij Rempel <o.rempel@pengutronix.de>
      Cc: Rob Herring (Arm) <robh@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: mailmap: update James Clark's email address · 5bf6f3c5
      James Clark authored
      My new address is james.clark@linaro.org
      
      Link: https://lkml.kernel.org/r/20240709102512.31212-2-james.clark@linaro.org
      Signed-off-by: James Clark <james.clark@linaro.org>
      Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
      Cc: Conor Dooley <conor+dt@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geliang Tang <geliang@kernel.org>
      Cc: Hao Zhang <quic_hazha@quicinc.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
      Cc: Mao Jinlong <quic_jinlmao@quicinc.com>
      Cc: Matthieu Baerts <matttbe@kernel.org>
      Cc: Matt Ranostay <matt@ranostay.sg>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: Oleksij Rempel <o.rempel@pengutronix.de>
      Cc: Rob Herring (Arm) <robh@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Merge tag 'auxdisplay-for-v6.11-tag1' of... · 2f8c4f50
      Linus Torvalds authored
      Merge tag 'auxdisplay-for-v6.11-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
      
      Pull auxdisplay updates from Geert Uytterhoeven:
      
        - add support for configuring the boot message on line displays
      
        - miscellaneous fixes and improvements
      
      * tag 'auxdisplay-for-v6.11-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
        auxdisplay: ht16k33: Drop reference after LED registration
        auxdisplay: Use sizeof(*pointer) instead of sizeof(type)
        auxdisplay: hd44780: add missing MODULE_DESCRIPTION() macro
        auxdisplay: linedisp: add missing MODULE_DESCRIPTION() macro
        auxdisplay: linedisp: Support configuring the boot message
        auxdisplay: charlcd: Provide a forward declaration
    • Linus Torvalds
      Merge tag 'sound-fix-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · eb966e0c
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "A collection of fixes gathered since the previous pull.
      
        There is a relatively large LOC count in one HD-audio quirk, but that's
        only bulk COEF data, hence it's safe to take. In addition to that, there
        were two minor fixes for MIDI 2.0 handling in the ALSA core, and the rest
        are all rather random, small, device-specific fixes"
      
      * tag 'sound-fix-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ASoC: fsl-asoc-card: Dynamically allocate memory for snd_soc_dai_link_components
        ASoC: amd: yc: Support mic on Lenovo Thinkpad E16 Gen 2
        ALSA: hda/realtek: Implement sound init sequence for Samsung Galaxy Book3 Pro 360
        ALSA: hda/realtek: cs35l41: Fixup remaining asus strix models
        ASoC: SOF: ipc4-topology: Preserve the DMA Link ID for ChainDMA on unprepare
        ASoC: SOF: ipc4-topology: Only handle dai_config with HW_PARAMS for ChainDMA
        ALSA: ump: Force 1 Group for MIDI1 FBs
        ALSA: ump: Don't update FB name for static blocks
        ALSA: usb-audio: Add a quirk for Sonix HD USB Camera
        ASoC: TAS2781: Fix tasdev_load_calibrated_data()
        ASoC: tegra: select CONFIG_SND_SIMPLE_CARD_UTILS
        ASoC: Intel: use soc_intel_is_byt_cr() only when IOSF_MBI is reachable
        ALSA: usb-audio: Move HD Webcam quirk to the right place
        ALSA: hda: tas2781: mark const variables as __maybe_unused
        ALSA: usb-audio: Fix microphone sound on HD webcam.
        ASoC: sof: amd: fix for firmware reload failure in Vangogh platform
        ASoC: Intel: Fix RT5650 SSP lookup
        ASOC: SOF: Intel: hda-loader: only wait for HDaudio IOC for IPC4 devices
        ASoC: SOF: imx8m: Fix DSP control regmap retrieval
    • Linus Torvalds
      Merge tag 'drm-next-2024-07-26' of https://gitlab.freedesktop.org/drm/kernel · 0ba9b155
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Fixes for rc1, mostly amdgpu, i915 and xe, with some other misc ones,
        doesn't seem to be anything too serious.
      
        amdgpu:
         - Bump driver version for GFX12 DCC
         - DC documentation warning fixes
         - VCN unified queue power fix
         - SMU fix
         - RAS fix
         - Display corruption fix
         - SDMA 5.2 workaround
         - GFX12 fixes
         - Uninitialized variable fix
         - VCN/JPEG 4.0.3 fixes
         - Misc display fixes
         - RAS fixes
         - VCN4/5 harvest fix
         - GPU reset fix
      
        i915:
         - Reset intel_dp->link_trained before retraining the link
         - Don't switch the LTTPR mode on an active link
         - Do not consider preemption during execlists_dequeue for gen8
         - Allow NULL memory region
      
        xe:
         - xe_exec ioctl minor fix on sync entry cleanup upon error
         - SRIOV: limit VF LMEM provisioning
         - Wedge mode fixes
      
        v3d:
         - fix indirect dispatch on newer v3d revs
      
        panel:
         - fix panel backlight bindings"
      
      * tag 'drm-next-2024-07-26' of https://gitlab.freedesktop.org/drm/kernel: (39 commits)
        drm/amdgpu: reset vm state machine after gpu reset(vram lost)
        drm/amdgpu: add missed harvest check for VCN IP v4/v5
        drm/amdgpu: Fix eeprom max record count
        drm/amdgpu: fix ras UE error injection failure issue
        drm/amd/display: Remove ASSERT if significance is zero in math_ceil2
        drm/amd/display: Check for NULL pointer
        drm/amdgpu/vcn: Use offsets local to VCN/JPEG in VF
        drm/amdgpu: Add empty HDP flush function to VCN v4.0.3
        drm/amdgpu: Add empty HDP flush function to JPEG v4.0.3
        drm/amd/amdgpu: Fix uninitialized variable warnings
        drm/amdgpu: Fix atomics on GFX12
        drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell
        drm/i915: Allow NULL memory region
        drm/i915/gt: Do not consider preemption during execlists_dequeue for gen8
        dt-bindings: display: panel: samsung,atna33xc20: Document ATNA45AF01
        drm/xe: Don't suspend device upon wedge
        drm/xe: Wedge the entire device
        drm/xe/pf: Limit fair VF LMEM provisioning
        drm/xe/exec: Fix minor bug related to xe_sync_entry_cleanup
        drm/amd/display: fix corruption with high refresh rates on DCN 3.0
        ...
    • Linus Torvalds
      Merge tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 65ad409e
      Linus Torvalds authored
      Pull more s390 updates from Vasily Gorbik:
      
       - Fix KMSAN build breakage caused by the conflict between s390 and
         mm-stable trees
      
       - Add KMSAN page markers for ptdump
      
       - Add runtime constant support
      
       - Fix __pa/__va for modules under non-GPL licenses by exporting
         necessary vm_layout struct with EXPORT_SYMBOL to prevent linkage
         problems
      
       - Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
         Counter Facility code when the counter set size is zero
      
       - Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
         its functionality by default
      
       - Support allocation of multiple MSI interrupts per device and improve
         logging of architecture-specific limitations
      
       - Add support for lowcore relocation as a debugging feature to catch
         all null ptr dereferences in the kernel address space, improving
         detection beyond the current implementation's limited write access
         protection
      
       - Clean up and rework CPU alternatives to allow for callbacks and early
         patching for the lowcore relocation
      
      * tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
        s390: Remove protvirt and kvm config guards for uv code
        s390/boot: Add cmdline option to relocate lowcore
        s390/kdump: Make kdump ready for lowcore relocation
        s390/entry: Make system_call() ready for lowcore relocation
        s390/entry: Make ret_from_fork() ready for lowcore relocation
        s390/entry: Make __switch_to() ready for lowcore relocation
        s390/entry: Make restart_int_handler() ready for lowcore relocation
        s390/entry: Make mchk_int_handler() ready for lowcore relocation
        s390/entry: Make int handlers ready for lowcore relocation
        s390/entry: Make pgm_check_handler() ready for lowcore relocation
        s390/entry: Add base register to CHECK_VMAP_STACK/CHECK_STACK macro
        s390/entry: Add base register to SIEEXIT macro
        s390/entry: Add base register to MBEAR macro
        s390/entry: Make __sie64a() ready for lowcore relocation
        s390/head64: Make startup code ready for lowcore relocation
        s390: Add infrastructure to patch lowcore accesses
        s390/atomic_ops: Disable flag outputs constraint for GCC versions below 14.2.0
        s390/entry: Move SIE indicator flag to thread info
        s390/nmi: Simplify ptregs setup
        s390/alternatives: Remove alternative facility list
        ...
    • Linus Torvalds
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · a6294b5b
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "The usual summary below, but the main fix is for the fast GUP lockless
        page-table walk when we have a combination of compile-time and
        run-time folding of the p4d and the pud respectively.
      
         - Remove some redundant Kconfig conditionals
      
         - Fix string output in ptrace selftest
      
         - Fix fast GUP crashes in some page-table configurations
      
         - Remove obsolete linker option when building the vDSO
      
         - Fix some sysreg field definitions for the GIC"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: mm: Fix lockless walks with static and dynamic page-table folding
        arm64/sysreg: Correct the values for GICv4.1
        arm64/vdso: Remove --hash-style=sysv
        kselftest: missing arg in ptrace.c
        arm64/Kconfig: Remove redundant 'if HAVE_FUNCTION_GRAPH_TRACER'
        arm64: remove redundant 'if HAVE_ARCH_KASAN' in Kconfig
    • Linus Torvalds
      Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client · 6467dfdf
      Linus Torvalds authored
      Pull ceph updates from Ilya Dryomov:
       "A small patchset, marked for stable, to address bogus I/O errors and
        ultimately an assertion failure in the face of watch errors with
        -o exclusive mappings in RBD, plus some assorted CephFS fixes"
      
      * tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client:
        rbd: don't assume rbd_is_lock_owner() for exclusive mappings
        rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings
        rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
        ceph: fix incorrect kmalloc size of pagevec mempool
        ceph: periodically flush the cap releases
        ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
        ceph: use cap_wait_list only if debugfs is enabled
    • Linus Torvalds
      Merge tag 'erofs-for-6.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 732c2753
      Linus Torvalds authored
      Pull more erofs updates from Gao Xiang:
      
       - Support STATX_DIOALIGN and FS_IOC_GETFSSYSFSPATH
      
       - Fix a race of LZ4 decompression due to recent refactoring
      
       - Another multi-page folio adaption in erofs_bread()
      
      * tag 'erofs-for-6.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: convert comma to semicolon
        erofs: support multi-page folios for erofs_bread()
        erofs: add support for FS_IOC_GETFSSYSFSPATH
        erofs: fix race in z_erofs_get_gbuf()
        erofs: support STATX_DIOALIGN
    • Linus Torvalds
      Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · dd90ad50
      Linus Torvalds authored
      Pull struct file leak fixes from Al Viro:
       "a couple of leaks on failure exits missing fdput()"
      
      * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        lirc: rc_dev_get_from_fd(): fix file leak
        powerpc: fix a file leak in kvm_vcpu_ioctl_enable_cap()
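      Both fixes address the same bug class: a failure exit that returns before the
      matching fdput(). A minimal userspace sketch of the pattern, with hypothetical
      get_ref()/put_ref() helpers standing in for the kernel's fdget()/fdput()
      pairing (this is not the code from either patch):

```c
/* Hypothetical reference counter standing in for a struct file reference. */
static int refs;

static void get_ref(void) { refs++; }
static void put_ref(void) { refs--; }

/* Buggy shape: the failure exit returns before the reference is dropped. */
static int leaky_op(int arg)
{
	get_ref();
	if (arg < 0)
		return -1;	/* leaks one reference */
	put_ref();
	return 0;
}

/* Fixed shape: every exit path funnels through a single release. */
static int fixed_op(int arg)
{
	int ret = 0;

	get_ref();
	if (arg < 0) {
		ret = -1;
		goto out;
	}
	/* ... work done while the reference is held ... */
out:
	put_ref();
	return ret;
}
```

      Calling leaky_op(-1) leaves refs at 1, while fixed_op() returns it to 0 on
      both the success and the failure path.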
    • Linus Torvalds
      arm64: allow installing compressed image by default · 4c7be57f
      Linus Torvalds authored
      On arm64 we build compressed images, but "make install" by default will
      install the old non-compressed one.  To actually get the compressed
      image install, you need to use "make zinstall", which is not the usual
      way to install a kernel.
      
      Which may not sound like much of an issue, but when you deal with
      multiple architectures (and years of your fingers knowing the regular
      "make install" incantation), this inconsistency is pretty annoying.
      
      But as Will Deacon says:
       "Sadly, bootloaders being as top quality as you might expect, I don't
        think we're in a position to rely on decompressor support across the
        board. Our Image.gz is literally just that -- we don't have a built-in
        decompressor (nor do I think we want to rush into that again after the
        fun we had on arm32) and the recent EFI zboot support solves that
        problem for platforms using EFI.
      
        Changing the default 'install' target terrifies me. There are bound to
        be folks with embedded boards who've scripted this and we could really
        ruin their day if we quietly give them a compressed kernel that their
        bootloader doesn't know how to handle :/"
      
      So make this conditional on a new "COMPRESSED_INSTALL" option.
      
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds
      Merge tag 'bitmap-6.11-rc1' of https://github.com:/norov/linux · 51c47675
      Linus Torvalds authored
      Pull bitmap updates from Yury Norov:
       "Random fixes"
      
      * tag 'bitmap-6.11-rc1' of https://github.com:/norov/linux:
        riscv: Remove unnecessary int cast in variable_fls()
        radix tree test suite: put definition of bitmap_clear() into lib/bitmap.c
        bitops: Add a comment explaining the double underscore macros
        lib: bitmap: add missing MODULE_DESCRIPTION() macros
        cpumask: introduce assign_cpu() macro
    • Palmer Dabbelt
      RISC-V: Provide the frequency of time CSR via hwprobe · 52420e48
      Palmer Dabbelt authored
      The RISC-V architecture makes a real-time counter CSR (via the RDTIME
      instruction) available for applications in U-mode but there is no
      architected mechanism for an application to discover the frequency
      the counter is running at. Some applications (e.g., DPDK) use the
      time counter for basic performance analysis as well as fine grained
      time-keeping.
      
      Add support to the hwprobe system call to export the time CSR
      frequency to code running in U-mode.
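      Once the frequency is known, turning a difference between two RDTIME
      readings into nanoseconds is simple arithmetic. A minimal sketch, assuming
      the caller has already read the frequency in Hz through hwprobe (that
      query is not shown, and time_ticks_to_ns() is a hypothetical name):

```c
#include <stdint.h>

/*
 * Convert a delta of time-CSR ticks into nanoseconds, given the counter
 * frequency in Hz (must be non-zero). Splitting into whole seconds and a
 * remainder avoids overflowing 64 bits for large tick deltas.
 */
static uint64_t time_ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
	const uint64_t nsec_per_sec = 1000000000ULL;
	uint64_t secs = ticks / freq_hz;
	uint64_t rem = ticks % freq_hz;

	/* rem < freq_hz, so rem * nsec_per_sec fits while freq_hz < ~18 GHz */
	return secs * nsec_per_sec + (rem * nsec_per_sec) / freq_hz;
}
```

      For example, with a 10 MHz counter a delta of 15 ticks is 1500 ns.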
      Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
      Reviewed-by: Evan Green <evan@rivosinc.com>
      Reviewed-by: Anup Patel <anup@brainfault.org>
      Acked-by: Punit Agrawal <punit.agrawal@bytedance.com>
      Link: https://lore.kernel.org/r/20240702033731.71955-2-cuiyunhui@bytedance.com
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Stuart Menefy
      riscv: Extend sv39 linear mapping max size to 128G · 5c8405d7
      Stuart Menefy authored
      This harmonizes all virtual addressing modes, which can now all map
      (PGDIR_SIZE * PTRS_PER_PGD) / 4 of physical memory.
      
      The RISCV implementation of KASAN requires that the boundary between
      shallow mappings is aligned on an 8G boundary. In this case we need
      VMALLOC_START to be 8G aligned. So although we only need to move the
      start of the linear mapping down by 4GiB to allow 128GiB to be mapped,
      we actually move it down by 8GiB (creating a 4GiB hole between the
      linear mapping and KASAN shadow space) to maintain the alignment
      requirement.
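      For sv39 the formula evaluates as follows, assuming the standard sv39
      geometry in which each top-level entry maps 1 GiB (PGDIR_SIZE = 2^30) and
      the top-level table has 512 entries (PTRS_PER_PGD = 2^9):

```latex
\[
\frac{\mathrm{PGDIR\_SIZE}\times\mathrm{PTRS\_PER\_PGD}}{4}
  = \frac{2^{30}\times 2^{9}}{4}
  = \frac{2^{39}}{2^{2}}
  = 2^{37}\ \text{bytes}
  = 128\ \text{GiB}
\]
```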
      Signed-off-by: Stuart Menefy <stuart.menefy@codasip.com>
      Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Link: https://lore.kernel.org/r/20240630110550.1731929-1-stuart.menefy@codasip.com
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Palmer Dabbelt
      Merge patch series "RISC-V: Select ACPI PPTT drivers" · 3aa1a7d0
      Palmer Dabbelt authored
      This series adds support for ACPI PPTT via cacheinfo.
      
      * b4-shazam-merge:
        RISC-V: Select ACPI PPTT drivers
        riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT
        riscv: cacheinfo: remove the useless input parameter (node) of ci_leaf_init()
      
      Link: https://lore.kernel.org/r/20240617131425.7526-1-cuiyunhui@bytedance.com
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Palmer Dabbelt
      Merge patch "Enable SPCR table for console output on RISC-V" · ec1dc56b
      Palmer Dabbelt authored
      Sia Jee Heng <jeeheng.sia@starfivetech.com> says:
      
      The ACPI SPCR code has been used to enable console output for ARM64 and
      X86. The same code can be reused for RISC-V. Furthermore, the SPCR table
      is mandated for headless systems as outlined in the RISC-V BRS
      Specification, chapter 6.
      
      * b4-shazam-merge:
        RISC-V: ACPI: Enable SPCR table for console output on RISC-V
      
      Link: https://lore.kernel.org/r/20240502073751.102093-1-jeeheng.sia@starfivetech.com
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Jisheng Zhang
      riscv: enable HAVE_ARCH_STACKLEAK · b5db73fb
      Jisheng Zhang authored
      Add support for the stackleak feature. Whenever the kernel returns to user
      space, the kernel stack is filled with a poison value.
      
      At the same time, disable the plugin in the EFI stub code because the EFI
      stub is out of scope for the protection.
      
      Tested on qemu and milkv duo:
      / # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
      [   38.675575] lkdtm: Performing direct entry STACKLEAK_ERASING
      [   38.678448] lkdtm: stackleak stack usage:
      [   38.678448]   high offset: 288 bytes
      [   38.678448]   current:     496 bytes
      [   38.678448]   lowest:      1328 bytes
      [   38.678448]   tracked:     1328 bytes
      [   38.678448]   untracked:   448 bytes
      [   38.678448]   poisoned:    14312 bytes
      [   38.678448]   low offset:  8 bytes
      [   38.689887] lkdtm: OK: the rest of the thread stack is properly erased
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
      Link: https://lore.kernel.org/r/20240623235316.2010-1-jszhang@kernel.org
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Zhongqiu Han
      riscv: signal: Remove unlikely() from WARN_ON() condition · 1d20e5d4
      Zhongqiu Han authored
      "WARN_ON(unlikely(x))" is excessive. WARN_ON() already uses unlikely()
      internally.
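      For reference, the generic WARN_ON() in include/asm-generic/bug.h already
      wraps its condition in unlikely(). The sketch below is a simplified
      userspace stand-in (DEMO_WARN_ON is a hypothetical name, and it prints
      instead of calling __WARN()); it needs GCC or Clang because it relies on
      statement expressions, as the kernel macro does:

```c
#include <stdio.h>

/* Branch hint as defined in the kernel's <linux/compiler.h>. */
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif

/*
 * The condition is already passed through unlikely() here, so writing
 * DEMO_WARN_ON(unlikely(x)) would add nothing.
 */
#define DEMO_WARN_ON(condition) ({					\
	int warn_ret = !!(condition);					\
	if (unlikely(warn_ret))						\
		fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__); \
	unlikely(warn_ret);						\
})

static int check_value(int v)
{
	/* Preferred form: no extra unlikely() around the condition. */
	return DEMO_WARN_ON(v < 0);
}
```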
      Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
      Reviewed-by: Bjorn Andersson <quic_bjorande@quicinc.com>
      Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
      Link: https://lore.kernel.org/r/20240620033434.3778156-1-quic_zhonhan@quicinc.com
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Anton Blanchard
      riscv: Improve exception and system call latency · 5d5fc33c
      Anton Blanchard authored
      Many CPUs implement return address branch prediction as a stack. The
      RISCV architecture refers to this as a return address stack (RAS). If
      this gets corrupted, then the CPU will mispredict at least one but
      potentially many function returns.
      
      There are two issues with the current RISCV exception code:
      
      - We are using the alternate link register (x5/t0) for the indirect
        branch, which makes the hardware think this is a function return. This
        will corrupt the RAS.
      
      - We modify the return address of handle_exception to point to
        ret_from_exception. This will also corrupt the RAS.
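      A toy model makes the effect concrete. The sketch below is a hypothetical
      illustration, not RISC-V kernel code and not any particular core's
      predictor: a call pushes its return address, and any jump the hardware
      treats as a return pops a prediction and compares it with the real target.

```c
#define RAS_DEPTH 8

/*
 * Toy return address stack (RAS). Real predictors differ in depth,
 * overflow handling and recovery; this only shows why an unmatched
 * "return" or a rewritten return address causes mispredictions.
 */
struct ras {
	unsigned long slot[RAS_DEPTH];
	unsigned int tos;		/* index of the top entry */
	unsigned int count;		/* valid entries, at most RAS_DEPTH */
	unsigned int mispredicts;
};

/* A call pushes its return address; on overflow the oldest entry is lost. */
static void ras_call(struct ras *r, unsigned long ret_addr)
{
	r->tos = (r->tos + 1) % RAS_DEPTH;
	r->slot[r->tos] = ret_addr;
	if (r->count < RAS_DEPTH)
		r->count++;
}

/* Anything classified as a return pops a prediction and checks it. */
static void ras_return(struct ras *r, unsigned long actual_target)
{
	unsigned long predicted = 0;	/* no prediction when empty */

	if (r->count) {
		predicted = r->slot[r->tos];
		r->tos = (r->tos + RAS_DEPTH - 1) % RAS_DEPTH;
		r->count--;
	}
	if (predicted != actual_target)
		r->mispredicts++;
}
```

      Two calls followed by their two matching returns give no mispredictions.
      Inserting one extra return-like jump between them (the t0 case above)
      shifts the stack, so both genuine returns mispredict as well, for three
      mispredictions in total.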
      
      Testing the null system call latency before and after the patch:
      
      Visionfive2 (StarFive JH7110 / U74)
      baseline: 189.87 ns
      patched:  176.76 ns
      
      Lichee pi 4a (T-Head TH1520 / C910)
      baseline: 666.58 ns
      patched:  636.90 ns
      
      Just over 7% on the U74 and just over 4% on the C910.
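      The "just over 7%" figure for the U74 matches the speedup (baseline latency
      divided by patched latency); expressed as a reduction in latency it is
      about 6.9%. Both readings, worked from the numbers above:

```latex
\[
\text{U74: } \frac{189.87}{176.76} - 1 \approx 7.4\%, \qquad
1 - \frac{176.76}{189.87} \approx 6.9\%
\]
\[
\text{C910: } \frac{666.58}{636.90} - 1 \approx 4.7\%, \qquad
1 - \frac{636.90}{666.58} \approx 4.5\%
\]
```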
      Signed-off-by: Anton Blanchard <antonb@tenstorrent.com>
      Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com>
      Tested-by: Jisheng Zhang <jszhang@kernel.org>
      Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
      Link: https://lore.kernel.org/r/20240607061335.2197383-1-cyrilbur@tenstorrent.com
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • Chen Ni
      erofs: convert comma to semicolon · 14e9283f
      Chen Ni authored
      Replace a comma between expression statements by a semicolon.
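      When the comma joins two expression statements that are not under a
      conditional, swapping in a semicolon does not change behavior and only
      improves readability. The hypothetical functions below show why the comma
      form is treated as a trap: the two spellings diverge once an unbraced if
      is involved.

```c
/*
 * The comma operator evaluates both operands, so "a = 1, b = 2;" acts
 * like two statements. Under an unbraced if, the comma form keeps both
 * assignments conditional, while the semicolon form makes the second
 * one unconditional.
 */
static void assign_with_comma(int cond, int *a, int *b)
{
	if (cond)
		*a = 1, *b = 2;		/* both guarded by the if */
}

static void assign_with_semicolons(int cond, int *a, int *b)
{
	if (cond)
		*a = 1;
	*b = 2;				/* always executed */
}
```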
      Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
      Link: https://lore.kernel.org/r/20240724020721.2389738-1-nichen@iscas.ac.cn
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    • Gao Xiang
      erofs: support multi-page folios for erofs_bread() · 5d3bb77e
      Gao Xiang authored
      If the requested page is part of the previous multi-page folio, there
      is no need to call read_mapping_folio() again.
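      The reuse check boils down to a range test on page indices. A minimal
      sketch, assuming a simplified stand-in structure (struct cached_folio and
      cached_folio_covers() are hypothetical; the kernel's folio_contains()
      helper in <linux/pagemap.h> performs a similar test on a real struct folio):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified view of a cached folio: the page-cache index
 * of its first page and the number of pages it spans (1 for a
 * single-page folio).
 */
struct cached_folio {
	unsigned long index;
	unsigned long nr_pages;
};

/* True when page 'index' lies inside the cached folio (f may be NULL). */
static bool cached_folio_covers(const struct cached_folio *f, unsigned long index)
{
	return f && index >= f->index && index - f->index < f->nr_pages;
}
```

      With a 4-page folio starting at index 16, pages 16 through 19 are served
      from it, while 15 and 20 need a fresh lookup.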
      
      Also, get rid of the remaining use of page->index [1] in our codebase.
      
      [1] https://lore.kernel.org/r/Zp8fgUSIBGQ1TN0D@casper.infradead.org
      
      Cc: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20240723073024.875290-1-hsiangkao@linux.alibaba.com
    • Huang Xiaojia
      erofs: add support for FS_IOC_GETFSSYSFSPATH · 684b290a
      Huang Xiaojia authored
      The FS_IOC_GETFSSYSFSPATH ioctl exposes the /sys/fs path of a given
      filesystem, potentially standardizing sysfs reporting. This patch adds
      support for FS_IOC_GETFSSYSFSPATH to erofs: "erofs/<dev>" will be outputted for bdev
      cases, "erofs/[domain_id,]<fs_id>" will be outputted for fscache cases.
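      The two output shapes can be illustrated with a small formatter. This is a
      hypothetical userspace helper that mirrors the format described above, not
      erofs's implementation, and the example names are made up:

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "erofs/<name>" for block devices, or "erofs/<domain_id>,<name>"
 * for fscache mounts that have a domain. Returns the snprintf() result:
 * the length needed (>= size means the buffer was too small), or a
 * negative value on an output error.
 */
static int format_erofs_sysfs_path(char *buf, size_t size,
				   const char *domain_id, const char *name)
{
	if (domain_id && domain_id[0] != '\0')
		return snprintf(buf, size, "erofs/%s,%s", domain_id, name);
	return snprintf(buf, size, "erofs/%s", name);
}
```

      A bdev mount of sda1 would then read erofs/sda1, while an fscache mount in
      domain dom with fs id fs1 would read erofs/dom,fs1.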
      Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com>
      Link: https://lore.kernel.org/r/20240720082335.441563-1-huangxiaojia2@huawei.com
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    • Gao Xiang
      erofs: fix race in z_erofs_get_gbuf() · 7dc5537c
      Gao Xiang authored
      In z_erofs_get_gbuf(), the current task may be migrated to another
      CPU between `z_erofs_gbuf_id()` and `spin_lock(&gbuf->lock)`.
      
      Therefore, z_erofs_put_gbuf() will trigger the following issue
      which was found by a stress test:
      
      <2>[772156.434168] kernel BUG at fs/erofs/zutil.c:58!
      ..
      <4>[772156.435007]
      <4>[772156.439237] CPU: 0 PID: 3078 Comm: stress Kdump: loaded Tainted: G            E      6.10.0-rc7+ #2
      <4>[772156.439239] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 1.0.0 01/01/2017
      <4>[772156.439241] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      <4>[772156.439243] pc : z_erofs_put_gbuf+0x64/0x70 [erofs]
      <4>[772156.439252] lr : z_erofs_lz4_decompress+0x600/0x6a0 [erofs]
      ..
      <6>[772156.445958] stress (3127): drop_caches: 1
      <4>[772156.446120] Call trace:
      <4>[772156.446121]  z_erofs_put_gbuf+0x64/0x70 [erofs]
      <4>[772156.446761]  z_erofs_lz4_decompress+0x600/0x6a0 [erofs]
      <4>[772156.446897]  z_erofs_decompress_queue+0x740/0xa10 [erofs]
      <4>[772156.447036]  z_erofs_runqueue+0x428/0x8c0 [erofs]
      <4>[772156.447160]  z_erofs_readahead+0x224/0x390 [erofs]
      ..
      
      Fixes: f36f3010 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable")
      Cc: <stable@vger.kernel.org> # 6.10+
      Reviewed-by: Chunhai Guo <guochunhai@vivo.com>
      Reviewed-by: Sandeep Dhavale <dhavale@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20240722035110.3456740-1-hsiangkao@linux.alibaba.com
    • Hongbo Li
      erofs: support STATX_DIOALIGN · 9c421ef3
      Hongbo Li authored
      Add support for STATX_DIOALIGN to EROFS, so that direct I/O
      alignment restrictions are exposed to userspace in a generic
      way.
      
      [Before]
      ```
      ./statx_test /mnt/erofs/testfile
      statx(/mnt/erofs/testfile) = 0
      dio mem align:0
      dio offset align:0
      ```
      
      [After]
      ```
      ./statx_test /mnt/erofs/testfile
      statx(/mnt/erofs/testfile) = 0
      dio mem align:512
      dio offset align:512
      ```
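      A userspace consumer would query these values with statx(2) and check each
      O_DIRECT request against them. A minimal sketch, assuming Linux 6.1+ UAPI
      headers and a glibc that exposes statx(); query_dio_align() and
      dio_request_aligned() are hypothetical names, and the fields read are the
      ones behind the "dio mem align" / "dio offset align" lines shown above:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/*
 * Query the direct I/O alignment of 'path'. Returns 0 on success; both
 * values stay 0 when the filesystem does not report STATX_DIOALIGN.
 * Returns -1 on error or when the headers predate STATX_DIOALIGN.
 */
static int query_dio_align(const char *path, uint32_t *mem_align,
			   uint32_t *off_align)
{
	*mem_align = 0;
	*off_align = 0;
#ifdef STATX_DIOALIGN
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) != 0)
		return -1;
	if (stx.stx_mask & STATX_DIOALIGN) {
		*mem_align = stx.stx_dio_mem_align;
		*off_align = stx.stx_dio_offset_align;
	}
	return 0;
#else
	(void)path;
	return -1;
#endif
}

/*
 * The buffer address must be a multiple of mem_align; the file offset
 * and the length must be multiples of off_align. A zero alignment means
 * direct I/O is not supported for the file.
 */
static bool dio_request_aligned(const void *buf, uint64_t offset, uint64_t len,
				uint32_t mem_align, uint32_t off_align)
{
	if (mem_align == 0 || off_align == 0)
		return false;
	return (uintptr_t)buf % mem_align == 0 &&
	       offset % off_align == 0 && len % off_align == 0;
}
```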
      Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20240718083243.2485437-1-hsiangkao@linux.alibaba.com
  3. 25 Jul, 2024 1 commit