1. 29 Mar, 2023 1 commit
    • md: fix regression for null-ptr-deference in __md_stop() · 433279be
      Yu Kuai authored
      Commit 3e453522 ("md: Free resources in __md_stop") tried to fix a
      null-ptr-dereference for 'active_io' by moving percpu_ref_exit() to
      __md_stop(). However, the commit also moved the percpu_ref_exit() of
      'writes_pending' to __md_stop(), which breaks the mdadm tests:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37
      RIP: 0010:free_percpu+0x465/0x670
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x48/0x70
       percpu_ref_exit+0x1a/0x90
       __md_stop+0xe9/0x170
       do_md_stop+0x1e1/0x7b0
       md_ioctl+0x90c/0x1aa0
       blkdev_ioctl+0x19b/0x400
       vfs_ioctl+0x20/0x50
       __x64_sys_ioctl+0xba/0xe0
       do_syscall_64+0x6c/0xe0
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The problem can be reproduced 100% of the time with the following test:
      
      mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force
      echo inactive > /sys/block/md0/md/array_state
      echo read-auto  > /sys/block/md0/md/array_state
      echo inactive > /sys/block/md0/md/array_state
      
      Root cause:
      
      // start raid
      raid1_run
       mddev_init_writes_pending
        percpu_ref_init
      
      // inactive raid
      array_state_store
       do_md_stop
        __md_stop
         percpu_ref_exit
      
      // start raid again
      array_state_store
       do_md_run
        raid1_run
         mddev_init_writes_pending
          if (mddev->writes_pending.percpu_count_ptr)
          // won't reinit
      
      // inactive raid again
      ...
      percpu_ref_exit
      -> null-ptr-dereference
      
      Before the commit, 'writes_pending' was exited when the mddev was freed,
      and it was safe to restart the raid because mddev_init_writes_pending()
      already makes sure that 'writes_pending' is only initialized once.
      
      Fix the problem by moving the exit of 'writes_pending' back. This makes
      it a little harder to see how allocating and freeing the memory pair up;
      however, the code change is much smaller, and we have lived with this
      for a long time already.
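The root cause above can be modelled in user space. The following is only a sketch: `struct toy_ref`, `toy_init_once()`, `toy_exit()` and `toy_restart_is_safe()` are hypothetical names, not kernel code, and `toy_exit()` returns false at the point where the call chain above shows the kernel dereferencing a NULL pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a percpu_ref: 'count_ptr' models
 * percpu_count_ptr, 'data' models separately allocated state. */
struct toy_ref {
    unsigned long count_ptr;
    int *data;
};

#define TOY_DEAD 0x2UL /* assumed: a non-zero marker left behind by exit */

/* Models mddev_init_writes_pending(): initialise only once. */
static int toy_init_once(struct toy_ref *ref)
{
    if (ref->count_ptr)        /* still non-zero after exit: no reinit */
        return 0;
    ref->data = malloc(sizeof(*ref->data));
    if (!ref->data)
        return -1;
    ref->count_ptr = 0x1000UL; /* pretend a percpu area was allocated */
    return 0;
}

/* Models percpu_ref_exit(); false marks where the kernel would oops. */
static bool toy_exit(struct toy_ref *ref)
{
    if (!ref->data)
        return false;
    free(ref->data);
    ref->data = NULL;
    ref->count_ptr = TOY_DEAD;
    return true;
}

/* Start and stop the array twice. With exit_on_stop set, the exit happens
 * on every stop (the regressed behaviour); otherwise it happens once, when
 * the device is finally freed. */
static bool toy_restart_is_safe(bool exit_on_stop)
{
    struct toy_ref ref = { 0, NULL };
    bool ok = true;

    for (int cycle = 0; cycle < 2; cycle++) {
        if (toy_init_once(&ref))
            return false;
        if (exit_on_stop)
            ok = toy_exit(&ref) && ok;
    }
    if (!exit_on_stop)
        ok = toy_exit(&ref);
    return ok;
}
```

In this model, `toy_restart_is_safe(true)` fails on the second stop because the second start never reinitialises the reference, while `toy_restart_is_safe(false)` succeeds.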
      
      Fixes: 3e453522 ("md: Free resources in __md_stop")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com
      433279be
  2. 27 Mar, 2023 1 commit
    • loop: LOOP_CONFIGURE: send uevents for partitions · bb430b69
      Alyssa Ross authored
      LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
      combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall.  When
      using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
      each partition found on the loop device after the second ioctl(), but
      when using LOOP_CONFIGURE, no such uevent was being sent.
      
      In the old setup, uevents are disabled for LOOP_SET_FD, but not for
      LOOP_SET_STATUS64.  This makes sense, as it prevents uevents being
      sent for a partially configured device during LOOP_SET_FD - they're
      only sent at the end of LOOP_SET_STATUS64.  But for LOOP_CONFIGURE,
      uevents were disabled for the entire operation, so that final
      notification was never issued.  To fix this, reduce the critical
      section so that the loop_reread_partitions() call, which causes
      the uevents to be issued, runs after uevents are re-enabled, matching
      the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
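For orientation, a minimal user-space sketch of the one-step setup path is shown below. It assumes a Linux system whose `<linux/loop.h>` provides `LOOP_CONFIGURE` and `struct loop_config`; the helper names `fill_loop_config()` and `configure_loop()` are hypothetical, and error handling is left to the caller.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

/* Fill a loop_config for the given backing file descriptor and ask the
 * kernel to scan the device for partitions. */
static void fill_loop_config(struct loop_config *cfg, int backing_fd)
{
    memset(cfg, 0, sizeof(*cfg));
    cfg->fd = (unsigned int)backing_fd;
    cfg->info.lo_flags = LO_FLAGS_PARTSCAN;
}

/* Attach and configure in a single ioctl(), instead of LOOP_SET_FD
 * followed by LOOP_SET_STATUS64. Returns the ioctl() result: -1 on
 * error with errno set. */
static int configure_loop(int loop_fd, int backing_fd)
{
    struct loop_config cfg;

    fill_loop_config(&cfg, backing_fd);
    return ioctl(loop_fd, LOOP_CONFIGURE, &cfg);
}
```

A caller would typically open a free `/dev/loopN` node and the backing file, then pass both descriptors to `configure_loop()`; a udev listener should then see one uevent per partition found.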
      
      I noticed this because Busybox's losetup program recently changed from
      using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
      my setup, for which I want a notification from the kernel any time a
      new partition becomes available.
      Signed-off-by: Alyssa Ross <hi@alyssa.is>
      [hch: reduced the critical section]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 3448914e ("loop: Add LOOP_CONFIGURE ioctl")
      Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bb430b69
  3. 23 Mar, 2023 1 commit
  4. 22 Mar, 2023 2 commits
  5. 21 Mar, 2023 1 commit
  6. 18 Mar, 2023 1 commit
  7. 16 Mar, 2023 2 commits
    • block: remove obsolete config BLOCK_COMPAT · 8f0d196e
      Lukas Bulwahn authored
      Before commit bdc1ddad ("compat_ioctl: block: move
      blkdev_compat_ioctl() into ioctl.c"), the config BLOCK_COMPAT was used to
      include compat_ioctl.c into the kernel build. With this commit, the code
      is moved into ioctl.c and included with the config COMPAT. So, since then,
      the config BLOCK_COMPAT has had no effect and serves no further purpose.
      
      Remove this obsolete config BLOCK_COMPAT.
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20230316111630.4897-1-lukas.bulwahn@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8f0d196e
    • Merge tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme into block-6.3 · 890a2fb0
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.3
      
       - avoid potential UAF in nvmet_req_complete (Damien Le Moal)
       - more quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
       - fix a memory leak in the nvme-pci probe teardown path (Irvin Cote)
       - repair the MAINTAINERS entry (Lukas Bulwahn)
       - fix handling single range discard request (Ming Lei)
       - show more opcode names in trace events (Minwoo Im)
       - fix nvme-tcp timeout reporting (Sagi Grimberg)"
      
      * tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme:
        nvmet: avoid potential UAF in nvmet_req_complete()
        nvme-trace: show more opcode names
        nvme-tcp: add nvme-tcp pdu size build protection
        nvme-tcp: fix opcode reporting in the timeout handler
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000
        nvme-pci: fixing memory leak in probe teardown path
        nvme: fix handling single range discard request
        MAINTAINERS: repair malformed T: entries in NVM EXPRESS DRIVERS
      890a2fb0
  8. 15 Mar, 2023 17 commits
  9. 14 Mar, 2023 1 commit
    • block: do not reverse request order when flushing plug list · 34e0a279
      Jan Kara authored
      Commit 26fed4ac ("block: flush plug based on hardware and software
      queue order") changed flushing of plug list to submit requests one
      device at a time. However while doing that it also started using
      list_add_tail() instead of list_add() used previously thus effectively
      submitting requests in reverse order. Also when forming a rq_list with
      remaining requests (in case two or more devices are used), we
      effectively reverse the ordering of the plug list for each device we
      process. Submitting requests in reverse order has negative impact on
      performance for rotational disks (when BFQ is not in use). We observe
      10-25% regression in random 4k write throughput, as well as ~20%
      regression in MariaDB OLTP benchmark on rotational storage on btrfs
      filesystem.
      
      Fix the problem by preserving ordering of the plug list when inserting
      requests into the queuelist as well as by appending to requeue_list
      instead of prepending to it.
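The effect of list_add() versus list_add_tail() can be illustrated with a small user-space model. This is a sketch under one stated assumption: the plug list is built by prepending, so popping from it yields the newest request first. All `toy_*` names are hypothetical and the lists are plain singly linked lists, not the kernel's list_head or rq_list.

```c
#include <stddef.h>

struct toy_rq {
    int sector;
    struct toy_rq *next;
};

struct toy_list {
    struct toy_rq *head;
    struct toy_rq *tail;
};

/* Prepend, in the spirit of list_add(). */
static void toy_add(struct toy_list *l, struct toy_rq *rq)
{
    rq->next = l->head;
    l->head = rq;
    if (!l->tail)
        l->tail = rq;
}

/* Append, in the spirit of list_add_tail(). */
static void toy_add_tail(struct toy_list *l, struct toy_rq *rq)
{
    rq->next = NULL;
    if (l->tail)
        l->tail->next = rq;
    else
        l->head = rq;
    l->tail = rq;
}

/* Remove and return the first element, or NULL if the list is empty. */
static struct toy_rq *toy_pop(struct toy_list *l)
{
    struct toy_rq *rq = l->head;

    if (rq) {
        l->head = rq->next;
        if (!l->head)
            l->tail = NULL;
    }
    return rq;
}

/* Plug n requests (sectors 1..n, at most 16), move them to a dispatch
 * queue, and record the resulting submission order in out_order. */
static void toy_flush(int n, int use_tail, int *out_order)
{
    struct toy_rq rqs[16];
    struct toy_list plug = { NULL, NULL }, queue = { NULL, NULL };
    struct toy_rq *rq;
    int i;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++) {
        rqs[i].sector = i + 1;
        toy_add(&plug, &rqs[i]);      /* plug list: newest first */
    }
    while ((rq = toy_pop(&plug)) != NULL) {
        if (use_tail)
            toy_add_tail(&queue, rq); /* keeps the reversed order */
        else
            toy_add(&queue, rq);      /* restores the original order */
    }
    for (i = 0, rq = queue.head; rq; rq = rq->next)
        out_order[i++] = rq->sector;
}
```

With three requests, draining by prepending yields sectors 1, 2, 3, whereas draining by appending yields 3, 2, 1, which is the kind of reversal that hurts rotational disks.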
      
      Fixes: 26fed4ac ("block: flush plug based on hardware and software queue order")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34e0a279
  10. 13 Mar, 2023 2 commits
    • md: avoid signed overflow in slot_store() · 3bc57292
      NeilBrown authored
      slot_store() uses kstrtouint() to get a slot number, but stores the
      result in an "int" variable (by casting a pointer).
      This can result in a negative slot number if the unsigned int value is
      very large.
      
      A negative number means that the slot is empty, but setting a negative
      slot number this way will not remove the device from the array.  I don't
      think this is a serious problem, but it could cause confusion and it is
      best to fix it.
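The overflow can be reproduced in user space. The sketch below uses strtoul() as a stand-in for kstrtouint() and memcpy() as a stand-in for the pointer cast; the `toy_parse_slot_*` names and the choice of -ERANGE are assumptions made for illustration, not the error handling of the actual fix.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Old behaviour: parse as unsigned int, then store the raw bits in an
 * int. Values above INT_MAX come out negative. */
static int toy_parse_slot_unchecked(const char *buf, int *slot)
{
    char *end;
    unsigned long v;
    unsigned int u;

    errno = 0;
    v = strtoul(buf, &end, 10);
    if (errno || end == buf || *end != '\0' || v > UINT_MAX)
        return -EINVAL;
    u = (unsigned int)v;
    memcpy(slot, &u, sizeof(u)); /* same effect as the pointer cast */
    return 0;
}

/* Checked variant: reject values that do not fit in a non-negative int
 * and leave *slot untouched on failure. */
static int toy_parse_slot_checked(const char *buf, int *slot)
{
    int tmp;
    int err = toy_parse_slot_unchecked(buf, &tmp);

    if (err)
        return err;
    if (tmp < 0)
        return -ERANGE;
    *slot = tmp;
    return 0;
}
```

On a typical two's-complement platform with 32-bit int, the unchecked parser turns the input "4294967295" into slot -1, while the checked parser rejects it.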
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <song@kernel.org>
      3bc57292
    • md: Free resources in __md_stop · 3e453522
      Xiao Ni authored
      If md_run() fails after ->active_io is initialized, percpu_ref_exit()
      is called in the error path. However, md_free_disk() later calls
      percpu_ref_exit() again, which leads to a panic because of a null
      pointer dereference. The same bug can be triggered by any resource that
      is initialized, freed in the error path, and then freed again in
      md_free_disk().
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      Workqueue: md_misc mddev_delayed_delete
      RIP: 0010:free_percpu+0x110/0x630
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x44/0x70
       percpu_ref_exit+0x16/0x90
       md_free_disk+0x2f/0x80
       disk_release+0x101/0x180
       device_release+0x84/0x110
       kobject_put+0x12a/0x380
       kobject_put+0x160/0x380
       mddev_delayed_delete+0x19/0x30
       process_one_work+0x269/0x680
       worker_thread+0x266/0x640
       kthread+0x151/0x1b0
       ret_from_fork+0x1f/0x30
      
      To create a raid device, md raid calls do_md_run->md_run and dm raid
      calls md_run directly; the memory is allocated in md_run. To stop a raid
      device, md raid calls do_md_stop->__md_stop and dm raid calls
      md_stop->__md_stop. So we can free those memory resources in __md_stop.
      
      Fixes: 72adae23 ("md: Change active_io to percpu")
      Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      3e453522
  11. 08 Mar, 2023 1 commit
  12. 07 Mar, 2023 2 commits
  13. 05 Mar, 2023 8 commits
    • Linux 6.3-rc1 · fe15c26e
      Linus Torvalds authored
      fe15c26e
    • cpumask: re-introduce constant-sized cpumask optimizations · 596ff4a0
      Linus Torvalds authored
      Commit aa47a7c2 ("lib/cpumask: deprecate nr_cpumask_bits") resulted
      in the cpumask operations potentially becoming hugely less efficient,
      because suddenly the cpumask was always considered to be variable-sized.
      
      The optimization was then later added back in a limited form by commit
      6f9c07be ("lib/cpumask: add FORCE_NR_CPUS config option"), but that
      FORCE_NR_CPUS option is not useful in a generic kernel and more of a
      special case for embedded situations with fixed hardware.
      
      Instead, just re-introduce the optimization, with some changes.
      
      Instead of depending on CPUMASK_OFFSTACK being false, and then always
      using the full constant cpumask width, this introduces three different
      cpumask "sizes":
      
       - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids.
      
         This is used for situations where we should use the exact size.
      
       - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it
         fits in a single word and the bitmap operations thus end up able
         to trigger the "small_const_nbits()" optimizations.
      
         This is used for the operations that have optimized single-word
         cases that get inlined, notably the bit find and scanning functions.
      
       - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it
         is a sufficiently small constant that makes simple "copy" and
         "clear" operations more efficient.
      
         This is arbitrarily set at four words or less.
      
      As an example of this situation, without this fixed size optimization,
      cpumask_clear() will generate code like
      
              movl    nr_cpu_ids(%rip), %edx
              addq    $63, %rdx
              shrq    $3, %rdx
              andl    $-8, %edx
              callq   memset@PLT
      
      on x86-64, because it would calculate the "exact" number of longwords
      that need to be cleared.
      
      In contrast, with this patch, using an NR_CPUS of 64 (which is quite a
      reasonable value to use), the above becomes a single
      
              movq    $0,cpumask
      
      instruction instead, because instead of caring to figure out exactly how
      many CPU's the system has, it just knows that the cpumask will be a
      single word and can just clear it all.
      
      Note that this does end up tightening the rules a bit from the original
      version in another way: operations that set bits in the cpumask are now
      limited to the actual nr_cpu_ids limit, whereas we used to do the
      nr_cpumask_bits thing almost everywhere in the cpumask code.
      
      But if you just clear bits, or scan for bits, we can use the simpler
      compile-time constants.
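The difference between the exact size and a compile-time constant size can be sketched in plain C. This is a simplified user-space model, not the kernel's cpumask API: `TOY_NR_CPUS`, `toy_nr_cpu_ids` and the `toy_*` functions are hypothetical, and returning -1 for an out-of-range CPU is an assumption made to keep the example testable.

```c
#include <limits.h>
#include <string.h>

#define TOY_NR_CPUS 64 /* compile-time maximum, standing in for NR_CPUS */
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS ((TOY_NR_CPUS + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

struct toy_cpumask {
    unsigned long bits[TOY_WORDS];
};

/* Runtime CPU count, standing in for nr_cpu_ids (value assumed). */
static unsigned int toy_nr_cpu_ids = 8;

/* "Exact" clear: the length depends on a runtime value, so the compiler
 * has to compute it and call memset() with a variable size. */
static void toy_clear_exact(struct toy_cpumask *m)
{
    size_t words = (toy_nr_cpu_ids + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG;

    memset(m->bits, 0, words * sizeof(unsigned long));
}

/* Constant-size clear: the length is known at compile time, so an
 * optimising compiler is free to emit a plain store instead of a call. */
static void toy_clear_const(struct toy_cpumask *m)
{
    memset(m->bits, 0, sizeof(m->bits));
}

/* Setting bits stays limited to the runtime CPU count. */
static int toy_set_cpu(struct toy_cpumask *m, unsigned int cpu)
{
    if (cpu >= toy_nr_cpu_ids)
        return -1;
    m->bits[cpu / TOY_BITS_PER_LONG] |= 1UL << (cpu % TOY_BITS_PER_LONG);
    return 0;
}
```

Clearing more words than strictly necessary is harmless, which is why the constant size is acceptable there, whereas setting a bit beyond the runtime limit would create a CPU that does not exist.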
      
      In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()'
      which were not useful, and which fundamentally have to be limited to
      'nr_cpu_ids'.  Better remove them now than have somebody introduce use
      of them later.
      
      Of course, on x86-64 with MAXSMP there is no sane small compile-time
      constant for the cpumask sizes, and we end up using the actual CPU bits,
      and will generate the above kind of horrors regardless.  Please don't
      use MAXSMP unless you really expect to have machines with thousands of
      cores.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      596ff4a0
    • Merge tag 'v6.3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · f915322f
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "Fix a regression in the caam driver"
      
      * tag 'v6.3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: caam - Fix edesc/iv ordering mixup
      f915322f
    • Merge tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7f9ec7d8
      Linus Torvalds authored
      Pull x86 updates from Thomas Gleixner:
       "A small set of updates for x86:
      
         - Return -EIO instead of success when the certificate buffer for SEV
           guests is not large enough
      
         - Allow STIBP to be enabled with legacy IBRS. Legacy IBRS is cleared
           on return to userspace for performance reasons, but this leaves user
           space vulnerable to cross-thread attacks which STIBP prevents.
           Update the documentation accordingly"
      
      * tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        virt/sev-guest: Return -EIO if certificate buffer is not large enough
        Documentation/hw-vuln: Document the interaction between IBRS and STIBP
        x86/speculation: Allow enabling STIBP with legacy IBRS
      7f9ec7d8
    • Merge tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4e9c542c
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "A set of updates for the interrupt subsystem:
      
         - Prevent possible NULL pointer dereferences in
           irq_data_get_affinity_mask() and irq_domain_create_hierarchy()
      
         - Take the per device MSI lock before invoking code which relies on
           it being held
      
         - Make sure that MSI descriptors are unreferenced before freeing
           them. This was overlooked when the platform MSI code was converted
           to use core infrastructure and results in a false positive warning
      
         - Remove dead code in the MSI subsystem
      
         - Clarify the documentation for pci_msix_free_irq()
      
         - More kobj_type constification"
      
      * tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq/msi, platform-msi: Ensure that MSI descriptors are unreferenced
        genirq/msi: Drop dead domain name assignment
        irqdomain: Add missing NULL pointer check in irq_domain_create_hierarchy()
        genirq/irqdesc: Make kobj_type structures constant
        PCI/MSI: Clarify usage of pci_msix_free_irq()
        genirq/msi: Take the per-device MSI lock before validating the control structure
        genirq/ipi: Fix NULL pointer deref in irq_data_get_affinity_mask()
      4e9c542c
    • Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 1a90673e
      Linus Torvalds authored
      Pull vfs update from Al Viro:
       "Adding Christian Brauner as VFS co-maintainer"
      
      * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        Adding VFS co-maintainer
      1a90673e
    • Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 1a8d05a7
      Linus Torvalds authored
      Pull VM_FAULT_RETRY fixes from Al Viro:
       "Some of the page fault handlers do not deal with the following case
        correctly:
      
         - handle_mm_fault() has returned VM_FAULT_RETRY
      
         - there is a pending fatal signal
      
         - fault had happened in kernel mode
      
        The correct action in such a case is not "return unconditionally" - fatal
        signals are handled only upon return to userland and something like
        copy_to_user() would end up retrying the faulting instruction and
        triggering the same fault again and again.
      
        What we need to do in such a case is to make the caller treat that as a
        failed uaccess attempt - handle the exception if there is an exception
        handler for the faulting instruction, or oops if there isn't one.
      
        Over the years some architectures had been fixed and now are handling
        that case properly; some still do not. This series should fix the
        remaining ones.
      
        Status:
      
         - m68k, riscv, hexagon, parisc: tested/acked by maintainers.
      
         - alpha, sparc32, sparc64: tested locally - bug has been reproduced
           on the unpatched kernel and verified to be fixed by this series.
      
         - ia64, microblaze, nios2, openrisc: build, but otherwise completely
           untested"
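The decision described in the pull message can be written down as a small pure function. This is a user-space sketch of the logic only; `enum toy_action` and `toy_after_fault()` are hypothetical names and do not correspond to any architecture's actual fault handler.

```c
#include <stdbool.h>

enum toy_action {
    TOY_DONE,          /* fault was handled, nothing more to do */
    TOY_RETRY,         /* retry the fault */
    TOY_RETURN,        /* return; the fatal signal is handled on the way
                          back to userland */
    TOY_FIXUP_OR_OOPS, /* treat as a failed uaccess: run the exception
                          fixup if one exists, otherwise oops */
};

/* Decide what a page fault handler should do after handle_mm_fault(). */
static enum toy_action toy_after_fault(bool fault_retry,
                                       bool fatal_signal_pending,
                                       bool user_mode)
{
    if (fault_retry && fatal_signal_pending)
        /* Returning unconditionally in kernel mode would re-execute the
         * faulting instruction forever, hence the livelock. */
        return user_mode ? TOY_RETURN : TOY_FIXUP_OR_OOPS;
    if (fault_retry)
        return TOY_RETRY;
    return TOY_DONE;
}
```

The case the series addresses is `toy_after_fault(true, true, false)`: a retry request with a fatal signal pending while in kernel mode, which must end in the fixup path rather than a plain return.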
      
      * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        openrisc: fix livelock in uaccess
        nios2: fix livelock in uaccess
        microblaze: fix livelock in uaccess
        ia64: fix livelock in uaccess
        sparc: fix livelock in uaccess
        alpha: fix livelock in uaccess
        parisc: fix livelock in uaccess
        hexagon: fix livelock in uaccess
        riscv: fix livelock in uaccess
        m68k: fix livelock in uaccess
      1a8d05a7
    • Remove Intel compiler support · 95207db8
      Masahiro Yamada authored
      include/linux/compiler-intel.h had no update in the past 3 years.
      
      We often forget about the third C compiler to build the kernel.
      
      For example, commit a0a12c3e ("asm goto: eradicate CC_HAS_ASM_GOTO")
      only mentioned GCC and Clang.
      
      init/Kconfig defines CC_IS_GCC and CC_IS_CLANG but not CC_IS_ICC,
      and nobody has reported any issue.
      
      I guess the Intel Compiler support is broken, and nobody cares
      about it.
      
      Harald Arnesen pointed out ICC (classic Intel C/C++ compiler) is
      deprecated:
      
          $ icc -v
          icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is
          deprecated and will be removed from product release in the second half
          of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended
          compiler moving forward. Please transition to use this compiler. Use
          '-diag-disable=10441' to disable this message.
          icc version 2021.7.0 (gcc version 12.1.0 compatibility)
      
      Arnd Bergmann provided a link to the article, "Intel C/C++ compilers
      complete adoption of LLVM".
      
      lib/zstd/common/compiler.h and lib/zstd/compress/zstd_fast.c were kept
      untouched for better sync with https://github.com/facebook/zstd
      
      Link: https://www.intel.com/content/www/us/en/developer/articles/technical/adoption-of-llvm-complete-icx.html
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95207db8