  1. 25 Mar, 2021 3 commits
    • mm/mmu_notifiers: ensure range_end() is paired with range_start() · c2655835
      Sean Christopherson authored
      If one or more notifiers fails .invalidate_range_start(), invoke
      .invalidate_range_end() for "all" notifiers.  If there are multiple
      notifiers, those that did not fail are expecting _start() and _end() to
      be paired, e.g.  KVM's mmu_notifier_count would become imbalanced.
      Disallow notifiers that can fail _start() from implementing _end() so
      that it's unnecessary to either track which notifiers rejected _start(),
      or had already succeeded prior to a failed _start().
      
      Note, the existing behavior of calling _start() on all notifiers even
      after a previous notifier failed _start() was an unintended "feature".
      Make it canon now that the behavior is depended on for correctness.
      
      As of today, the bug is likely benign:
      
        1. The only caller of the non-blocking notifier is OOM kill.
        2. The only notifiers that can fail _start() are the i915 and Nouveau
           drivers.
        3. The only notifiers that utilize _end() are the SGI UV GRU driver
           and KVM.
        4. The GRU driver will never coincide with the i915/Nouveau drivers.
        5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
           _guest_, and the guest is already doomed due to being an OOM victim.
      
      Fix the bug now to play nice with future usage, e.g.  KVM has a
      potential use case for blocking memslot updates in KVM while an
      invalidation is in-progress, and failure to unblock would result in said
      updates being blocked indefinitely and hanging.
      
      Found by inspection.  Verified by adding a second notifier in KVM that
      periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
      and observing that KVM exits with an elevated notifier count.
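
The pairing rule described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel's implementation: `struct notifier`, `notifier_register_ok()`, `range_start_all()` and the `count` field are hypothetical names chosen for illustration, with `count` standing in for KVM's mmu_notifier_count.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of a notifier; not the kernel's struct. */
struct notifier {
	int can_fail;	/* _start() may return -EAGAIN on non-blockable ranges */
	int has_end;	/* notifier implements an _end() callback */
	int fail_now;	/* force the next _start() call to fail */
	int count;	/* stands in for KVM's mmu_notifier_count */
};

/* Reject notifiers that can fail _start() and also implement _end(). */
static int notifier_register_ok(const struct notifier *n)
{
	return !(n->can_fail && n->has_end);
}

static int notifier_start(struct notifier *n)
{
	if (n->can_fail && n->fail_now)
		return -EAGAIN;
	if (n->has_end)
		n->count++;
	return 0;
}

static void notifier_end(struct notifier *n)
{
	if (n->has_end)
		n->count--;
}

/*
 * Call _start() on every notifier even after a failure. If any of them
 * failed, call _end() on all of them so that the notifiers which did
 * succeed see a balanced _start()/_end() pair.
 */
static int range_start_all(struct notifier *ns, size_t nr)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		int err = notifier_start(&ns[i]);

		if (err)
			ret = err;
	}
	if (ret)
		for (i = 0; i < nr; i++)
			notifier_end(&ns[i]);
	return ret;
}
```

Because a notifier that can fail is not allowed to have an _end() callback in this model, calling notifier_end() on it is a no-op, so no per-notifier tracking of who failed is needed.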
      
      Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: fix per-page tags for non-page_alloc pages · cf10bd4c
      Andrey Konovalov authored
      To allow performing tag checks on page_alloc addresses obtained via
      page_address(), tag-based KASAN modes store tags for page_alloc
      allocations in page->flags.
      
      Currently, the default tag value stored in page->flags is 0x00.
      Therefore, page_address() returns a 0x00ffff...  address for pages that
      were not allocated via page_alloc.
      
      This might cause problems.  A particular case we encountered is a
      conflict with KFENCE.  If a KFENCE-allocated slab object is being freed
      via kfree(page_address(page) + offset), the address passed to kfree()
      will get tagged with 0x00 (as slab pages keep the default per-page
      tags).  This leads to the is_kfence_address() check failing, and a KFENCE
      object ending up in the normal slab freelist, which causes memory
      corruption.
      
      This patch changes the way KASAN stores tags in page->flags: they are now
      stored xor'ed with 0xff.  This way, KASAN doesn't need to initialize
      per-page flags for every created page, which might be slow.
      
      With this change, page_address() returns natively-tagged (with 0xff)
      pointers for pages that didn't have tags set explicitly.
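
The xor trick can be shown with a short standalone sketch. The field position and the helper names `page_tag_get()` and `page_tag_set()` are assumptions made for this illustration; they are not taken from the kernel's actual page->flags layout.

```c
#include <stdint.h>

/* Hypothetical layout: an 8-bit tag field at bit 56 of a flags word. */
#define TAG_SHIFT 56
#define TAG_MASK  0xffULL

/* Read the tag back; an untouched (all-zero) field decodes to 0xff. */
static uint8_t page_tag_get(uint64_t flags)
{
	return (uint8_t)(((flags >> TAG_SHIFT) & TAG_MASK) ^ 0xff);
}

/* Store the tag xor'ed with 0xff, leaving the other flag bits alone. */
static uint64_t page_tag_set(uint64_t flags, uint8_t tag)
{
	flags &= ~(TAG_MASK << TAG_SHIFT);
	flags |= ((uint64_t)(tag ^ 0xff) & TAG_MASK) << TAG_SHIFT;
	return flags;
}
```

With this encoding a zero-initialized flags word already means "native tag 0xff", so no extra initialization pass over every page is required.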
      
      This patch fixes the encountered conflict with KFENCE and prevents more
      similar issues that can occur in the future.
      
      Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings · d85aecf2
      Miaohe Lin authored
      The current implementation of hugetlb_cgroup for shared mappings could
      have different behavior.  Consider the following two scenarios:
      
       1. Assume initial css reference count of hugetlb_cgroup is 1:
        1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
            count is 2 associated with 1 file_region.
        1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
            count is 3 associated with 2 file_regions.
        1.3 coalesce_file_region will coalesce these two file_regions into
            one. So css reference count is 3 associated with 1 file_region
            now.
      
       2. Assume initial css reference count of hugetlb_cgroup is 1 again:
        2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
            count is 2 associated with 1 file_region.
      
      Therefore, we might have one file_region while holding one or more css
      reference counts. This inconsistency could lead to an imbalanced css_get()
      and css_put() pair. If we do css_put one by one (e.g. hole punch case),
      scenario 2 would put one css reference too many. If we do css_put all
      together (e.g. truncate case), scenario 1 will leak one css reference.
      
      The imbalanced css_get() and css_put() pair would result in a non-zero
      reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
      directory is removed, but the associated resources are not freed. In a
      busy workload this might ultimately result in OOM or in being unable to
      create a new hugetlb cgroup.
      
      In order to fix this, we have to make sure that one file_region must
      hold exactly one css reference. So in coalesce_file_region case, we
      should release one css reference before coalescence. Also only put css
      reference when the entire file_region is removed.
      
      The last thing to note is that the caller of region_add() will only hold
      one reference to h_cg->css for the whole contiguous reservation region.
      But this area might be scattered when some file_regions already reside
      in it. As a result, many file_regions may share only one h_cg->css
      reference. In order to ensure that one file_region holds exactly one css
      reference, we should do css_get() for each file_region and release the
      reference held by the caller when they are done.
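
The "one file_region holds exactly one css reference" invariant can be checked with a toy model of the two scenarios. This is a sketch only: `struct resv_map_model`, `region_add_model()` and `coalesce_last()` are hypothetical names, regions are assumed to be added in ascending order, and `css_refs` is a plain integer rather than a real css reference count.

```c
#include <stddef.h>

#define MAX_REGIONS 8

/* Toy reservation map: regions are [from, to) and kept in ascending order. */
struct resv_map_model {
	long from[MAX_REGIONS];
	long to[MAX_REGIONS];
	size_t nr;	/* number of file_regions */
	int css_refs;	/* models the css reference count */
};

/* Merge the last region into its predecessor when they are adjacent. */
static void coalesce_last(struct resv_map_model *m)
{
	while (m->nr >= 2 && m->to[m->nr - 2] == m->from[m->nr - 1]) {
		m->to[m->nr - 2] = m->to[m->nr - 1];
		m->nr--;
		m->css_refs--;	/* release one reference before coalescing */
	}
}

/* Add one region, taking exactly one reference for it. */
static int region_add_model(struct resv_map_model *m, long from, long to)
{
	if (m->nr == MAX_REGIONS)
		return -1;
	m->from[m->nr] = from;
	m->to[m->nr] = to;
	m->nr++;
	m->css_refs++;	/* one css_get() per file_region */
	coalesce_last(m);
	return 0;
}
```

Starting from a count of 1, adding [1, 2) and then [2, 3) ends with one region and a count of 2, which is the same state reached by adding [1, 3) directly.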
      
      [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
        Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
      Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings")
      Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 23 Mar, 2021 1 commit
  3. 22 Mar, 2021 1 commit
    • Merge tag 'selinux-pr-20210322' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 84196390
      Linus Torvalds authored
      Pull selinux fixes from Paul Moore:
       "Three SELinux patches:
      
         - Fix a problem where a local variable is used outside its associated
           function. Thankfully this can only be triggered by reloading the
           SELinux policy, which is a restricted operation for other obvious
           reasons.
      
         - Fix some incorrect, and inconsistent, audit and printk messages
           when loading the SELinux policy.
      
        All three patches are relatively minor and have been through our
        testing with no failures"
      
      * tag 'selinux-pr-20210322' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinuxfs: unify policy load error reporting
        selinux: fix variable scope issue in live sidtab conversion
        selinux: don't log MAC_POLICY_LOAD record on failed policy load
  4. 21 Mar, 2021 23 commits
  5. 20 Mar, 2021 7 commits
    • genirq: Disable interrupts for force threaded handlers · 81e2073c
      Thomas Gleixner authored
      With interrupt force threading all device interrupt handlers are invoked
      from kernel threads. Contrary to hard interrupt context, the invocation only
      disables bottom halves, but not interrupts. This was an oversight back then
      because any code like this will have an issue:
      
      thread(irq_A)
        irq_handler(A)
          spin_lock(&foo->lock);
      
      interrupt(irq_B)
        irq_handler(B)
          spin_lock(&foo->lock);
      
      This has been triggered with networking (NAPI vs. hrtimers) and console
      drivers where printk() happens from an interrupt which interrupted the
      force threaded handler.
      
      People noticed this and started to either change the spin_lock() in the
      handler to spin_lock_irqsave(), which affects performance, or add
      IRQF_NOTHREAD to the interrupt request, which in turn breaks RT.
      
      Fix the root cause and not the symptom and disable interrupts before
      invoking the force threaded handler which preserves the regular semantics
      and the usefulness of the interrupt force threading as a general debugging
      tool.
      
      For non-RT kernels this does not change much, except that during the
      execution of the threaded handler interrupts are delayed until the handler
      returns. With respect to scheduling and softirq processing there is no
      difference.
      
      For RT kernels there is no issue.
      
      Fixes: 8d32a307 ("genirq: Provide forced interrupt threading")
      Reported-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Johan Hovold <johan@kernel.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de
    • Merge tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 812da4d3
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
       "A handful of fixes for 5.12:
      
         - fix the SBI remote fence numbers for hypervisor fences, which had
           been transcribed in the wrong order in Linux. These fences are only
           used with the KVM patches applied.
      
         - fix a whole host of build warnings, these should have no functional
           change.
      
         - fix init_resources() to prevent an off-by-one error from causing an
           out-of-bounds array reference. This was manifesting during boot on
           vexriscv.
      
         - ensure the KASAN mappings are visible before proceeding to use
           them"
      
      * tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Correct SPARSEMEM configuration
        RISC-V: kasan: Declare kasan_shallow_populate() static
        riscv: Ensure page table writes are flushed when initializing KASAN vmalloc
        RISC-V: Fix out-of-bounds accesses in init_resources()
        riscv: Fix compilation error with Canaan SoC
        ftrace: Fix spelling mistake "disabed" -> "disabled"
        riscv: fix bugon.cocci warnings
        riscv: process: Fix no prototype for arch_dup_task_struct
        riscv: ftrace: Use ftrace_get_regs helper
        riscv: process: Fix no prototype for show_regs
        riscv: syscall_table: Reduce W=1 compilation warnings noise
        riscv: time: Fix no prototype for time_init
        riscv: ptrace: Fix no prototype warnings
        riscv: sbi: Fix comment of __sbi_set_timer_v01
        riscv: irq: Fix no prototype warning
        riscv: traps: Fix no prototype warnings
        RISC-V: correct enum sbi_ext_rfence_fid
    • Merge tag '5.12-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6 · bfdc4aa9
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Five cifs/smb3 fixes - three for stable, including an important ACL
        fix and security signature fix"
      
      * tag '5.12-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: fix allocation size on newly created files
        cifs: warn and fail if trying to use rootfs without the config option
        fs/cifs/: fix misspellings using codespell tool
        cifs: Fix preauth hash corruption
        cifs: update new ACE pointer after populate_new_aces.
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · af97713d
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Eight fixes, all in drivers, all fairly minor either being fixes in
        error legs, memory leaks on teardown, context errors or semantic
        problems"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: mpt3sas: Do not use GFP_KERNEL in atomic context
        scsi: ufs: ufs-mediatek: Correct operator & -> &&
        scsi: sd_zbc: Update write pointer offset cache
        scsi: lpfc: Fix some error codes in debugfs
        scsi: qla2xxx: Fix broken #endif placement
        scsi: st: Fix a use after free in st_open()
        scsi: myrs: Fix a double free in myrs_cleanup()
        scsi: ibmvfc: Free channel_setup_buf during device tear down
    • Merge tag 'zonefs-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs · 1c273e10
      Linus Torvalds authored
      Pull zonefs fixes from Damien Le Moal:
      
       - fix inode write open reference count (Chao)
      
       - Fix wrong write offset for asynchronous O_APPEND writes (me)
      
       - Prevent use of sequential zone files as swap files (me)
      
      * tag 'zonefs-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
        zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()
        zonefs: Fix O_APPEND async write handling
        zonefs: prevent use of seq files as swap file
    • Merge tag 'block-5.12-2021-03-19' of git://git.kernel.dk/linux-block · d626c692
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Just an NVMe pull request this week:
      
         - fix tag allocation for keep alive
      
         - fix a unit mismatch for the Write Zeroes limits
      
         - various TCP transport fixes (Sagi Grimberg, Elad Grupi)
      
         - fix iosqes and iocqes validation for discovery controllers (Sagi Grimberg)"
      
      * tag 'block-5.12-2021-03-19' of git://git.kernel.dk/linux-block:
        nvmet-tcp: fix kmap leak when data digest in use
        nvmet: don't check iosqes,iocqes for discovery controllers
        nvme-rdma: fix possible hang when failing to set io queues
        nvme-tcp: fix possible hang when failing to set io queues
        nvme-tcp: fix misuse of __smp_processor_id with preemption enabled
        nvme-tcp: fix a NULL deref when receiving a 0-length r2t PDU
        nvme: fix Write Zeroes limitations
        nvme: allocate the keep alive request using BLK_MQ_REQ_NOWAIT
        nvme: merge nvme_keep_alive into nvme_keep_alive_work
        nvme-fabrics: only reserve a single tag
    • Merge tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block · 0ada2dad
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "Quieter week this time, which was both expected and desired. About
        half of the below is fixes for this release, the other half are just
        fixes in general. In detail:
      
         - Fix the freezing of IO threads, by making the freezer not send them
           fake signals. Make them freezable by default.
      
         - Like we did for personalities, move the buffer IDR to xarray. Kills
           some code and avoids a use-after-free on teardown.
      
         - SQPOLL cleanups and fixes (Pavel)
      
         - Fix linked timeout race (Pavel)
      
         - Fix potential completion post use-after-free (Pavel)
      
         - Cleanup and move internal structures outside of general kernel view
           (Stefan)
      
         - Use MSG_NOSIGNAL for send/recv from io_uring (Stefan)"
      
      * tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block:
        io_uring: don't leak creds on SQO attach error
        io_uring: use typesafe pointers in io_uring_task
        io_uring: remove structures from include/linux/io_uring.h
        io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls
        io_uring: fix sqpoll cancellation via task_work
        io_uring: add generic callback_head helpers
        io_uring: fix concurrent parking
        io_uring: halt SQO submission on ctx exit
        io_uring: replace sqd rw_semaphore with mutex
        io_uring: fix complete_post use ctx after free
        io_uring: fix ->flags races by linked timeouts
        io_uring: convert io_buffer_idr to XArray
        io_uring: allow IO worker threads to be frozen
        kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing
  6. 19 Mar, 2021 5 commits
    • x86/apic/of: Fix CPU devicetree-node lookups · dd926880
      Johan Hovold authored
      Architectures that describe the CPU topology in devicetree and do not have
      an identity mapping between physical and logical CPU ids must override the
      default implementation of arch_match_cpu_phys_id().
      
      Failing to do so breaks CPU devicetree-node lookups using of_get_cpu_node()
      and of_cpu_device_node_get() which several drivers rely on. It also causes
      the CPU struct devices exported through sysfs to point to the wrong
      devicetree nodes.
      
      On x86, CPUs are described in devicetree using their APIC ids and those
      do not generally coincide with the logical ids, even if CPU0 typically
      uses APIC id 0.
      
      Add the missing implementation of arch_match_cpu_phys_id() so that CPU-node
      lookups also work with SMP.
      
      Apart from fixing the broken sysfs devicetree-node links this likely does
      not affect current users of mainline kernels on x86.
      
      Fixes: 4e07db9c ("x86/devicetree: Use CPU description from Device Tree")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210312092033.26317-1-johan@kernel.org
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · ecd8ee7f
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "Fixes for kvm on x86:
      
         - new selftests
      
         - fixes for migration with HyperV re-enlightenment enabled
      
         - fix RCU/SRCU usage
      
         - fixes for local_irq_restore misuse false positive"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        documentation/kvm: additional explanations on KVM_SET_BOOT_CPU_ID
        x86/kvm: Fix broken irq restoration in kvm_wait
        KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs
        KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
        selftests: kvm: add set_boot_cpu_id test
        selftests: kvm: add _vm_ioctl
        selftests: kvm: add get_msr_index_features
        selftests: kvm: Add basic Hyper-V clocksources tests
        KVM: x86: hyper-v: Don't touch TSC page values when guest opted for re-enlightenment
        KVM: x86: hyper-v: Track Hyper-V TSC page status
        KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs
        KVM: x86: hyper-v: Limit guest to writing zero to HV_X64_MSR_TSC_EMULATION_STATUS
        KVM: x86/mmu: Store the address space ID in the TDP iterator
        KVM: x86/mmu: Factor out tdp_iter_return_to_root
        KVM: x86/mmu: Fix RCU usage when atomically zapping SPTEs
        KVM: x86/mmu: Fix RCU usage in handle_removed_tdp_mmu_page
    • Merge tag 'gpio-fixes-for-v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 3149860d
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
       "Two fixes for the GPIO subsystem. Both address issues in the core GPIO
        code:
      
         - fix the return value in error path in gpiolib_dev_init()
      
         - fix the 'gpio-line-names' property handling correctly this time"
      
      * tag 'gpio-fixes-for-v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpiolib: Assign fwnode to parent's if no primary one provided
        gpiolib: Fix error return code in gpiolib_dev_init()
    • Merge tag 's390-5.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 6bfea141
      Linus Torvalds authored
      Pull s390 updates from Heiko Carstens:
      
       - disable preemption when accessing local per-cpu variables in the new
         counter set driver
      
       - fix steal time being inflated by a factor of four due to a missing
         cputime_to_nsecs() conversion
      
       - fix PCI device structure leak
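
The "factor of four" in the steal time item can be worked through with a small sketch. It assumes that one CPU-time unit is 1/4096 of a microsecond (125/512 ns); `cpu_units_to_ns()` is a hypothetical helper written for this illustration and is not the kernel's cputime_to_nsecs().

```c
#include <stdint.h>

/*
 * Assumption: one CPU-time unit is 1/4096 microsecond, i.e. 125/512 ns.
 * cpu_units_to_ns() is a hypothetical helper, not a kernel function.
 */
static uint64_t cpu_units_to_ns(uint64_t units)
{
	/* Handle the high and low 9 bits separately to limit overflow. */
	return ((units >> 9) * 125) + (((units & 0x1ff) * 125) >> 9);
}
```

Under that assumption 4096 raw units are 1000 ns, so treating the raw value as nanoseconds without conversion overstates the time by a factor of about 4.1.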
      
      * tag 's390-5.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/pci: fix leak of PCI device structure
        s390/vtime: fix increased steal time accounting
        s390/cpumf: disable preemption when accessing per-cpu variable
    • Merge tag 'trace-v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 278924cb
      Linus Torvalds authored
      Pull workqueue tracing fix from Steven Rostedt:
       "Fix workqueue trace event unsafe string reference
      
        After adding a verifier to test all strings printed in trace events to
        make sure they either point to a string on the ring buffer, or to read
        only core kernel memory, it triggered on a workqueue trace event. The
        trace event workqueue_queue_work references the allocated name of the
        workqueue in the output. If the workqueue is freed before the trace is
        read, then the trace will dereference freed memory.
      
        Update the trace event to use the __string(), __assign_str(), and
        __get_str() helpers to handle such cases"
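
The difference between referencing a name and copying it into the event can be shown with a userspace analogue. This sketch does not use the kernel's __string(), __assign_str() or __get_str() helpers; `struct trace_entry_model`, `trace_record()` and `WQ_NAME_LEN` are hypothetical names, and a fixed-size buffer stands in for space reserved on the ring buffer.

```c
#include <stdlib.h>
#include <string.h>

#define WQ_NAME_LEN 32

/* Hypothetical event record that owns a copy of the workqueue name. */
struct trace_entry_model {
	char name[WQ_NAME_LEN];
};

/* Copy the name at record time instead of storing the caller's pointer. */
static void trace_record(struct trace_entry_model *e, const char *name)
{
	strncpy(e->name, name, WQ_NAME_LEN - 1);
	e->name[WQ_NAME_LEN - 1] = '\0';
}
```

Because the record holds its own bytes, reading `e->name` stays valid after the original allocation has been freed, which is the property the fix restores for the workqueue trace event.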
      
      * tag 'trace-v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        workqueue/tracing: Copy workqueue name to buffer in trace event