1. 25 Dec, 2021 1 commit
    • kfence: fix memory leak when cat kfence objects · 0129ab1f
      Baokun Li authored
      Hulk robot reported a kmemleak problem:
      
          unreferenced object 0xffff93d1d8cc02e8 (size 248):
            comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s)
            hex dump (first 32 bytes):
              00 40 85 19 d4 93 ff ff 00 10 00 00 00 00 00 00  .@..............
              00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
            backtrace:
               seq_open+0x2a/0x80
               full_proxy_open+0x167/0x1e0
               do_dentry_open+0x1e1/0x3a0
               path_openat+0x961/0xa20
               do_filp_open+0xae/0x120
               do_sys_openat2+0x216/0x2f0
               do_sys_open+0x57/0x80
               do_syscall_64+0x33/0x40
               entry_SYSCALL_64_after_hwframe+0x44/0xa9
          unreferenced object 0xffff93d419854000 (size 4096):
            comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s)
            hex dump (first 32 bytes):
              6b 66 65 6e 63 65 2d 23 32 35 30 3a 20 30 78 30  kfence-#250: 0x0
              30 30 30 30 30 30 30 37 35 34 62 64 61 31 32 2d  0000000754bda12-
            backtrace:
               seq_read_iter+0x313/0x440
               seq_read+0x14b/0x1a0
               full_proxy_read+0x56/0x80
               vfs_read+0xa5/0x1b0
               ksys_read+0xa0/0xf0
               do_syscall_64+0x33/0x40
               entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      I find that we can easily reproduce this problem with the following
      commands:
      
      	cat /sys/kernel/debug/kfence/objects
      	echo scan > /sys/kernel/debug/kmemleak
      	cat /sys/kernel/debug/kmemleak
      
      The leaked memory is allocated in the stack below:
      
          do_syscall_64
            do_sys_open
              do_dentry_open
                full_proxy_open
                  seq_open            ---> alloc seq_file
            vfs_read
              full_proxy_read
                seq_read
                  seq_read_iter
                    traverse          ---> alloc seq_buf
      
      And it should have been released in the following process:
      
          do_syscall_64
            syscall_exit_to_user_mode
              exit_to_user_mode_prepare
                task_work_run
                  ____fput
                    __fput
                      full_proxy_release  ---> free here
      
      However, the release function corresponding to file_operations is not
      implemented in kfence.  As a result, a memory leak occurs.  Therefore,
      the solution to this problem is to implement the corresponding release
      function.
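
The open/release pairing described above can be modelled in userspace. This is only a sketch: `file_sketch`, `fops_sketch`, `buggy_ops` and `fixed_ops` are hypothetical stand-ins rather than kernel types, and `malloc()` plays the role of the seq_file allocation. In the kernel itself, the helper that frees what `seq_open()` allocated is `seq_release()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
static int live_allocations;

struct file_sketch {
    void *private_data;                     /* plays the role of the seq_file */
};

struct fops_sketch {
    int (*open)(struct file_sketch *f);
    int (*release)(struct file_sketch *f);  /* NULL models the missing hook */
};

static int sketch_open(struct file_sketch *f)
{
    f->private_data = malloc(248);
    if (!f->private_data)
        return -1;
    live_allocations++;
    return 0;
}

static int sketch_release(struct file_sketch *f)
{
    free(f->private_data);
    f->private_data = NULL;
    live_allocations--;
    return 0;
}

/* Before the fix: open without a matching release. */
static const struct fops_sketch buggy_ops = { sketch_open, NULL };
/* After the fix: every open is paired with a release. */
static const struct fops_sketch fixed_ops = { sketch_open, sketch_release };

/* Open and close one file; return how many allocations were left behind. */
static int open_close_once(const struct fops_sketch *ops)
{
    struct file_sketch f = { 0 };
    int leaked;

    live_allocations = 0;
    if (ops->open(&f) != 0)
        return -1;
    if (ops->release)
        ops->release(&f);
    leaked = live_allocations;
    free(f.private_data);   /* tidy up so the demo itself does not leak */
    return leaked;
}
```

With `buggy_ops` one allocation outlives the close, which is the situation kmemleak reported; with `fixed_ops` nothing is left behind.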
      
      Link: https://lkml.kernel.org/r/20211206133628.2822545-1-libaokun1@huawei.com
      Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure")
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0129ab1f
  2. 22 Dec, 2021 7 commits
  3. 21 Dec, 2021 9 commits
    • Merge tag 'pm-5.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 2f47a9a4
      Linus Torvalds authored
      Pull power management fix from Rafael Wysocki:
       "Fix a recent regression causing the loop in dpm_prepare() to become
        infinite if one of the device ->prepare() callbacks returns an error"
      
      * tag 'pm-5.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM: sleep: Fix error handling in dpm_prepare()
      2f47a9a4
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · ca0ea8a6
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
      
       - Fix for compilation of selftests on non-x86 architectures
      
       - Fix for kvm_run->if_flag on SEV-ES
      
       - Fix for page table use-after-free if yielding during exit_mm()
      
       - Improve behavior when userspace starts a nested guest with invalid
         state
      
       - Fix missed wakeup with assigned devices but no VT-d posted interrupts
      
       - Do not tell userspace to save/restore an unsupported PMU MSR
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU
        KVM: selftests: Add test to verify TRIPLE_FAULT on invalid L2 guest state
        KVM: VMX: Fix stale docs for kvm-intel.emulate_invalid_guest_state
        KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required
        KVM: VMX: Always clear vmx->fail on emulation_required
        selftests: KVM: Fix non-x86 compiling
        KVM: x86: Always set kvm_run->if_flag
        KVM: x86/mmu: Don't advance iterator after restart due to yielding
        KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all
      ca0ea8a6
    • parisc: Fix mask used to select futex spinlock · d3a5a68c
      John David Anglin authored
      The address bits used to select the futex spinlock need to match those used in
      the LWS code in syscall.S. The mask 0x3f8 only selects 7 bits.  It should
      select 8 bits.
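
The bit count is easy to check by hand or in code. The sketch below counts the set bits in a mask; `mask_width` is a hypothetical helper name, and 0x7f8 is used only as an example of an 8-bit mask that starts at the same low bit, since the text above does not state the corrected value.

```c
#include <assert.h>

/* Count the set bits in a mask (portable loop, no compiler builtins). */
static unsigned int mask_width(unsigned long mask)
{
    unsigned int bits = 0;

    while (mask) {
        bits += (unsigned int)(mask & 1UL);
        mask >>= 1;
    }
    return bits;
}
```

0x3f8 is binary 11 1111 1000, i.e. bits 3 through 9, so `mask_width(0x3f8UL)` yields 7, while the example mask 0x7f8 covers bits 3 through 10 and yields 8.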
      
      This change fixes the glibc nptl/tst-cond24 and nptl/tst-cond25 tests.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Fixes: 53a42b63 ("parisc: Switch to more fine grained lws locks")
      Cc: stable@vger.kernel.org # 5.10+
      Signed-off-by: Helge Deller <deller@gmx.de>
      d3a5a68c
    • parisc: Correct completer in lws start · 8f66fce0
      John David Anglin authored
      The completer in the "or,ev %r1,%r30,%r30" instruction is reversed, so we are
      not clipping the LWS number when we are called from a 32-bit process (W=0).
      We need to nullify the following depdi instruction when the least-significant
      bit of %r30 is 1.
      
      If the %r20 register is not clipped, a user process could perform a LWS call
      that would branch to an undefined location in the kernel and potentially crash
      the machine.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Helge Deller <deller@gmx.de>
      8f66fce0
    • Merge tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · 5dbdc4c5
      Linus Torvalds authored
      Pull nfsd fix from Chuck Lever:
       "Address a buffer overrun reported by Anatoly Trosinenko"
      
      * tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        NFSD: Fix READDIR buffer overflow
      5dbdc4c5
    • KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU · fdba608f
      Sean Christopherson authored
      Drop a check that guards triggering a posted interrupt on the currently
      running vCPU, and more importantly guards waking the target vCPU if
      triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
      If a vIRQ is delivered from asynchronous context, the target vCPU can be
      the currently running vCPU and can also be blocking, in which case
      skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
      be a wake event for the vCPU.
      
      The "do nothing" logic when "vcpu == running_vcpu" mostly works only
      because the majority of calls to ->deliver_posted_interrupt(), especially
      when using posted interrupts, come from synchronous KVM context.  But if
      a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
      and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
      use posted interrupts, wake events from the device will be delivered to
      KVM from IRQ context, e.g.
      
        vfio_msihandler()
        |
        |-> eventfd_signal()
            |
            |-> ...
                |
                |->  irqfd_wakeup()
                     |
                     |->kvm_arch_set_irq_inatomic()
                        |
                        |-> kvm_irq_delivery_to_apic_fast()
                            |
                            |-> kvm_apic_set_irq()
      
      This also aligns the non-nested and nested usage of triggering posted
      interrupts, and will allow for additional cleanups.
      
      Fixes: 379a3c8e ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
      Cc: stable@vger.kernel.org
      Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fdba608f
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · 1c3e979b
      Linus Torvalds authored
      Pull HID fixes from Jiri Kosina:
      
       - NULL pointer dereference fix in Vivaldi driver (Jiasheng Jiang)
      
       - regression fix for device probing in Holtek driver (Benjamin
         Tissoires)
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        HID: potential dereference of null pointer
        HID: holtek: fix mouse probing
      1c3e979b
    • ipmi: Fix UAF when uninstall ipmi_si and ipmi_msghandler module · ffb76a86
      Wu Bo authored
      Hi,
      
      When testing install and uninstall of ipmi_si.ko and ipmi_msghandler.ko,
      the system crashed.
      
      The log is as follows:
      [  141.087026] BUG: unable to handle kernel paging request at ffffffffc09b3a5a
      [  141.087241] PGD 8fe4c0d067 P4D 8fe4c0d067 PUD 8fe4c0f067 PMD 103ad89067 PTE 0
      [  141.087464] Oops: 0010 [#1] SMP NOPTI
      [  141.087580] CPU: 67 PID: 668 Comm: kworker/67:1 Kdump: loaded Not tainted 4.18.0.x86_64 #47
      [  141.088009] Workqueue: events 0xffffffffc09b3a40
      [  141.088009] RIP: 0010:0xffffffffc09b3a5a
      [  141.088009] Code: Bad RIP value.
      [  141.088009] RSP: 0018:ffffb9094e2c3e88 EFLAGS: 00010246
      [  141.088009] RAX: 0000000000000000 RBX: ffff9abfdb1f04a0 RCX: 0000000000000000
      [  141.088009] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
      [  141.088009] RBP: 0000000000000000 R08: ffff9abfffee3cb8 R09: 00000000000002e1
      [  141.088009] R10: ffffb9094cb73d90 R11: 00000000000f4240 R12: ffff9abfffee8700
      [  141.088009] R13: 0000000000000000 R14: ffff9abfdb1f04a0 R15: ffff9abfdb1f04a8
      [  141.088009] FS:  0000000000000000(0000) GS:ffff9abfffec0000(0000) knlGS:0000000000000000
      [  141.088009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  141.088009] CR2: ffffffffc09b3a30 CR3: 0000008fe4c0a001 CR4: 00000000007606e0
      [  141.088009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  141.088009] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  141.088009] PKRU: 55555554
      [  141.088009] Call Trace:
      [  141.088009]  ? process_one_work+0x195/0x390
      [  141.088009]  ? worker_thread+0x30/0x390
      [  141.088009]  ? process_one_work+0x390/0x390
      [  141.088009]  ? kthread+0x10d/0x130
      [  141.088009]  ? kthread_flush_work_fn+0x10/0x10
      [  141.088009]  ? ret_from_fork+0x35/0x40
      BUG: unable to handle kernel paging request at ffffffffc0b28a5a
      [  200.223240] PGD 97fe00d067 P4D 97fe00d067 PUD 97fe00f067 PMD a580cbf067 PTE 0
      [  200.223464] Oops: 0010 [#1] SMP NOPTI
      [  200.223579] CPU: 63 PID: 664 Comm: kworker/63:1 Kdump: loaded Not tainted 4.18.0.x86_64 #46
      [  200.224008] Workqueue: events 0xffffffffc0b28a40
      [  200.224008] RIP: 0010:0xffffffffc0b28a5a
      [  200.224008] Code: Bad RIP value.
      [  200.224008] RSP: 0018:ffffbf3c8e2a3e88 EFLAGS: 00010246
      [  200.224008] RAX: 0000000000000000 RBX: ffffa0799ad6bca0 RCX: 0000000000000000
      [  200.224008] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
      [  200.224008] RBP: 0000000000000000 R08: ffff9fe43fde3cb8 R09: 00000000000000d5
      [  200.224008] R10: ffffbf3c8cb53d90 R11: 00000000000f4240 R12: ffff9fe43fde8700
      [  200.224008] R13: 0000000000000000 R14: ffffa0799ad6bca0 R15: ffffa0799ad6bca8
      [  200.224008] FS:  0000000000000000(0000) GS:ffff9fe43fdc0000(0000) knlGS:0000000000000000
      [  200.224008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.224008] CR2: ffffffffc0b28a30 CR3: 00000097fe00a002 CR4: 00000000007606e0
      [  200.224008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  200.224008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  200.224008] PKRU: 55555554
      [  200.224008] Call Trace:
      [  200.224008]  ? process_one_work+0x195/0x390
      [  200.224008]  ? worker_thread+0x30/0x390
      [  200.224008]  ? process_one_work+0x390/0x390
      [  200.224008]  ? kthread+0x10d/0x130
      [  200.224008]  ? kthread_flush_work_fn+0x10/0x10
      [  200.224008]  ? ret_from_fork+0x35/0x40
      [  200.224008] kernel fault(0x1) notification starting on CPU 63
      [  200.224008] kernel fault(0x1) notification finished on CPU 63
      [  200.224008] CR2: ffffffffc0b28a5a
      [  200.224008] ---[ end trace c82a412d93f57412 ]---
      
      The reason is as follows:
      T1: rmmod ipmi_si.
          ->ipmi_unregister_smi()
              -> ipmi_bmc_unregister()
                  -> __ipmi_bmc_unregister()
                      -> kref_put(&bmc->usecount, cleanup_bmc_device);
                          -> schedule_work(&bmc->remove_work);
      
      T2: rmmod ipmi_msghandler.
          The ipmi_msghandler module is uninstalled, and the module space
          will be freed.
      
      T3: bmc->remove_work cleans up the bmc resources.
          -> cleanup_bmc_work()
              -> platform_device_unregister(&bmc->pdev);
                  -> platform_device_del(pdev);
                      -> device_del(&pdev->dev);
                          -> kobject_uevent(&dev->kobj, KOBJ_REMOVE);
                              -> kobject_uevent_env()
                                  -> dev_uevent()
                                      -> if (dev->type && dev->type->name)
      
          The 'dev->type' (bmc_device_type) pointer space has already been freed
          by the time the ipmi_msghandler module is uninstalled, so dereferencing
          'dev->type->name' crashes the system.
      
      drivers/char/ipmi/ipmi_msghandler.c:
      static const struct device_type bmc_device_type = {
              .groups         = bmc_dev_attr_groups,
      };
      
      Steps to reproduce:
      Add a time delay to the cleanup_bmc_work() function,
      then uninstall the ipmi_si and ipmi_msghandler modules.
      
      static void cleanup_bmc_work(struct work_struct *work)
      {
              struct bmc_device *bmc = container_of(work, struct bmc_device,
                                                    remove_work);
              int id = bmc->pdev.id; /* Unregister overwrites id */

              msleep(3000);   /* <--- delay added to reproduce the crash */
              platform_device_unregister(&bmc->pdev);
              ida_simple_remove(&ipmi_bmc_ida, id);
      }
      
      Use 'remove_work_wq' instead of 'system_wq' to solve this issue.
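
Why a module-owned queue helps can be shown with a small userspace model. Everything below is a hypothetical sketch (`workqueue_sketch`, `module_sketch` and the helper names are not kernel APIs): the point is only that a module which owns its queue can run all pending work before its code and data go away, whereas work left on a shared queue may run later.

```c
#include <assert.h>
#include <stddef.h>

#define WQ_SKETCH_MAX 8

typedef void (*work_fn_sketch)(void *arg);

/* Hypothetical single-threaded stand-in for a workqueue. */
struct workqueue_sketch {
    work_fn_sketch fn[WQ_SKETCH_MAX];
    void *arg[WQ_SKETCH_MAX];
    size_t pending;
};

static int queue_work_sketch(struct workqueue_sketch *wq,
                             work_fn_sketch fn, void *arg)
{
    if (wq->pending == WQ_SKETCH_MAX)
        return -1;
    wq->fn[wq->pending] = fn;
    wq->arg[wq->pending] = arg;
    wq->pending++;
    return 0;
}

/* Run everything still queued, in submission order, then empty the queue. */
static void drain_workqueue_sketch(struct workqueue_sketch *wq)
{
    size_t i;

    for (i = 0; i < wq->pending; i++)
        wq->fn[i](wq->arg[i]);
    wq->pending = 0;
}

struct module_sketch {
    struct workqueue_sketch remove_wq;  /* owned by the module itself */
    int text_valid;       /* 1 while the module's code and data are mapped */
    int cleanups_run;
    int use_after_free;   /* set if work ran after the "unload" */
};

static void cleanup_work_sketch(void *arg)
{
    struct module_sketch *m = arg;

    if (!m->text_valid)
        m->use_after_free = 1;
    m->cleanups_run++;
}

static void module_exit_sketch(struct module_sketch *m)
{
    drain_workqueue_sketch(&m->remove_wq);  /* finish pending work first */
    m->text_valid = 0;                      /* only now "unload" the module */
}

/* Returns 1 if the cleanup ran exactly once and before the unload. */
static int demo_unload_sketch(void)
{
    struct module_sketch m = { .text_valid = 1 };

    if (queue_work_sketch(&m.remove_wq, cleanup_work_sketch, &m) != 0)
        return 0;
    module_exit_sketch(&m);
    return m.cleanups_run == 1 && m.use_after_free == 0 &&
           m.remove_wq.pending == 0;
}
```

In the kernel, destroying a dedicated workqueue with `destroy_workqueue()` drains it first, which gives a module the same ordering guarantee on exit.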
      
      Fixes: b2cfd8ab ("ipmi: Rework device id and guid handling to catch changing BMCs")
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Message-Id: <1640070034-56671-1-git-send-email-wubo40@huawei.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      ffb76a86
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 6e0567b7
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "Last fixes before holidays. Nothing very exciting:
      
         - Work around a HW bug in HNS HIP08
      
         - Recent memory leak regression in qib
      
         - Incorrect use of kfree() for vmalloc memory in hns"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/hns: Replace kfree() with kvfree()
        IB/qib: Fix memory leak in qib_user_sdma_queue_pkts()
        RDMA/hns: Fix RNR retransmission issue for HIP08
      6e0567b7
  4. 20 Dec, 2021 14 commits
    • Merge tag 'spi-fix-v5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 86085fe7
      Linus Torvalds authored
      Pull spi fix from Mark Brown:
       "One small fix for a long standing issue with error handling on probe
        in the Armada driver"
      
      * tag 'spi-fix-v5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: change clk_disable_unprepare to clk_unprepare
      86085fe7
    • Merge tag 'regulator-fix-v5.16-rc6' of... · 3856c1b3
      Linus Torvalds authored
      Merge tag 'regulator-fix-v5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
      
      Pull regulator fix from Mark Brown:
       "Binding fix for v5.16
      
        This fixes problems validating DT bindings using op_mode which wasn't
        described as it should have been when converting to DT schema"
      
      * tag 'regulator-fix-v5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
        regulator: dt-bindings: samsung,s5m8767: add missing op_mode to bucks
      3856c1b3
    • Merge branch 'xsa' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 59b3f944
      Linus Torvalds authored
      Merge xen fixes from Juergen Gross:
       "Fixes for two issues related to Xen and malicious guests:
      
         - Guest can force the netback driver to hog large amounts of memory
      
         - Denial of Service in other guests due to event storms"
      
      * 'xsa' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/netback: don't queue unlimited number of packages
        xen/netback: fix rx queue stall detection
        xen/console: harden hvc_xen against event channel storms
        xen/netfront: harden netfront against event channel storms
        xen/blkfront: harden blkfront against event channel storms
      59b3f944
    • parisc: Clear stale IIR value on instruction access rights trap · 484730e5
      Helge Deller authored
      When a trap 7 (Instruction access rights) occurs, this means the CPU
      couldn't execute an instruction due to missing execute permissions on
      the memory region.  In this case it seems the CPU didn't even fetch
      the instruction from memory and thus did not store it in the cr19 (IIR)
      register before calling the trap handler. So, the trap handler will find
      some random old stale value in cr19.
      
      This patch simply overwrites the stale IIR value with a constant magic
      "bad food" value (0xbaadf00d), in the hope people don't start to try to
      understand the various random IIR values in trap 7 dumps.
      Noticed-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      484730e5
    • KVM: selftests: Add test to verify TRIPLE_FAULT on invalid L2 guest state · ab1ef344
      Sean Christopherson authored
      Add a selftest to attempt to enter L2 with invalid guest state by
      exiting to userspace via I/O from L2, and then using KVM_SET_SREGS to set
      invalid guest state (marking TR unusable is arbitrarily chosen for its
      relative simplicity).
      
      This is a regression test for a bug introduced by commit c8607e4a
      ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if
      !from_vmentry"), which incorrectly set vmx->fail=true when L2 had invalid
      guest state and ultimately triggered a WARN due to nested_vmx_vmexit()
      seeing vmx->fail==true while attempting to synthesize a nested VM-Exit.
      
      This is also a functional test to verify that KVM synthesizes TRIPLE_FAULT
      for L2, which is somewhat arbitrary behavior, instead of emulating L2.
      KVM should never emulate L2 due to invalid guest state, as it's
      architecturally impossible for L1 to run an L2 guest with invalid state
      as nested VM-Enter should always fail, i.e. L1 needs to do the emulation.
      Stuffing state via KVM ioctl() is a non-architectural, out-of-band case,
      hence the TRIPLE_FAULT being rather arbitrary.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-5-seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ab1ef344
    • KVM: VMX: Fix stale docs for kvm-intel.emulate_invalid_guest_state · 0ff29701
      Sean Christopherson authored
      Update the documentation for kvm-intel's emulate_invalid_guest_state to
      rectify the description of KVM's default behavior, and to document that
      the behavior and thus parameter only applies to L1.
      
      Fixes: a27685c3 ("KVM: VMX: Emulate invalid guest state by default")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-4-seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0ff29701
    • KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required · cd0e615c
      Sean Christopherson authored
      Synthesize a triple fault if L2 guest state is invalid at the time of
      VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
      guest state via ioctls(), e.g. KVM_SET_SREGS.  KVM should never emulate
      invalid guest state, since from L1's perspective, it's architecturally
      impossible for L2 to have invalid state while L2 is running in hardware.
      E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
      or #GP.
      
      Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
      can trigger this scenario, as nested VM-Enter correctly rejects any
      attempt to enter L2 with invalid state.
      
      RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
      behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
      via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
      Following AMD's SMRAM layout is important as AMD's layout saves/restores
      the descriptor cache information, including CS.RPL and SS.RPL, and also
      defines all the fields relevant to invalid guest state as read-only, i.e.
      so long as the vCPU had valid state before the SMI, which is guaranteed
      for L2, RSM will generate valid state unless SMRAM was modified.  Intel's
      layout saves/restores only the selector, which means that scenarios where
      the selector and cached RPL don't match, e.g. conforming code segments,
      would yield invalid guest state.  Intel CPUs fudge around this issue by
      stuffing SS.RPL and CS.RPL on RSM.  Per Intel's SDM on the "Default
      Treatment of RSM", paraphrasing for brevity:
      
        IF internal storage indicates that the [CPU was post-VMXON]
        THEN
           enter VMX operation (root or non-root);
           restore VMX-critical state as defined in Section 34.14.1;
           set to their fixed values any bits in CR0 and CR4 whose values must
           be fixed in VMX operation [unless coming from an unrestricted guest];
           IF RFLAGS.VM = 0 AND (in VMX root operation OR the
              “unrestricted guest” VM-execution control is 0)
           THEN
             CS.RPL := SS.DPL;
             SS.RPL := SS.DPL;
           FI;
           restore current VMCS pointer;
        FI;
      
      Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
      will synthesize TRIPLE_FAULT in this scenario.  KVM's behavior is allowed
      as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
      only way for CR0 and/or CR4 to have illegal values is if they were
      modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
      section states "modifying these registers will result in unpredictable
      behavior".
      
      KVM's ioctl() behavior is less straightforward.  Because KVM allows
      ioctls() to be executed in any order, rejecting an ioctl() if it would
      result in invalid L2 guest state is not an option as KVM cannot know if
      a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
      drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE.  Ideally, KVM would
      reject KVM_RUN if L2 contained invalid guest state, but that carries the
      risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
      exited to userspace.  Setting a flag/request to detect such a scenario is
      undesirable because (a) it's extremely unlikely to add value to KVM as a
      whole, and (b) KVM would need to consider ioctl() interactions with such
      a flag, e.g. if userspace migrated the vCPU while the flag were set.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-3-seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cd0e615c
    • KVM: VMX: Always clear vmx->fail on emulation_required · a80dfc02
      Sean Christopherson authored
      Revert a relatively recent change that set vmx->fail if the vCPU is in L2
      and emulation_required is true, as that behavior is completely bogus.
      Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:
      
        (a) it's impossible to have both a VM-Fail and VM-Exit
        (b) vmcs.EXIT_REASON is not modified on VM-Fail
        (c) emulation_required refers to guest state and guest state checks are
            always VM-Exits, not VM-Fails.
      
      For KVM specifically, emulation_required is handled before nested exits
      in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
      i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
      Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
      firing when tearing down the VM, as KVM never expects vmx->fail to be set
      when L2 is active; KVM always reflects those errors into L1.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
                                      nested_vmx_vmexit+0x16bd/0x17e0
                                      arch/x86/kvm/vmx/nested.c:4547
        Modules linked in:
        CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
        Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
        Call Trace:
         vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
         nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
         vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
         kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
         kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
         kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
         kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
         kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
         kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
         __fput+0x3fc/0x870 fs/file_table.c:280
         task_work_run+0x146/0x1c0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0x705/0x24f0 kernel/exit.c:832
         do_group_exit+0x168/0x2d0 kernel/exit.c:929
         get_signal+0x1740/0x2120 kernel/signal.c:2852
         arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
         do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c8607e4a ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
      Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a80dfc02
    • selftests: KVM: Fix non-x86 compiling · 577e022b
      Andrew Jones authored
      Attempting to compile on a non-x86 architecture fails with
      
      include/kvm_util.h: In function ‘vm_compute_max_gfn’:
      include/kvm_util.h:79:21: error: dereferencing pointer to incomplete type ‘struct kvm_vm’
        return ((1ULL << vm->pa_bits) >> vm->page_shift) - 1;
                           ^~
      
      This is because the declaration of struct kvm_vm is in
      lib/kvm_util_internal.h as an effort to make it private to
      the test lib code. We can still provide arch specific functions,
      though, by making the generic function symbols weak. Do that to
      fix the compile error.
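
A weak generic definition can be sketched as follows. This assumes a GCC- or Clang-compatible compiler for `__attribute__((weak))`; `vm_sketch`, `vm_compute_max_gfn_sketch` and `max_gfn_for` are hypothetical names chosen for illustration and only mirror the expression shown in the error message above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields the computation needs. */
struct vm_sketch {
    unsigned int pa_bits;
    unsigned int page_shift;
};

/*
 * Generic default, marked weak (a GCC/Clang extension). An
 * architecture-specific object file could define a non-weak function
 * with the same name, and the linker would then pick that one instead.
 */
__attribute__((weak))
uint64_t vm_compute_max_gfn_sketch(const struct vm_sketch *vm)
{
    return ((1ULL << vm->pa_bits) >> vm->page_shift) - 1;
}

/* Convenience wrapper so callers do not need to build the struct. */
static uint64_t max_gfn_for(unsigned int pa_bits, unsigned int page_shift)
{
    struct vm_sketch vm = { pa_bits, page_shift };

    return vm_compute_max_gfn_sketch(&vm);
}
```

Because the definition lives in a .c file that sees the full structure, a header no longer needs to dereference an incomplete type, and an architecture can still override the default at link time.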
      
      Fixes: c8cc43c1 ("selftests: KVM: avoid failures due to reserved HyperTransport region")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20211214151842.848314-1-drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      577e022b
    • KVM: x86: Always set kvm_run->if_flag · c5063551
      Marc Orr authored
      The kvm_run struct's if_flag is a part of the userspace/kernel API. The
      SEV-ES patches failed to set this flag because it's no longer needed by
      QEMU (according to the comment in the source code). However, other
      hypervisors may make use of this flag. Therefore, set the flag for
      guests with encrypted registers (i.e., with guest_state_protected set).
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: Marc Orr <marcorr@google.com>
      Message-Id: <20211209155257.128747-1-marcorr@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      c5063551
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't advance iterator after restart due to yielding · 3a0f64de
      Sean Christopherson authored
      After dropping mmu_lock in the TDP MMU, restart the iterator during
      tdp_iter_next() and do not advance the iterator.  Advancing the iterator
      results in skipping the top-level SPTE and all its children, which is
      fatal if any of the skipped SPTEs were not visited before yielding.
      
      When zapping all SPTEs, i.e. when min_level == root_level, restarting the
      iter and then invoking tdp_iter_next() is always fatal if the current gfn
      has a valid SPTE, as advancing the iterator results in try_step_side()
      skipping the current gfn, which wasn't visited before yielding.
      
      Sprinkle WARNs on iter->yielded being true in various helpers that are
      often used in conjunction with yielding, and tag the helper with
      __must_check to reduce the probability of improper usage.
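
The restart-instead-of-advance behavior can be illustrated with a toy, flat iterator. This is a simplified sketch under stated assumptions, not the TDP MMU code: every `toy_` name is hypothetical, the walk is over a flat range rather than a page-table tree, and the forward-progress guard is an assumption added so the example terminates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flat iterator; the real one walks a multi-level page table. */
struct toy_iter {
	size_t gfn;		/* current position */
	size_t next_gfn;	/* position to resume from after a restart */
	size_t yielded_gfn;	/* last position at which the walk yielded */
	size_t end;
	bool yielded;
	bool valid;
};

static void toy_iter_restart(struct toy_iter *it)
{
	it->gfn = it->next_gfn;
	it->yielded = false;
	it->valid = it->gfn < it->end;
}

static void toy_iter_start(struct toy_iter *it, size_t start, size_t end)
{
	it->next_gfn = start;
	it->yielded_gfn = (size_t)-1;
	it->end = end;
	toy_iter_restart(it);
}

static void toy_iter_next(struct toy_iter *it)
{
	/* After a yield, restart at the saved position and do NOT advance. */
	if (it->yielded) {
		toy_iter_restart(it);
		return;
	}
	it->gfn++;
	it->next_gfn = it->gfn;
	it->valid = it->gfn < it->end;
}

/* Returns true if the walk yielded; callers must then skip the current entry. */
__attribute__((warn_unused_result))
static bool toy_iter_cond_resched(struct toy_iter *it, bool need_resched)
{
	/* Only yield if forward progress was made since the last yield. */
	if (!need_resched || it->gfn == it->yielded_gfn)
		return false;
	/* A real implementation would drop and retake the lock here. */
	it->yielded_gfn = it->gfn;
	it->yielded = true;
	return true;
}

/* Visit every gfn in [0, n), yielding on every yield_every-th loop pass. */
static size_t toy_visit_all(size_t n, size_t yield_every, unsigned char *visited)
{
	struct toy_iter it;
	size_t steps = 0;

	for (toy_iter_start(&it, 0, n); it.valid; toy_iter_next(&it)) {
		steps++;
		if (toy_iter_cond_resched(&it, yield_every && steps % yield_every == 0))
			continue;
		visited[it.gfn] = 1;
	}
	return steps;
}
```

If `toy_iter_next()` advanced after a yield instead of restarting, the entry that was current when the walk yielded would never be visited, which is the failure mode the commit describes.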
      
      Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
      SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
      the shadow page will be leaked and KVM will WARN accordingly.
      
        WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2a0 [kvm]
         kvm_vcpu_release+0x34/0x60 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5c/0x90
         do_exit+0x364/0xa10
         ? futex_unqueue+0x38/0x60
         do_group_exit+0x33/0xa0
         get_signal+0x155/0x850
         arch_do_signal_or_restart+0xed/0x750
         exit_to_user_mode_prepare+0xc5/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x48/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
      kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
      marking a struct page as dirty/accessed after it has been put back on the
      free list.  This directly triggers a WARN due to encountering a page with
      page_count() == 0, but it can also lead to data corruption and additional
      errors in the kernel.
      
        WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
        RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
        Call Trace:
         <TASK>
         kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
         __handle_changed_spte+0x92e/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         zap_gfn_range+0x549/0x620 [kvm]
         kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
         mmu_free_root_page+0x219/0x2c0 [kvm]
         kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
         kvm_mmu_unload+0x1c/0xa0 [kvm]
         kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
         kvm_put_kvm+0x3b1/0x8b0 [kvm]
         kvm_vcpu_release+0x4e/0x70 [kvm]
         __fput+0x1f7/0x8c0
         task_work_run+0xf8/0x1a0
         do_exit+0x97b/0x2230
         do_group_exit+0xda/0x2a0
         get_signal+0x3be/0x1e50
         arch_do_signal_or_restart+0x244/0x17f0
         exit_to_user_mode_prepare+0xcb/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x4d/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Note, the underlying bug existed even before commit 1af4a960 ("KVM:
      x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
      tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
      incorrectly advance past a top-level entry when yielding on a lower-level
      entry.  But with respect to leaking shadow pages, the bug was introduced
      by yielding before processing the current gfn.
      
      Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
      callers could jump to their "retry" label.  The downside of that approach
      is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
      in the loop, and there's no easy way to enforce that requirement.
      
      Ideally, KVM would handle the cond_resched() fully within the iterator
      macro (the code is actually quite clean) and avoid this entire class of
      bugs, but that is extremely difficult to do while also supporting yielding
      after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
      SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
      bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
      may block operations on the SPTE for a significant amount of time.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 1af4a960 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
      Reported-by: Ignat Korchagin <ignat@cloudflare.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211214033528.123268-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3a0f64de
    • Jiasheng Jiang's avatar
      HID: potential dereference of null pointer · 13251ce1
      Jiasheng Jiang authored
      The return value of devm_kzalloc() needs to be checked to avoid
      hdev->dev->driver_data being NULL if the allocation fails.
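
The check can be sketched in userspace C as follows. This is an illustration only: the `toy_` types and the injectable allocator are hypothetical stand-ins so the failure path can be exercised, and calloc() stands in for devm_kzalloc().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's private data and the device. */
struct toy_drvdata {
	unsigned int info;
};

struct toy_device {
	void *driver_data;
};

/* Injectable allocator so the failure path can be tested. */
typedef void *(*toy_alloc_fn)(size_t size);

static void *toy_alloc_ok(size_t size)
{
	return calloc(1, size);	/* zeroed, like devm_kzalloc() */
}

static void *toy_alloc_fail(size_t size)
{
	(void)size;
	return NULL;		/* simulate allocation failure */
}

static int toy_probe(struct toy_device *dev, toy_alloc_fn alloc)
{
	struct toy_drvdata *drvdata = alloc(sizeof(*drvdata));

	/* Bail out before publishing a NULL pointer as driver data. */
	if (!drvdata)
		return -ENOMEM;

	dev->driver_data = drvdata;
	return 0;
}
```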
      
      Fixes: 14c9c014 ("HID: add vivaldi HID driver")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
      Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Link: https://lore.kernel.org/r/20211215083605.117638-1-jiasheng@iscas.ac.cn
      13251ce1
    • Benjamin Tissoires's avatar
      HID: holtek: fix mouse probing · 93a2207c
      Benjamin Tissoires authored
      An oversight in the previous commit: we don't even parse or start the
      device, meaning that the device is not presented to user space.
      
      Fixes: 93020953 ("HID: check for valid USB device for many HID drivers")
      Cc: stable@vger.kernel.org
      Link: https://bugs.archlinux.org/task/73048
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215341
      Link: https://lore.kernel.org/r/e4efbf13-bd8d-0370-629b-6c80c0044b15@leemhuis.info/
      Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      93a2207c
    • Wei Wang's avatar
      KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all · 9fb12fe5
      Wei Wang authored
      Fixed counter 3 is used for the Topdown metrics, which haven't been
      enabled for KVM guests. Userspace accesses to it fail because it is not
      included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines,
      which have this counter.
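
The mismatch can be illustrated with a small filter over an MSR list. This is a sketch of the general idea, not the patch, which per its title simply removes the entry from msrs_to_save_all. The `toy_` names and the assumed maximum of four fixed counters are hypothetical; the base index follows from the failing MSR 0x30c being fixed counter 3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FIXED_CTR0     0x309u	/* fixed counter 0; counter 3 is 0x30c */
#define TOY_MAX_FIXED_CTRS 4u		/* hypothetical size of the fixed range */

/*
 * Copy MSR indices from in[] to out[], dropping fixed-counter MSRs whose
 * counter number is not backed by the (hypothetical) guest PMU model.
 * Returns the number of indices kept.
 */
static size_t toy_filter_msrs(const uint32_t *in, size_t n, uint32_t *out,
			      unsigned int nr_fixed)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t msr = in[i];
		bool is_fixed = msr >= TOY_FIXED_CTR0 &&
				msr < TOY_FIXED_CTR0 + TOY_MAX_FIXED_CTRS;

		/* Advertising an MSR that cannot be read makes bulk reads fail. */
		if (is_fixed && msr - TOY_FIXED_CTR0 >= nr_fixed)
			continue;
		out[kept++] = msr;
	}
	return kept;
}
```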
      
      To reproduce it on ICX+ machines, ./state_test reports:
      ==== Test Assertion Failure ====
      lib/x86_64/processor.c:1078: r == nmsrs
      pid=4564 tid=4564 - Argument list too long
      1  0x000000000040b1b9: vcpu_save_state at processor.c:1077
      2  0x0000000000402478: main at state_test.c:209 (discriminator 6)
      3  0x00007fbe21ed5f92: ?? ??:0
      4  0x000000000040264d: _start at ??:?
       Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c)
      
      With this patch, it works well.
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9fb12fe5
  5. 19 Dec, 2021 9 commits