1. 03 Sep, 2020 32 commits
  2. 26 Aug, 2020 8 commits
    • Greg Kroah-Hartman's avatar
      c6a15d15
    • KVM: arm/arm64: Don't reschedule in unmap_stage2_range() · c0ca97bc
      Will Deacon authored
      Upstream commits fdfe7cbd ("KVM: Pass MMU notifier range flags to
      kvm_unmap_hva_range()") and b5331379 ("KVM: arm64: Only reschedule
      if MMU_NOTIFIER_RANGE_BLOCKABLE is not set") fix a "sleeping from invalid
      context" BUG caused by unmap_stage2_range() attempting to reschedule when
      called on the OOM path.
      
      Unfortunately, these patches rely on the MMU notifier callback being
      passed knowledge about whether or not blocking is permitted, which was
      introduced in 4.19. Rather than backport this considerable amount of
      infrastructure just for KVM on arm, instead just remove the conditional
      reschedule.
      
      Cc: <stable@vger.kernel.org> # v4.9 only
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xen: don't reschedule in preemption off sections · 606c6eb9
      Juergen Gross authored
      For support of long running hypercalls xen_maybe_preempt_hcall() is
      calling cond_resched() in case a hypercall marked as preemptible has
      been interrupted.
      
      Normally this is no problem, as only hypercalls done via some ioctl()s
      are marked to be preemptible. In rare cases when during such a
      preemptible hypercall an interrupt occurs and any softirq action is
      started from irq_exit(), a further hypercall issued by the softirq
      handler will be regarded to be preemptible, too. This might lead to
      rescheduling in spite of the softirq handler potentially having set
      preempt_disable(), leading to splats like:
      
      BUG: sleeping function called from invalid context at drivers/xen/preempt.c:37
      in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 20775, name: xl
      INFO: lockdep is turned off.
      CPU: 1 PID: 20775 Comm: xl Tainted: G D W 5.4.46-1_prgmr_debug.el7.x86_64 #1
      Call Trace:
      <IRQ>
      dump_stack+0x8f/0xd0
      ___might_sleep.cold.76+0xb2/0x103
      xen_maybe_preempt_hcall+0x48/0x70
      xen_do_hypervisor_callback+0x37/0x40
      RIP: e030:xen_hypercall_xen_version+0xa/0x20
      Code: ...
      RSP: e02b:ffffc900400dcc30 EFLAGS: 00000246
      RAX: 000000000004000d RBX: 0000000000000200 RCX: ffffffff8100122a
      RDX: ffff88812e788000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffffff83ee3ad0 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000246 R12: ffff8881824aa0b0
      R13: 0000000865496000 R14: 0000000865496000 R15: ffff88815d040000
      ? xen_hypercall_xen_version+0xa/0x20
      ? xen_force_evtchn_callback+0x9/0x10
      ? check_events+0x12/0x20
      ? xen_restore_fl_direct+0x1f/0x20
      ? _raw_spin_unlock_irqrestore+0x53/0x60
      ? debug_dma_sync_single_for_cpu+0x91/0xc0
      ? _raw_spin_unlock_irqrestore+0x53/0x60
      ? xen_swiotlb_sync_single_for_cpu+0x3d/0x140
      ? mlx4_en_process_rx_cq+0x6b6/0x1110 [mlx4_en]
      ? mlx4_en_poll_rx_cq+0x64/0x100 [mlx4_en]
      ? net_rx_action+0x151/0x4a0
      ? __do_softirq+0xed/0x55b
      ? irq_exit+0xea/0x100
      ? xen_evtchn_do_upcall+0x2c/0x40
      ? xen_do_hypervisor_callback+0x29/0x40
      </IRQ>
      ? xen_hypercall_domctl+0xa/0x20
      ? xen_hypercall_domctl+0x8/0x20
      ? privcmd_ioctl+0x221/0x990 [xen_privcmd]
      ? do_vfs_ioctl+0xa5/0x6f0
      ? ksys_ioctl+0x60/0x90
      ? trace_hardirqs_off_thunk+0x1a/0x20
      ? __x64_sys_ioctl+0x16/0x20
      ? do_syscall_64+0x62/0x250
      ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix that by testing preempt_count() before calling cond_resched().
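
The condition described above can be modelled in user space roughly as follows. This is only a sketch: `struct hcall_state`, `may_resched_in_hcall()` and the field names are hypothetical stand-ins for the per-CPU "preemptible hypercall" flag, need_resched() and preempt_count(); it is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the state consulted on the upcall path. */
struct hcall_state {
	bool in_preemptible_hcall; /* a preemptible hypercall was interrupted */
	bool need_resched;         /* the scheduler wants to run another task */
	int preempt_count;         /* non-zero: preemption is disabled */
};

/* Rescheduling is only safe when nothing is holding preemption off. */
static bool may_resched_in_hcall(const struct hcall_state *s)
{
	return s->in_preemptible_hcall && s->need_resched &&
	       s->preempt_count == 0;
}

/* The softirq case from the splat: in_atomic() is 1, so do not reschedule. */
static void check_softirq_case(void)
{
	struct hcall_state softirq = { true, true, 1 };
	struct hcall_state ioctl_path = { true, true, 0 };

	assert(!may_resched_in_hcall(&softirq));
	assert(may_resched_in_hcall(&ioctl_path));
}
```

Calling `check_softirq_case()` exercises both paths: the hypercall issued from the softirq handler is no longer treated as a rescheduling point, while the plain ioctl() path still is.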
      
      In kernel 5.8 this can't happen any more due to the entry code rework
      (more than 100 patches, so not a candidate for backporting).
      
      The issue was introduced in kernel 4.3, so this patch should go into
      all stable kernels in [4.3 ... 5.7].

      Reported-by: Sarah Newman <srn@prgmr.com>
      Fixes: 0fa2f5cb ("sched/preempt, xen: Use need_resched() instead of should_resched()")
      Cc: Sarah Newman <srn@prgmr.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Tested-by: Chris Brannon <cmb@prgmr.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible · fe5f83b1
      Peter Xu authored
      commit 75802ca6 upstream.
      
      This was found by code inspection only.
      
      Firstly, the worst case scenario should assume the whole range was covered
      by pmd sharing.  The old algorithm might not work as expected for ranges
      like (1g-2m, 1g+2m), where the adjusted range would be (0, 1g+2m) but the
      expected range is (0, 2g).
      
      While at it, remove the loop since it should not be required.  With that,
      the new code should be faster too when the invalidating range is huge.
      
      Mike said:
      
      : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
      : adjust to (0, 1g+2m) which is incorrect.
      :
      : We should cc stable.  The original reason for adjusting the range was to
      : prevent data corruption (getting wrong page).  Since the range is not
      : always adjusted correctly, the potential for corruption still exists.
      :
      : However, I am fairly confident that adjust_range_if_pmd_sharing_possible
      : is only going to be called in two cases:
      :
      : 1) for a single page
      : 2) for range == entire vma
      :
      : In those cases, the current code should produce the correct results.
      :
      : To be safe, let's just cc stable.
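
One way to compute the worst-case range discussed above, without a loop, is to widen the range to PUD boundaries and then clamp it to the vma. The sketch below is a user-space illustration only: `struct range`, `adjust_range()` and the helper names are hypothetical, PUD_SIZE is assumed to be 1 GiB, and the vma flag checks the real function would need are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)
/* Assumption: one PUD entry covers 1 GiB. */
#define PUD_SIZE SZ_1G

struct range {
	uint64_t start;
	uint64_t end;
};

static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Worst case: widen to PUD boundaries, then clamp to the vma. */
static struct range adjust_range(struct range vma, struct range r)
{
	struct range out;
	uint64_t s = align_down(r.start, PUD_SIZE);
	uint64_t e = align_up(r.end, PUD_SIZE);

	out.start = s > vma.start ? s : vma.start;
	out.end = e < vma.end ? e : vma.end;
	return out;
}

/* The example from the text: (1g-2m, 1g+2m) inside a vma (0, 2g). */
static void check_example(void)
{
	struct range vma = { 0, 2 * SZ_1G };
	struct range r = { SZ_1G - SZ_2M, SZ_1G + SZ_2M };
	struct range out = adjust_range(vma, r);

	assert(out.start == 0);
	assert(out.end == 2 * SZ_1G);
}
```

Because the range straddles the 1g boundary, both neighbouring PUD-sized regions may hold shared pmds, which is why the result has to reach (0, 2g) rather than stop at 1g+2m.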
      
      Fixes: 017b1660 ("mm: migration: fix migration of huge PMD shared pages")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • do_epoll_ctl(): clean the failure exits up a bit · b3ce6ca9
      Al Viro authored
      commit 52c47969 upstream.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • epoll: Keep a reference on files added to the check list · 9bbd2032
      Marc Zyngier authored
      commit a9ed4a65 upstream.
      
      When adding a new fd to an epoll, and this new fd is itself an
      epoll fd, we recursively scan the fds attached to it
      to detect cycles, and add non-epoll files to a "check list"
      that gets subsequently parsed.
      
      However, this check list isn't completely safe when deletions
      can happen concurrently. To sidestep the issue, make sure that
      a struct file placed on the check list sees its f_count increased,
      ensuring that a concurrent deletion won't result in the file
      disappearing from under our feet.
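
The idea of pinning a file while it sits on the check list can be illustrated with a small single-threaded model. Everything here is hypothetical (`struct file_model`, `check_list_add()`, `check_list_clear()` and the fixed-size array); it only mimics the reference counting, not the locking or the actual epoll data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHECK_LIST_MAX 8

/* Hypothetical stand-in for struct file: only the reference count matters. */
struct file_model {
	int f_count;
	bool freed;
};

static void get_file_model(struct file_model *f)
{
	f->f_count++;
}

static void put_file_model(struct file_model *f)
{
	if (--f->f_count == 0)
		f->freed = true; /* the kernel would release the file here */
}

struct check_list {
	struct file_model *files[CHECK_LIST_MAX];
	size_t n;
};

/* Take a reference for as long as the file sits on the list. */
static bool check_list_add(struct check_list *l, struct file_model *f)
{
	if (l->n == CHECK_LIST_MAX)
		return false;
	get_file_model(f);
	l->files[l->n++] = f;
	return true;
}

/* Drop the references once the list has been processed. */
static void check_list_clear(struct check_list *l)
{
	while (l->n > 0)
		put_file_model(l->files[--l->n]);
}

/* A deletion that drops the last outside reference must not free the file. */
static void check_concurrent_delete(void)
{
	struct file_model f = { 1, false };
	struct check_list l = { { NULL }, 0 };

	assert(check_list_add(&l, &f));
	put_file_model(&f);   /* models the concurrent deletion */
	assert(!f.freed);     /* still pinned by the check list */
	check_list_clear(&l);
	assert(f.freed);      /* released only after the list is done */
}
```

Without the extra reference taken in `check_list_add()`, the `put_file_model()` in the middle would drop the count to zero while the list still pointed at the file.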
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc: Allow 4224 bytes of stack expansion for the signal frame · a7fef53a
      Michael Ellerman authored
      commit 63dee5df upstream.
      
      We have powerpc specific logic in our page fault handling to decide if
      an access to an unmapped address below the stack pointer should expand
      the stack VMA.
      
      The code was originally added in 2004 "ported from 2.4". The rough
      logic is that the stack is allowed to grow to 1MB with no extra
      checking. Over 1MB the access must be within 2048 bytes of the stack
      pointer, or be from a user instruction that updates the stack pointer.
      
      The 2048 byte allowance below the stack pointer is there to cover the
      288 byte "red zone" as well as the "about 1.5kB" needed by the signal
      delivery code.
      
      Unfortunately since then the signal frame has expanded, and is now
      4224 bytes on 64-bit kernels with transactional memory enabled. This
      means if a process has consumed more than 1MB of stack, and its stack
      pointer lies less than 4224 bytes from the next page boundary, signal
      delivery will fault when trying to expand the stack and the process
      will see a SEGV.
      
      The total size of the signal frame is the size of struct rt_sigframe
      (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
      64-bit).
      
      The 2048 byte allowance was correct until 2008 as the signal frame
      was:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1440 */
              /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1440    16 */
              unsigned int               tramp[6];             /*  1456    24 */
              struct siginfo *           pinfo;                /*  1480     8 */
              void *                     puc;                  /*  1488     8 */
              struct siginfo     info;                         /*  1496   128 */
              /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1624   288 */
      
              /* size: 1920, cachelines: 15, members: 7 */
              /* padding: 8 */
      };
      
      1920 + 128 = 2048
      
      Then in commit ce48b210 ("powerpc: Add VSX context save/restore,
      ptrace and signal support") (Jul 2008) the signal frame expanded to
      2304 bytes:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */	<--
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1696    16 */
              unsigned int               tramp[6];             /*  1712    24 */
              struct siginfo *           pinfo;                /*  1736     8 */
              void *                     puc;                  /*  1744     8 */
              struct siginfo     info;                         /*  1752   128 */
              /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1880   288 */
      
              /* size: 2176, cachelines: 17, members: 7 */
              /* padding: 8 */
      };
      
      2176 + 128 = 2304
      
      At this point we should have been exposed to the bug, though as far as
      I know it was never reported. I no longer have a system old enough to
      easily test on.
      
      Then in 2010 commit 320b2b8d ("mm: keep a guard page below a
      grow-down stack segment") caused our stack expansion code to never
      trigger, as there was always a VMA found for a write up to PAGE_SIZE
      below r1.
      
      That meant the bug was hidden as we continued to expand the signal
      frame in commit 2b0a576d ("powerpc: Add new transactional memory
      state to the signal context") (Feb 2013):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */	<--
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[288];          /*  3576   288 */
      
              /* size: 3872, cachelines: 31, members: 8 */
              /* padding: 8 */
              /* last cacheline: 32 bytes */
      };
      
      3872 + 128 = 4000
      
      And commit 573ebfa6 ("powerpc: Increase stack redzone for 64-bit
      userspace to 512 bytes") (Feb 2014):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[512];          /*  3576   512 */	<--
      
              /* size: 4096, cachelines: 32, members: 8 */
              /* padding: 8 */
      };
      
      4096 + 128 = 4224
      
      Then finally in 2017, commit 1be7107f ("mm: larger stack guard
      gap, between vmas") exposed us to the existing bug, because it changed
      the stack VMA to be the correct/real size, meaning our stack expansion
      code is now triggered.
      
      Fix it by increasing the allowance to 4224 bytes.
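
The effect of the allowance can be seen in a simplified user-space model of the check described earlier. The function `may_expand_stack()`, its parameters and the addresses used are hypothetical; only the 1MB threshold, the 2048-byte and 4224-byte allowances and the "instruction updates the stack pointer" exception come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STACK_FREE_GROWTH (1ULL << 20) /* 1MB of growth with no extra checks */
#define SIGFRAME_SIZE     4224ULL      /* 4096 + 128, from the layout above */

/*
 * Hypothetical model of the check: returns true when an access below the
 * stack VMA is allowed to expand it.
 */
static bool may_expand_stack(uint64_t addr, uint64_t sp, uint64_t stack_top,
			     uint64_t allowance, bool updates_sp)
{
	if (addr + STACK_FREE_GROWTH >= stack_top)
		return true;   /* within the first 1MB of stack */
	if (addr + allowance >= sp)
		return true;   /* close enough to the stack pointer */
	return updates_sp;     /* e.g. a store that also updates r1 */
}

/* Signal delivery on a stack that has already used more than 1MB. */
static void check_signal_frame(void)
{
	uint64_t stack_top = 0x7fff00000000ULL;
	uint64_t sp = stack_top - 2 * STACK_FREE_GROWTH;
	uint64_t frame = sp - SIGFRAME_SIZE; /* lowest address of the frame */

	assert(!may_expand_stack(frame, sp, stack_top, 2048, false));
	assert(may_expand_stack(frame, sp, stack_top, SIGFRAME_SIZE, false));
}
```

With the old 2048-byte allowance the write to the bottom of the signal frame is rejected, which corresponds to the SEGV described above; with 4224 bytes it is accepted.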
      
      Hard-coding 4224 is obviously unsafe against future expansions of the
      signal frame in the same way as the existing code. We can't easily use
      sizeof() because the signal frame structure is not in a header. We
      will either fix that, or rip out all the custom stack expansion
      checking logic entirely.
      
      Fixes: ce48b210 ("powerpc: Add VSX context save/restore, ptrace and signal support")
      Cc: stable@vger.kernel.org # v2.6.27+
      Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
      Tested-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/pseries: Do not initiate shutdown when system is running on UPS · 13ad4324
      Vasant Hegde authored
      commit 90a9b102 upstream.
      
      As per PAPR we have to look at both the EPOW sensor value and the event
      modifier to identify the type of event and take appropriate action.
      
      LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes":
      
        SYSTEM_SHUTDOWN 3
      
        The system must be shut down. An EPOW-aware OS logs the EPOW error
        log information, then schedules the system to be shut down to begin
        after an OS defined delay interval (default is 10 minutes.)
      
      Then in section 10.3.2.2.8 there is table 146 "Platform Event Log
      Format, Version 6, EPOW Section", which includes the "EPOW Event
      Modifier":
      
        For EPOW sensor value = 3
        0x01 = Normal system shutdown with no additional delay
        0x02 = Loss of utility power, system is running on UPS/Battery
        0x03 = Loss of system critical functions, system should be shutdown
        0x04 = Ambient temperature too high
        All other values = reserved
      
      We have a user space tool (rtas_errd) on LPAR to monitor for
      EPOW_SHUTDOWN_ON_UPS. Once it gets such an event it initiates shutdown
      after a predefined time. It also starts monitoring for any new EPOW
      events. If it receives a "Power restored" event before the predefined
      time elapses it will cancel the shutdown. Otherwise, after the
      predefined time, it will shut down the system.
      
      Commit 79872e35 ("powerpc/pseries: All events of
      EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of
      the "on UPS/Battery" case, to immediately shut down the system. This
      breaks existing setups that rely on the userspace tool to delay
      shutdown and let the system run on the UPS.
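
The intended handling of the modifiers for EPOW sensor value 3 can be summarised in a small decision function. This is a sketch, not the kernel's handler: `kernel_shuts_down_immediately()` is a hypothetical name, the enum constants other than EPOW_SHUTDOWN_ON_UPS are illustrative labels for the table values quoted above, and treating reserved values as "no immediate shutdown" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Modifier values for EPOW sensor value 3, from the table quoted above. */
enum epow_shutdown_modifier {
	EPOW_SHUTDOWN_NORMAL = 0x01,
	EPOW_SHUTDOWN_ON_UPS = 0x02,
	EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS = 0x03,
	EPOW_SHUTDOWN_AMBIENT_TEMPERATURE_TOO_HIGH = 0x04,
};

/* Returns true when the kernel itself should start the shutdown at once. */
static bool kernel_shuts_down_immediately(int modifier)
{
	switch (modifier) {
	case EPOW_SHUTDOWN_NORMAL:
	case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS:
	case EPOW_SHUTDOWN_AMBIENT_TEMPERATURE_TOO_HIGH:
		return true;
	case EPOW_SHUTDOWN_ON_UPS:
		/* Left to the user space tool, which may cancel it later. */
		return false;
	default:
		/* Reserved values: assumed here not to trigger a shutdown. */
		return false;
	}
}

/* The regression case: running on UPS must not power off right away. */
static void check_ups_case(void)
{
	assert(!kernel_shuts_down_immediately(EPOW_SHUTDOWN_ON_UPS));
	assert(kernel_shuts_down_immediately(EPOW_SHUTDOWN_NORMAL));
}
```

Keeping the UPS case out of the immediate-shutdown path leaves room for rtas_errd to cancel the shutdown if a "Power restored" event arrives in time.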
      
      Fixes: 79872e35 ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown")
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
      [mpe: Massage change log and add PAPR references]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200820061844.306460-1-hegdevasant@linux.vnet.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>