1. 17 Jun, 2019 30 commits
    • x86/percpu, sched/fair: Avoid local_clock() · 8dc2d993
      Peter Zijlstra authored
      Nadav reported that code-gen changed because of the this_cpu_*()
      constraints. Avoid this for select_idle_cpu(), because that runs with
      preemption (and IRQs) disabled anyway.
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/percpu, x86/irq: Relax {set,get}_irq_regs() · 602447f9
      Peter Zijlstra authored
      Nadav reported that since the this_cpu_*() ops got asm-volatile
      constraints on, code generation suffered for do_IRQ(), but since this
      is all with IRQs disabled we can use __this_cpu_*().
      
        smp_x86_platform_ipi                                      234        222   -12,+0
        smp_kvm_posted_intr_ipi                                    74         66   -8,+0
        smp_kvm_posted_intr_wakeup_ipi                             86         78   -8,+0
        smp_apic_timer_interrupt                                  292        284   -8,+0
        smp_kvm_posted_intr_nested_ipi                             74         66   -8,+0
        do_IRQ                                                    195        187   -8,+0
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/percpu: Relax smp_processor_id() · 9ed7d75b
      Peter Zijlstra authored
      Nadav reported that since this_cpu_read() became asm-volatile, many
      smp_processor_id() users generated worse code due to the extra
      constraints.
      
      However since smp_processor_id() is reading a stable value, we can use
      __this_cpu_read().
      
      While this does reduce text size somewhat, this mostly results in code
      movement to .text.unlikely as a result of more/larger .cold.
      subfunctions. Less text on the hotpath is good for I$.
      
        $ ./compare.sh defconfig-build1 defconfig-build2 vmlinux.o
        setup_APIC_ibs                                             90         98   -12,+20
        force_ibs_eilvt_setup                                     400        413   -57,+70
        pci_serr_error                                            109        104   -54,+49
        pci_serr_error                                            109        104   -54,+49
        unknown_nmi_error                                         125        120   -76,+71
        unknown_nmi_error                                         125        120   -76,+71
        io_check_error                                            125        132   -97,+104
        intel_thermal_interrupt                                   730        822   +92,+0
        intel_init_thermal                                        951        945   -6,+0
        generic_get_mtrr                                          301        294   -7,+0
        generic_get_mtrr                                          301        294   -7,+0
        generic_set_all                                           749        754   -44,+49
        get_fixed_ranges                                          352        360   -41,+49
        x86_acpi_suspend_lowlevel                                 369        363   -6,+0
        check_tsc_sync_source                                     412        412   -71,+71
        irq_migrate_all_off_this_cpu                              662        674   -14,+26
        clocksource_watchdog                                      748        748   -113,+113
        __perf_event_account_interrupt                            204        197   -7,+0
        attempt_merge                                            1748       1741   -7,+0
        intel_guc_send_ct                                        1424       1409   -15,+0
        __fini_doorbell                                           235        231   -4,+0
        bdw_set_cdclk                                             928        923   -5,+0
        gen11_dsi_disable                                        1571       1556   -15,+0
        gmbus_wait                                                493        488   -5,+0
        md_make_request                                           376        369   -7,+0
        __split_and_process_bio                                   543        536   -7,+0
        delay_tsc                                                  96         89   -7,+0
        hsw_disable_pc8                                           696        691   -5,+0
        tsc_verify_tsc_adjust                                     215        228   -22,+35
        cpuidle_driver_unref                                       56         49   -7,+0
        blk_account_io_completion                                 159        148   -11,+0
        mtrr_wrmsr                                                 95         99   -29,+33
        __intel_wait_for_register_fw                              401        419   +18,+0
        cpuidle_driver_ref                                         43         36   -7,+0
        cpuidle_get_driver                                         15          8   -7,+0
        blk_account_io_done                                       535        528   -7,+0
        irq_migrate_all_off_this_cpu                              662        674   -14,+26
        check_tsc_sync_source                                     412        412   -71,+71
        irq_wait_for_poll                                         170        163   -7,+0
        generic_end_io_acct                                       329        322   -7,+0
        x86_acpi_suspend_lowlevel                                 369        363   -6,+0
        nohz_balance_enter_idle                                   198        191   -7,+0
        generic_start_io_acct                                     254        247   -7,+0
        blk_account_io_start                                      341        334   -7,+0
        perf_event_task_tick                                      682        675   -7,+0
        intel_init_thermal                                        951        945   -6,+0
        amd_e400_c1e_apic_setup                                    47         51   -28,+32
        setup_APIC_eilvt                                          350        328   -22,+0
        hsw_enable_pc8                                           1611       1605   -6,+0
                                                     total   12985947   12985892   -994,+939
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() · 0b9ccc0a
      Peter Zijlstra authored
      Nadav Amit reported that commit:
      
        b59167ac ("x86/percpu: Fix this_cpu_read()")
      
      added a bunch of constraints to all sorts of code; and while some of
      that was correct and desired, some of that seems superfluous.
      
      The thing is, the this_cpu_*() operations are defined IRQ-safe, this
      means the values are subject to change from IRQs, and thus must be
      reloaded.
      
      Also, the generic form:
      
        local_irq_save()
        __this_cpu_read()
        local_irq_restore()
      
      would not allow the re-use of previous values; if by nothing else,
      then the barrier()s implied by local_irq_*().
      
      Which raises the point that percpu_from_op() and the others also need
      that volatile.
      
      OTOH __this_cpu_*() operations are not IRQ-safe and assume external
      preempt/IRQ disabling and could thus be allowed more room for
      optimization.
      
      This makes the this_cpu_*() vs __this_cpu_*() behaviour more
      consistent with other architectures.
      
        $ ./compare.sh defconfig-build defconfig-build1 vmlinux.o
        x86_pmu_cancel_txn                                         80         71   -9,+0
        __text_poke                                               919        964   +45,+0
        do_user_addr_fault                                       1082       1058   -24,+0
        __do_page_fault                                          1194       1178   -16,+0
        do_exit                                                  2995       3027   -43,+75
        process_one_work                                         1008        989   -67,+48
        finish_task_switch                                        524        505   -19,+0
        __schedule_bug                                            103         98   -59,+54
        __schedule_bug                                            103         98   -59,+54
        __sched_setscheduler                                     2015       2030   +15,+0
        freeze_processes                                          203        230   +31,-4
        rcu_gp_kthread_wake                                       106         99   -7,+0
        rcu_core                                                 1841       1834   -7,+0
        call_timer_fn                                             298        286   -12,+0
        can_stop_idle_tick                                        146        139   -31,+24
        perf_pending_event                                        253        239   -14,+0
        shmem_alloc_page                                          209        213   +4,+0
        __alloc_pages_slowpath                                   3284       3269   -15,+0
        umount_tree                                               671        694   +23,+0
        advance_transaction                                       803        798   -5,+0
        con_put_char                                               71         51   -20,+0
        xhci_urb_enqueue                                         1302       1295   -7,+0
        xhci_urb_enqueue                                         1302       1295   -7,+0
        tcp_sacktag_write_queue                                  2130       2075   -55,+0
        tcp_try_undo_loss                                         229        208   -21,+0
        tcp_v4_inbound_md5_hash                                   438        411   -31,+4
        tcp_v4_inbound_md5_hash                                   438        411   -31,+4
        tcp_v6_inbound_md5_hash                                   469        411   -33,-25
        tcp_v6_inbound_md5_hash                                   469        411   -33,-25
        restricted_pointer                                        434        420   -14,+0
        irq_exit                                                  162        154   -8,+0
        get_perf_callchain                                        638        624   -14,+0
        rt_mutex_trylock                                          169        156   -13,+0
        avc_has_extended_perms                                   1092       1089   -3,+0
        avc_has_perm_noaudit                                      309        306   -3,+0
        __perf_sw_event                                           138        122   -16,+0
        perf_swevent_get_recursion_context                        116        102   -14,+0
        __local_bh_enable_ip                                       93         72   -21,+0
        xfrm_input                                               4175       4161   -14,+0
        avc_has_perm                                              446        443   -3,+0
        vm_events_fold_cpu                                         57         56   -1,+0
        vfree                                                      68         61   -7,+0
        freeze_processes                                          203        230   +31,-4
        _local_bh_enable                                           44         30   -14,+0
        ip_do_fragment                                           1982       1944   -38,+0
        do_exit                                                  2995       3027   -43,+75
        __do_softirq                                              742        724   -18,+0
        cpu_init                                                 1510       1489   -21,+0
        account_system_time                                        80         79   -1,+0
                                                     total   12985281   12984819   -742,+280
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20181206112433.GB13675@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Guard against making count negative · a15ea1a3
      Waiman Long authored
      The upper bits of the count field are used as the reader count. When
      a sufficient number of active readers are present, the most significant
      bit will be set and the count becomes negative. If the number of active
      readers keeps piling up, we may eventually overflow the reader count.
      This is not likely to happen unless the number of bits reserved for
      the reader count is reduced because those bits are needed for other purposes.
      
      To prevent this count overflow from happening, the most significant
      bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
      attempts will now fail for both the fast and slow paths whenever this
      bit is set. So all those extra readers will be put to sleep in the wait
      list. Wakeup will not happen until the reader count reaches 0.
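
      The guard-bit check can be illustrated with a minimal user-space
      sketch. It assumes a 64-bit count whose upper bits hold the reader
      count; the shift width, the SKETCH_* names and the helper function
      are hypothetical, and only the RWSEM_FLAG_READFAIL name comes from
      the text above. A failed attempt simply backs out here, whereas the
      text above says the extra readers are put to sleep in the wait list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 8 bits for flags, upper bits for the reader count. */
#define SKETCH_READER_SHIFT  8
#define SKETCH_READER_BIAS   ((uint64_t)1 << SKETCH_READER_SHIFT)
/* The most significant bit is the guard bit named in the commit message. */
#define RWSEM_FLAG_READFAIL  ((uint64_t)1 << 63)

struct sketch_rwsem {
	_Atomic uint64_t count;
};

/* Try to take a read lock; fail whenever the guard bit ends up set. */
static bool sketch_read_trylock(struct sketch_rwsem *sem)
{
	/* Value of the count after adding this reader (unsigned, so no UB). */
	uint64_t now = atomic_fetch_add(&sem->count, SKETCH_READER_BIAS)
		       + SKETCH_READER_BIAS;

	if (now & RWSEM_FLAG_READFAIL) {
		/* Guard bit set: undo the increment and report failure. */
		atomic_fetch_sub(&sem->count, SKETCH_READER_BIAS);
		return false;
	}
	return true;
}
```

      In this sketch the reader that would push the reader count into the
      most significant bit is itself rejected, so the count never grows
      past the guard bit.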
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Adaptive disabling of reader optimistic spinning · 5cfd92e1
      Waiman Long authored
      Reader optimistic spinning is helpful when the reader critical section
      is short and there aren't that many readers around. It makes readers
      relatively more preferred than writers. When a writer times out spinning
      on a reader-owned lock and sets the nonspinnable bits, there are two main
      reasons for that.
      
       1) The reader critical section is long, perhaps the task sleeps after
          acquiring the read lock.
       2) There are just too many readers contending the lock causing it to
          take a while to service all of them.
      
      In the former case, a long reader critical section will impede the progress
      of writers, which is usually more important for system performance.
      In the latter case, reader optimistic spinning tends to make the reader
      groups that contain readers that acquire the lock together smaller,
      leading to more of them. That may hurt performance in some cases. In
      other words, the setting of nonspinnable bits indicates that reader
      optimistic spinning may not be helpful for those workloads that cause it.
      
      Therefore, any writers that have observed the setting of the writer
      nonspinnable bit for a given rwsem after they fail to acquire the lock
      via optimistic spinning will set the reader nonspinnable bit once they
      acquire the write lock. Similarly, readers that observe the setting
      of reader nonspinnable bit at slowpath entry will also set the reader
      nonspinnable bit when they acquire the read lock via the wakeup path.
      
      Once the reader nonspinnable bit is on, it will only be reset when
      a writer is able to acquire the rwsem in the fast path or somehow a
      reader or writer in the slowpath doesn't observe the nonspinnable bit.

      This is to discourage reader optimistic spinning on that particular
      rwsem and make writers more preferred. This adaptive disabling of reader
      optimistic spinning will alleviate some of the negative side effects of
      this feature.
      
      In addition, this patch tries to make readers in the spinning queue
      follow the phase-fair principle after quitting optimistic spinning
      by checking if another reader has somehow acquired a read lock after
      this reader enters the optimistic spinning queue. If so and the rwsem
      is still reader-owned, this reader is in the right read-phase and can
      attempt to acquire the lock.
      
      On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
      the will-it-scale benchmark was run with various numbers of threads. The
      numbers of operations done before the reader optimistic spinning patches,
      before this patch, and after this patch were:
      
        Threads  Before rspin  Before patch  After patch    %change
        -------  ------------  ------------  -----------    -------
          20        5541068      5345484       5455667    -3.5%/ +2.1%
          40       10185150      7292313       9219276   -28.5%/+26.4%
          60        8196733      6460517       7181209   -21.2%/+11.2%
          80        9508864      6739559       8107025   -29.1%/+20.3%
      
      This patch doesn't recover all the lost performance, but it is more
      than half. Given the fact that reader optimistic spinning does benefit
      some workloads, this is a good compromise.
      
      Using the rwsem locking microbenchmark with very short critical section,
      this patch doesn't have too much impact on locking performance as shown
      by the locking rates (kops/s) below with equal numbers of readers and
      writers before and after this patch:
      
         # of Threads  Pre-patch    Post-patch
         ------------  ---------    ----------
              2          4,730        4,969
              4          4,814        4,786
              8          4,866        4,815
             16          4,715        4,511
             32          3,338        3,500
             64          3,212        3,389
             80          3,110        3,044
      
      When running the locking microbenchmark with 40 dedicated reader and writer
      threads, however, the reader performance is curtailed to favor the writer.
      
      Before patch:
      
        40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
        40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644
      
      After patch:
      
        40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
        40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Enable time-based spinning on reader-owned rwsem · 7d43f1ce
      Waiman Long authored
      When the rwsem is owned by reader, writers stop optimistic spinning
      simply because there is no easy way to figure out if all the readers
      are actively running or not. However, there are scenarios where
      the readers are unlikely to sleep and optimistic spinning can help
      performance.
      
      This patch provides a simple mechanism for spinning on a reader-owned
      rwsem by a writer. It is a time threshold based spinning where the
      allowable spinning time can vary from 10us to 25us depending on the
      condition of the rwsem.
      
      When the time threshold is exceeded, the nonspinnable bits will be set
      in the owner field to indicate that no more optimistic spinning will
      be allowed on this rwsem until it becomes writer owned again. Not even
      readers are allowed to acquire the reader-locked rwsem by optimistic
      spinning, for fairness.
      
      We also want a writer to acquire the lock after the readers hold the
      lock for a relatively long time. In order to give preference to writers
      under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
      into two - one for reader and one for writer. When optimistic spinning
      is disabled, both bits will be set. When the reader count drops down
      to 0, the writer nonspinnable bit will be cleared to allow writers to
      spin on the lock, but not the readers. When a writer acquires the lock,
      it will write its own task structure pointer into sem->owner and clear
      the reader nonspinnable bit in the process.
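
      The three transitions described above can be sketched as pure
      functions on a flags word. The bit positions, the SK_* macro names
      and the function names are hypothetical; only the behaviour (set
      both bits, clear the writer bit when the reader count reaches 0,
      clear the flags when a writer takes ownership) follows the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions for the two nonspinnable flags. */
#define SK_RD_NONSPINNABLE  (1u << 1)
#define SK_WR_NONSPINNABLE  (1u << 2)
#define SK_NONSPINNABLE     (SK_RD_NONSPINNABLE | SK_WR_NONSPINNABLE)

/* Optimistic spinning is disabled: both bits are set. */
static uint32_t sk_on_spin_timeout(uint32_t flags)
{
	return flags | SK_NONSPINNABLE;
}

/* Reader count dropped to 0: writers may spin again, readers may not. */
static uint32_t sk_on_last_reader_exit(uint32_t flags)
{
	return flags & ~SK_WR_NONSPINNABLE;
}

/* A writer acquired the lock: storing its task pointer drops the flags. */
static uint32_t sk_on_writer_acquire(uint32_t flags)
{
	return flags & ~SK_NONSPINNABLE;
}
```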
      
      The time taken for each iteration of the reader-owned rwsem spinning
      loop varies. Below are sample minimum elapsed times for 16 iterations
      of the loop.
      
            System                 Time for 16 Iterations
            ------                 ----------------------
        1-socket Skylake                  ~800ns
        4-socket Broadwell                ~300ns
        2-socket ThunderX2 (arm64)        ~250ns
      
      When the lock cacheline is contended, we can see up to almost 10X
      increase in elapsed time.  So 25us will be at most 500, 1300 and 1600
      iterations for each of the above systems.
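
      The iteration bounds follow from dividing the 25us budget by the
      per-iteration time taken from the table above. A small sketch of
      that arithmetic (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the number of loop iterations that fit in budget_ns,
 * given the measured time for 16 iterations of the spinning loop.
 */
static uint64_t max_spin_iterations(uint64_t budget_ns, uint64_t ns_per_16_iters)
{
	assert(ns_per_16_iters > 0);
	return budget_ns * 16 / ns_per_16_iters;
}
```

      With a 25000ns budget this gives 500 iterations at ~800ns, 1333 at
      ~300ns (rounded to 1300 above) and 1600 at ~250ns.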
      
      With a locking microbenchmark running on a 5.1 based kernel, the total
      locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
      equal numbers of readers and writers before and after this patch were
      as follows:
      
         # of Threads  Pre-patch    Post-patch
         ------------  ---------    ----------
              2          1,759        6,684
              4          1,684        6,738
              8          1,074        7,222
             16            900        7,163
             32            458        7,316
             64            208          520
            128            168          425
            240            143          474
      
      This patch gives a big boost in performance for mixed reader/writer
      workloads.
      
      With 32 locking threads, the rwsem lock event data were:
      
      rwsem_opt_fail=79850
      rwsem_opt_nospin=5069
      rwsem_opt_rlock=597484
      rwsem_opt_wlock=957339
      rwsem_sleep_reader=57782
      rwsem_sleep_writer=55663
      
      With 64 locking threads, the data looked like:
      
      rwsem_opt_fail=346723
      rwsem_opt_nospin=6293
      rwsem_opt_rlock=1127119
      rwsem_opt_wlock=1400628
      rwsem_sleep_reader=308201
      rwsem_sleep_writer=72281
      
      So a lot more threads acquired the lock in the slowpath and more threads
      went to sleep.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Make rwsem->owner an atomic_long_t · 94a9717b
      Waiman Long authored
      The rwsem->owner contains not just the task structure pointer, it also
      holds some flags for storing the current state of the rwsem. Some of
      the flags may have to be atomically updated. To reflect the new reality,
      the owner is now changed to an atomic_long_t type.
      
      New helper functions are added to properly separate out the task
      structure pointer and the embedded flags.
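
      As a rough user-space illustration of keeping a task pointer and
      flag bits in one atomic word, the sketch below uses C11
      atomic_uintptr_t in place of the kernel's atomic_long_t. The
      three-bit flag mask, the struct names and all helper names are
      hypothetical and are not the helpers this patch adds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical: the low 3 bits of the owner word carry flags. */
#define SKETCH_OWNER_FLAGS_MASK  ((uintptr_t)0x7)

/* Stand-in for the task structure; 8-byte alignment keeps the low bits free. */
struct sketch_task {
	_Alignas(8) int pid;
};

struct sketch_rwsem {
	atomic_uintptr_t owner;
};

/* Store a new owner pointer; this also clears any flag bits. */
static void sketch_set_owner(struct sketch_rwsem *sem, struct sketch_task *task)
{
	atomic_store(&sem->owner, (uintptr_t)task);
}

/* Atomically set flag bits without disturbing the pointer part. */
static void sketch_owner_set_flags(struct sketch_rwsem *sem, uintptr_t flags)
{
	atomic_fetch_or(&sem->owner, flags & SKETCH_OWNER_FLAGS_MASK);
}

/* Extract the task pointer by masking off the flag bits. */
static struct sketch_task *sketch_owner_task(struct sketch_rwsem *sem)
{
	return (struct sketch_task *)
		(atomic_load(&sem->owner) & ~SKETCH_OWNER_FLAGS_MASK);
}

/* Extract only the flag bits. */
static uintptr_t sketch_owner_flags(struct sketch_rwsem *sem)
{
	return atomic_load(&sem->owner) & SKETCH_OWNER_FLAGS_MASK;
}
```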
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Enable readers spinning on writer · cf69482d
      Waiman Long authored
      This patch enables readers to optimistically spin on a
      rwsem when it is owned by a writer instead of going to sleep
      directly.  The rwsem_can_spin_on_owner() function is extracted
      out of rwsem_optimistic_spin() and is called directly by
      rwsem_down_read_slowpath() and rwsem_down_write_slowpath().
      
      With a locking microbenchmark running on a 5.1 based kernel, the total
      locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
      numbers of readers and writers before and after the patch were as
      follows:
      
         # of Threads  Pre-patch    Post-patch
         ------------  ---------    ----------
              4          1,674        1,684
              8          1,062        1,074
             16            924          900
             32            300          458
             64            195          208
            128            164          168
            240            149          143
      
      The performance change wasn't significant in this case, but this change
      is required by a follow-on patch.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Clarify usage of owner's nonspinaable bit · 02f1082b
      Waiman Long authored
      Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
      anonymous owner - readers or an anonymous writer. The setting of this
      anonymous bit is used as an indicator that optimistic spinning cannot
      be done on this rwsem.
      
      With the upcoming reader optimistic spinning patches, a reader-owned
      rwsem can be spun on for a limited period of time. We still need
      this bit to indicate that a rwsem is nonspinnable, but a clear bit
      no longer means that the owner is known. So rename the bit
      to RWSEM_NONSPINNABLE to clarify its meaning.
      
      This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Wake up almost all readers in wait queue · d3681e26
      Waiman Long authored
      When the front of the wait queue is a reader, other readers
      immediately following the first reader will also be woken up at the
      same time. However, if there is a writer in between, those readers
      behind the writer will not be woken up.
      
      Because of optimistic spinning, the lock acquisition order is not FIFO
      anyway. The lock handoff mechanism will ensure that lock starvation
      will not happen.
      
      Assuming that the lock hold times of the other readers still in the
      queue will be about the same as the readers that are being woken up,
      there is really not much additional cost other than the additional
      latency due to the wakeup of additional tasks by the waker. Therefore
      all the readers up to a maximum of 256 in the queue are woken up when
      the first waiter is a reader to improve reader throughput. This is
      somewhat similar in concept to a phase-fair R/W lock.
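
      The wakeup policy described above can be sketched over a simple
      array standing in for the wait queue. The limit of 256 comes from
      the text; the macro name, the waiter struct and the function name
      are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limit stated in the commit message; the macro name is an assumption. */
#define SKETCH_MAX_READERS_WAKEUP  256

enum sketch_waiter_type { SKETCH_READER, SKETCH_WRITER };

struct sketch_waiter {
	enum sketch_waiter_type type;
	bool woken;
};

/*
 * If the first waiter is a reader, mark every reader in the queue as
 * woken (skipping over writers) until the limit is hit. Returns the
 * number of readers woken; returns 0 when the queue is empty or the
 * first waiter is a writer.
 */
static size_t sketch_wake_readers(struct sketch_waiter *queue, size_t n)
{
	size_t woken = 0;

	if (n == 0 || queue[0].type != SKETCH_READER)
		return 0;

	for (size_t i = 0; i < n && woken < SKETCH_MAX_READERS_WAKEUP; i++) {
		if (queue[i].type == SKETCH_WRITER)
			continue;	/* writers stay queued */
		queue[i].woken = true;
		woken++;
	}
	return woken;
}
```

      For a queue of reader, reader, writer, reader this wakes all three
      readers and leaves the writer waiting, where the previous behaviour
      would have stopped at the writer.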
      
      With a locking microbenchmark running on a 5.1 based kernel, the total
      locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
      equal numbers of readers and writers before and after this patch were
      as follows:
      
         # of Threads  Pre-Patch   Post-patch
         ------------  ---------   ----------
              4          1,641        1,674
              8            731        1,062
             16            564          924
             32             78          300
             64             38          195
            240             50          149
      
      There is no performance gain at low contention level. At high contention
      level, however, this patch gives a pretty decent performance boost.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d3681e26
    • locking/rwsem: More optimal RT task handling of null owner · 990fa738
      Waiman Long authored
      An RT task can do optimistic spinning only if the lock holder is
      actually running. If the state of the lock holder isn't known, there
      is a possibility that high priority of the RT task may block forward
      progress of the lock holder if it happens to reside on the same CPU.
      This will lead to deadlock. So we have to make sure that an RT task
      will not spin on a reader-owned rwsem.
      
      When the owner is temporarily set to NULL, there are two cases
      where we may want to continue spinning:
      
       1) The lock owner is in the process of releasing the lock, sem->owner
          is cleared but the lock has not been released yet.
      
       2) The lock was free and owner cleared, but another task came in
          and acquired the lock before we tried to get it. The new owner may
          be a spinnable writer.
      
      So an RT task is now made to retry one more time to see if it can
      acquire the lock or continue spinning on the new owning writer.
      
      When testing on an 8-socket IvyBridge-EX system, the one additional retry
      seems to improve locking performance of RT write locking threads under
      heavy contention. The table below shows the locking rates (in kops/s)
      with various numbers of write locking threads before and after the patch.
      
          Locking threads     Pre-patch     Post-patch
          ---------------     ---------     -----------
                  4             2,753          2,608
                  8             2,529          2,520
                 16             1,727          1,918
                 32             1,263          1,956
                 64               889          1,343
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      990fa738
    • locking/rwsem: Always release wait_lock before waking up tasks · 00f3c5a3
      Waiman Long authored
      With the use of wake_q, we can do task wakeups without holding the
      wait_lock. There is one exception in the rwsem code, though. It is
      when the writer in the slowpath detects that there are waiters ahead
      but the rwsem is not held by a writer. This can lead to a long wait_lock
      hold time especially when a large number of readers are to be woken up.
      
      Remedy this situation by releasing the wait_lock before waking
      up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
      function is also modified to read the rwsem count directly to avoid
      using a stale count value.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      00f3c5a3
    • locking/rwsem: Implement lock handoff to prevent lock starvation · 4f23dbc1
      Waiman Long authored
      Because of writer lock stealing, it is possible that a constant
      stream of incoming writers will cause a waiting writer or reader to
      wait indefinitely leading to lock starvation.
      
      This patch implements a lock handoff mechanism to disable lock stealing
      and force lock handoff to the first waiter or waiters (for readers)
      in the queue after at least a 4ms waiting period, unless it is an RT
      writer task, which doesn't need to wait. The waiting period keeps
      lock stealing from being discouraged so much that performance suffers.
      
      The setting and clearing of the handoff bit is serialized by the
      wait_lock. So racing is not possible.
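
      The handoff condition can be sketched as a small predicate. This is a
      simplified model, not the patch itself: struct handoff_waiter, its
      fields and should_request_handoff() are hypothetical names, and time is
      passed in explicitly as nanoseconds. Only the 4ms period and the RT
      writer exception come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

#define HANDOFF_WAIT_NS (4ULL * 1000 * 1000)	/* 4ms, per the commit text */

/* Hypothetical model of the first waiter in the queue. */
struct handoff_waiter {
	uint64_t enqueue_ns;	/* time at which the waiter was queued */
	bool is_rt_writer;	/* RT writers do not need to wait */
};

/* Should the first waiter ask for lock stealing to be disabled? */
static bool should_request_handoff(const struct handoff_waiter *w,
				   uint64_t now_ns)
{
	if (w->is_rt_writer)
		return true;
	return now_ns - w->enqueue_ns >= HANDOFF_WAIT_NS;
}
```

      In the patch the handoff bit itself is set and cleared under the
      wait_lock; this sketch leaves that serialization out.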
      
      A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
      80-thread Skylake system with a v5.1 based kernel and 240 write_lock
      threads with 5us sleep critical section.
      
      Before the patch, the min/mean/max numbers of locking operations for
      the locking threads were 1/7,792/173,696. After the patch, the figures
      became 5,842/6,542/7,458.  It can be seen that the rwsem became much
      more fair, though there was a drop of about 16% in the mean locking
      operations done which was a tradeoff of having better fairness.
      
      Making the waiter set the handoff bit right after the first wakeup can
      impact performance especially with a mixed reader/writer workload. With
      the same microbenchmark with short critical section and equal number of
      reader and writer threads (40/40), the reader/writer locking operation
      counts with the current patch were:
      
        40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
        40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081
      
      With the waiter setting the handoff bit immediately after wakeup, they were:
      
        40 readers, Iterations Min/Mean/Max = 43/44/46
        40 writers, Iterations Min/Mean/Max = 43/1,263/3,191
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-8-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4f23dbc1
    • locking/rwsem: Make rwsem_spin_on_owner() return owner state · 3f6d517a
      Waiman Long authored
      This patch modifies rwsem_spin_on_owner() to return four possible
      values to better reflect the state of lock holder which enables us to
      make a better decision of what to do next.
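
      A four-valued return could look like the sketch below. The commit
      message does not name the values, so the enumerator names, the bit-flag
      encoding and the keep_spinning() policy are all assumptions made for
      illustration.

```c
#include <stdbool.h>

/* Assumed names for the four states; encoded as flags so they can be masked. */
enum owner_state {
	OWNER_NULL		= 1 << 0,	/* no owner recorded */
	OWNER_WRITER		= 1 << 1,	/* owned by a writer */
	OWNER_READER		= 1 << 2,	/* owned by readers */
	OWNER_NONSPINNABLE	= 1 << 3,	/* spinning must stop */
};

/* Assumed policy: spin only on a free lock or a writer-owned one. */
#define OWNER_SPINNABLE	(OWNER_NULL | OWNER_WRITER)

static bool keep_spinning(enum owner_state state)
{
	return (state & OWNER_SPINNABLE) != 0;
}
```

      Returning a state rather than a plain boolean lets the caller tell
      "owner is a reader" apart from "stop spinning entirely".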
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-7-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3f6d517a
    • locking/rwsem: Code cleanup after files merging · 6cef7ff6
      Waiman Long authored
      After merging all the relevant rwsem code into one single file, there
      are a number of optimizations and cleanups that can be done:
      
       1) Remove all the EXPORT_SYMBOL() calls for functions that are not
          accessed elsewhere.
       2) Remove all the __visible tags as none of the functions will be
          called from assembly code anymore.
       3) Make all the internal functions static.
       4) Remove some unneeded blank lines.
       5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
          and rename __rwsem_down_{read|write}_failed_common() to
          rwsem_down_{read|write}_slowpath().
       6) Remove "__" prefix of __rwsem_mark_wake().
       7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
       8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
          are not that useful.
      
      That enables the compiler to do better optimization and reduce code
      size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
      reduced from 10237 bytes to 5030 bytes with this change.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-6-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6cef7ff6
    • locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c · 5dec94d4
      Waiman Long authored
      Now we only have one implementation of rwsem. Even though we still use
      xadd to handle reader locking, we use cmpxchg for writer instead. So
      the filename rwsem-xadd.c is not strictly correct. Also, no one outside
      of the rwsem code needs to know the internal implementation other than
      function prototypes for two internal functions that are called directly
      from percpu-rwsem.c.
      
      So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
      the following order:
      
        <upper part of rwsem.h>
        <rwsem-xadd.c>
        <lower part of rwsem.h>
        <rwsem.c>
      
      The rwsem.h file now contains only 2 function declarations for
      __up_read() and __down_read().
      
      This is a code relocation patch with no code change at all except
      making __up_read() and __down_read() non-static functions so they
      can be used by percpu-rwsem.c.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-5-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5dec94d4
    • locking/rwsem: Implement a new locking scheme · 64489e78
      Waiman Long authored
      The current way of using various reader, writer and waiting biases
      in the rwsem code is confusing and hard to understand. I have to
      reread the rwsem count guide in the rwsem-xadd.c file from time to
      time to remind myself how this whole thing works. It also makes the
      rwsem code harder to optimize.
      
      To make rwsem more sane, a new locking scheme similar to the one in
      qrwlock is now being used.  The atomic long count has the following
      bit definitions:
      
        Bit  0   - writer locked bit
        Bit  1   - waiters present bit
        Bits 2-7 - reserved for future extension
        Bits 8-X - reader count (24/56 bits)
      
      The cmpxchg instruction is now used to acquire the write lock. The read
      lock is still acquired with xadd instruction, so there is no change here.
      This scheme will allow up to 16M/64P active readers which should be
      more than enough. We can always use some more reserved bits if necessary.
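
      The bit layout and the two acquisition paths can be modelled in
      userspace with C11 atomics. This is a minimal sketch, not the kernel
      code: the macro and function names are illustrative, the read path
      only checks the writer bit, and the waiters bit and slowpaths are left
      out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_LOCKED	(1L << 0)		/* bit 0: writer locked */
#define FLAG_WAITERS	(1L << 1)		/* bit 1: waiters present */
#define READER_SHIFT	8			/* bits 8 and up: reader count */
#define READER_BIAS	(1L << READER_SHIFT)	/* one active reader */

/* Write lock: a compare-and-exchange that only succeeds on a free lock. */
static bool try_write_lock(atomic_long *count)
{
	long expected = 0;

	return atomic_compare_exchange_strong_explicit(count, &expected,
						       WRITER_LOCKED,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Read lock: fetch-and-add one reader, back out if a writer holds the lock. */
static bool try_read_lock(atomic_long *count)
{
	long old = atomic_fetch_add_explicit(count, READER_BIAS,
					     memory_order_acquire);

	if (old & WRITER_LOCKED) {
		atomic_fetch_sub_explicit(count, READER_BIAS,
					  memory_order_relaxed);
		return false;
	}
	return true;
}

/* Extract the reader count from a snapshot of the count word. */
static long reader_count(long count)
{
	return count >> READER_SHIFT;
}
```

      The back-out step in try_read_lock() is the reason the count alone
      cannot prove reader ownership: the reader bits can be set briefly by
      a reader that is about to give up.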
      
      With that change, we can deterministically know if a rwsem has been
      write-locked. Looking at the count alone, however, one cannot determine
      for certain if a rwsem is owned by readers or not as the readers that
      set the reader count bits may be in the process of backing out. So we
      still need the reader-owned bit in the owner field to be sure.
      
      With a locking microbenchmark running on a 5.1-based kernel, the total
      locking rates (in kops/s) of the benchmark on an 8-socket 120-core
      IvyBridge-EX system before and after the patch were as follows:
      
                        Before Patch      After Patch
         # of Threads  wlock    rlock    wlock    rlock
         ------------  -----    -----    -----    -----
              1        30,659   31,341   31,055   31,283
              2         8,909   16,457    9,884   17,659
              4         9,028   15,823    8,933   20,233
              8         8,410   14,212    7,230   17,140
             16         8,217   25,240    7,479   24,607
      
      The locking rates of the benchmark on a Power8 system were as follows:
      
                        Before Patch      After Patch
         # of Threads  wlock    rlock    wlock    rlock
         ------------  -----    -----    -----    -----
              1        12,963   13,647   13,275   13,601
              2         7,570   11,569    7,902   10,829
              4         5,232    5,516    5,466    5,435
              8         5,233    3,386    5,467    3,168
      
      The locking rates of the benchmark on a 2-socket ARM64 system were
      as follows:
      
                        Before Patch      After Patch
         # of Threads  wlock    rlock    wlock    rlock
         ------------  -----    -----    -----    -----
              1        21,495   21,046   21,524   21,074
              2         5,293   10,502    5,333   10,504
              4         5,325   11,463    5,358   11,631
              8         5,391   11,712    5,470   11,680
      
      The performance is roughly the same before and after the patch. There
      are run-to-run variations in performance. Runs with higher variances
      usually have higher throughput.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-4-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64489e78
    • locking/rwsem: Remove rwsem_wake() wakeup optimization · 5c1ec49b
      Waiman Long authored
      After the following commit:
      
        59aabfc7 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()")
      
      the rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly
      acquired and an optimistic spinning locker is present.  This can help performance
      by avoiding spinning on the wait_lock when it is contended.
      
      With the later commit:
      
        133e89ef ("locking/rwsem: Enable lockless waiter wakeup(s)")
      
      the performance advantage of the above optimization diminishes as the average
      wait_lock hold time becomes much shorter.
      
      With a later patch that supports rwsem lock handoff, we can no
      longer rely on the fact that the presence of an optimistic spinning
      locker will ensure that the lock will be acquired by a task soon and
      rwsem_wake() will be called later on to wake up waiters. This can lead
      to a missed wakeup and an application hang.
      
      So the original 59aabfc7 commit has to be reverted.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-3-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5c1ec49b
    • locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER · c71fd893
      Waiman Long authored
      The owner field in the rw_semaphore structure is used primarily for
      optimistic spinning. However, identifying the rwsem owner can also be
      helpful in debugging as well as tracing locking related issues when
      analyzing a crash dump. The owner field may also store state information
      that can be important to the operation of the rwsem.
      
      So the owner field is now made a permanent member of the rw_semaphore
      structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Link: https://lkml.kernel.org/r/20190520205918.22251-2-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c71fd893
    • x86/atomic: Fix smp_mb__{before,after}_atomic() · 69d927bb
      Peter Zijlstra authored
      Recent probing at the Linux Kernel Memory Model uncovered a
      'surprise'. Strongly ordered architectures where the atomic RmW
      primitive implies full memory ordering and
      smp_mb__{before,after}_atomic() are a simple barrier() (such as x86)
      fail for:
      
      	*x = 1;
      	atomic_inc(u);
      	smp_mb__after_atomic();
      	r0 = *y;
      
      Because, while the atomic_inc() implies memory order, it
      (surprisingly) does not provide a compiler barrier. This then allows
      the compiler to re-order like so:
      
      	atomic_inc(u);
      	*x = 1;
      	smp_mb__after_atomic();
      	r0 = *y;
      
      Which the CPU is then allowed to re-order (under TSO rules) like:
      
      	atomic_inc(u);
      	r0 = *y;
      	*x = 1;
      
      And this very much was not intended. Therefore strengthen the atomic
      RmW ops to include a compiler barrier.
      
      NOTE: atomic_{or,and,xor} and the bitops already had the compiler
      barrier.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      69d927bb
    • locking/lockdep: Remove unnecessary DEBUG_LOCKS_WARN_ON() · dd471efe
      Kobe Wu authored
      DEBUG_LOCKS_WARN_ON() will turn off debug_locks and
      make print_unlock_imbalance_bug() return directly.
      
      Remove a redundant whitespace.
      Signed-off-by: Kobe Wu <kobe-cp.wu@mediatek.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <linux-mediatek@lists.infradead.org>
      Cc: <wsd_upstream@mediatek.com>
      Cc: Eason Lin <eason-yh.lin@mediatek.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/1559217575-30298-1-git-send-email-kobe-cp.wu@mediatek.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dd471efe
    • locking/lockdep: Rename lockdep_assert_held_exclusive() -> lockdep_assert_held_write() · 9ffbe8ac
      Nikolay Borisov authored
      All callers of lockdep_assert_held_exclusive() use it to verify the
      correct locking state of either a semaphore (ldisc_sem in tty,
      mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by
      apparmor. Thus it makes sense to rename _exclusive to _write since
      those are the semantics callers care about. Additionally, there is already
      lockdep_assert_held_read(), which this new naming is more consistent with.
      
      No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190531100651.3969-1-nborisov@suse.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9ffbe8ac
    • x86/jump_label: Batch jump label updates · ba54f0c3
      Daniel Bristot de Oliveira authored
      Currently, the jump label of a static key is transformed via the arch
      specific function:
      
          void arch_jump_label_transform(struct jump_entry *entry,
                                         enum jump_label_type type)
      
      The new approach (batch mode) uses two arch functions. The first has the
      same arguments as arch_jump_label_transform(), and is the function:
      
          bool arch_jump_label_transform_queue(struct jump_entry *entry,
                                               enum jump_label_type type)
      
      Rather than transforming the code, it adds the jump_entry to a queue of
      entries to be updated. This function returns true in the case of a
      successful enqueue of an entry. If it returns false (for instance,
      because the queue is full), the caller must apply the queue and then
      try to queue the entry again.
      
      This function expects the caller to sort the entries by address before
      enqueuing them. This is already done by the arch independent code, though.
      
      After queuing all jump_entries, the function:
      
          void arch_jump_label_transform_apply(void)
      
      Applies the changes in the queue.
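
      The queue/apply contract can be sketched with a toy model in plain C.
      Nothing here is the x86 implementation: struct entry, QUEUE_CAPACITY,
      the counters and the function names are hypothetical stand-ins, and
      "applying" only counts entries instead of patching text.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4	/* hypothetical; small so the retry path is hit */

struct entry {
	unsigned long addr;
};

static struct entry queue[QUEUE_CAPACITY];
static size_t queue_len;
static size_t applied_total;	/* entries applied so far */
static size_t apply_calls;	/* number of non-empty flushes */

/* Stand-in for arch_jump_label_transform_queue(): false means "queue full". */
static bool transform_queue(const struct entry *e)
{
	if (queue_len == QUEUE_CAPACITY)
		return false;
	queue[queue_len++] = *e;
	return true;
}

/* Stand-in for arch_jump_label_transform_apply(): flush the queued entries. */
static void transform_apply(void)
{
	if (queue_len == 0)
		return;
	applied_total += queue_len;
	apply_calls++;
	queue_len = 0;
}

/* Caller side; the entries are assumed to be sorted by address already. */
static void update_all(const struct entry *entries, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!transform_queue(&entries[i])) {
			transform_apply();		/* make room */
			transform_queue(&entries[i]);	/* retry, now succeeds */
		}
	}
	transform_apply();	/* flush whatever is left */
}
```

      With a capacity of 4, updating 10 entries in this model takes three
      flushes rather than ten individual transforms.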
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/57b4caa654bad7e3b066301c9a9ae233dea065b5.1560325897.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ba54f0c3
    • jump_label: Batch updates if arch supports it · c2ba8a15
      Daniel Bristot de Oliveira authored
      If the architecture supports the batching of jump label updates, use it!
      
      An easy way to see the benefits of this patch is switching the
      schedstats on and off. For instance:
      
      -------------------------- %< ----------------------------
        #!/bin/sh
        while [ true ]; do
            sysctl -w kernel.sched_schedstats=1
            sleep 2
            sysctl -w kernel.sched_schedstats=0
            sleep 2
        done
      -------------------------- >% ----------------------------
      
      while watching the IPI count:
      
      -------------------------- %< ----------------------------
        # watch -n1 "cat /proc/interrupts | grep Function"
      -------------------------- >% ----------------------------
      
      With the current mode, it is possible to see around 168 IPIs every 2 seconds,
      while with this patch the number of IPIs goes down to 3 every 2 seconds.
      
      Regarding the performance impact of this patch set, I made two measurements:
      
          The time to update a key (the task that is causing the change)
          The time to run the int3 handler (the side effect on a thread that
                                            hits the code being changed)
      
      The schedstats static key was chosen as the key to be switched on and off.
      The reason is that it is used in 56 places, in a hot path. The
      change in the schedstats static key will be done with the following command:
      
      while [ true ]; do
          sysctl -w kernel.sched_schedstats=1
          usleep 500000
          sysctl -w kernel.sched_schedstats=0
          usleep 500000
      done
      
      In this way, the key will be updated twice per second. To force the hit of the
      int3 handler, the system will also run a kernel compilation with two jobs per
      CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
      @2.27GHz.
      
      Regarding the update part, on average, the regular kernel takes 57 ms to update
      the schedstats key, while the kernel with the batch updates takes just 1.4 ms
      on average. Although it seems to be too good to be true, it makes sense: the
      schedstats key is used in 56 places, so it was expected that it would take
      around 56 times longer to update the key with the current implementation,
      as the IPIs are the most expensive part of the update.
      
      Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
      the batch version takes around 180 ns. At first glance, it seems to be a high
      value. But it is not, considering that it is doing 56 updates, rather than one!
      It takes only four times longer. This gain is possible because the patch
      uses a binary search in the vector: log2(56)=5.8. So, an overhead of
      about four times was to be expected.
      
      (voice of tv propaganda) But, that is not all! As the int3 handler keeps on for
      a shorter period (because the update part is on for a shorter time), the number
      of hits in the int3 handler decreased by 10%.
      
      The question then is: Is it worth paying the price of "135 ns" more in the int3
      handler?
      
      Considering that, in this test case, we are saving the handling of 53 IPIs,
      which takes more than these 135 ns, it seems to be a meager price to pay.
      Moreover, the test case was forcing the hit of the int3; in practice, it
      does not happen that often. The IPI, however, takes place on all CPUs,
      whether they hit the int3 handler or not!
      
      For instance, on an isolated CPU with a process running in user-space
      (nohz_full use-case), the chance of hitting the int3 handler is close to zero,
      while there is no way to avoid the IPIs. By bounding the IPIs, we improve
      this scenario a lot.
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/acc891dbc2dbc9fd616dd680529a2337b1d1274c.1560325897.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c2ba8a15
    • x86/alternative: Batch of patch operations · c0213b0a
      Daniel Bristot de Oliveira authored
      Currently, the patch of an address is done in three steps:
      
      -- Pseudo-code #1 - Current implementation ---
      
              1) add an int3 trap to the address that will be patched
                  sync cores (send IPI to all other CPUs)
              2) update all but the first byte of the patched range
                  sync cores (send IPI to all other CPUs)
              3) replace the first byte (int3) by the first byte of replacing opcode
                  sync cores (send IPI to all other CPUs)
      
      -- Pseudo-code #1 ---
      
      When a static key has more than one entry, these steps are called once for
      each entry. The number of IPIs then is linear with regard to the number 'n' of
      entries of a key: O(n*3), which is O(n).
      
      This algorithm works fine for the update of a single key. But we think
      it is possible to optimize the case in which a static key has more than
      one entry. For instance, the sched_schedstats jump label has 56 entries
      in my (updated) Fedora kernel, resulting in 168 IPIs for each CPU in
      which the thread that is enabling the key is _not_ running.
      
      With this patch, rather than receiving a single patch to be processed, a vector
      of patches is passed, enabling the rewrite of the pseudo-code #1 in this
      way:
      
      -- Pseudo-code #2 - This patch  ---
      1)  for each patch in the vector:
              add an int3 trap to the address that will be patched
      
          sync cores (send IPI to all other CPUs)
      
      2)  for each patch in the vector:
              update all but the first byte of the patched range
      
          sync cores (send IPI to all other CPUs)
      
      3)  for each patch in the vector:
              replace the first byte (int3) by the first byte of replacing opcode
      
          sync cores (send IPI to all other CPUs)
      -- Pseudo-code #2 - This patch  ---
      
      Doing the update in this way, the number of IPIs becomes O(3) with regard
      to the number of entries of a key, which is O(1).
      
      The batch mode is implemented by the function text_poke_bp_batch(), which
      takes two arguments: a vector of "struct text_to_poke", and the number of
      entries in the vector.
      
      The vector must be sorted by the addr field of the text_to_poke structure,
      enabling the binary search of a handler in the poke_int3_handler function
      (a fast path).
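
      The difference between the two pseudo-codes can be checked with a small
      userspace simulation. This is only a sketch in Python, not kernel code:
      the function names syncs_per_entry() and syncs_batched() are made up for
      illustration, and a "sync" stands for one round of IPIs to all other CPUs.

```python
def syncs_per_entry(num_entries: int) -> int:
    """Pseudo-code #1: every entry is patched on its own, with three syncs."""
    syncs = 0
    for _ in range(num_entries):
        syncs += 1  # after adding the int3 trap
        syncs += 1  # after updating all but the first byte
        syncs += 1  # after replacing the int3 with the final first byte
    return syncs


def syncs_batched(num_entries: int) -> int:
    """Pseudo-code #2: each step covers the whole vector, then syncs once."""
    if num_entries == 0:
        return 0  # nothing to patch, nothing to sync
    syncs = 0
    for _step in ("add int3", "update tail bytes", "replace first byte"):
        for _ in range(num_entries):
            pass  # apply this step to every patch in the vector
        syncs += 1  # a single sync for the whole vector
    return syncs


if __name__ == "__main__":
    entries = 56  # sched_schedstats entry count quoted above
    print(syncs_per_entry(entries), syncs_batched(entries))
```

      With the 56 entries mentioned above, the per-entry scheme needs 168 syncs
      while the batched scheme needs 3, independent of the vector length.
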
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/ca506ed52584c80f64de23f6f55ca288e5d079de.1560325897.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c0213b0a
    • Daniel Bristot de Oliveira's avatar
      jump_label: Sort entries of the same key by the code · 0f133021
      Daniel Bristot de Oliveira authored
      In the batching mode, all the entries of a given key are updated at once.
      During the update of a key, a hit in the int3 handler will check if the
      hitting code address belongs to one of the entries being updated.
      
      To optimize the search of a given code address in the vector of entries
      being updated, a binary search is used. The binary search relies on the
      entries of a key being ordered by their code address, so sort the entries
      of a given key by the code.
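
      Why the ordering matters can be sketched in userspace Python. This is an
      illustration only, not the kernel implementation: sort_entries() and
      is_being_patched() are hypothetical names, the addresses are invented,
      and matching on the exact address is a simplification.

```python
from bisect import bisect_left


def sort_entries(addresses):
    """Return the code addresses of one key in ascending order."""
    return sorted(addresses)


def is_being_patched(sorted_addrs, hit_addr):
    """Binary search: is hit_addr one of the addresses in the batch?

    Only correct if sorted_addrs is in ascending order.
    """
    i = bisect_left(sorted_addrs, hit_addr)
    return i < len(sorted_addrs) and sorted_addrs[i] == hit_addr


if __name__ == "__main__":
    batch = sort_entries([0x4010, 0x1000, 0x2A00, 0x1F80])  # made-up addresses
    print(is_being_patched(batch, 0x2A00), is_being_patched(batch, 0x3000))
```

      The lookup costs O(log n) comparisons per trap instead of a linear walk
      over all entries, which is what makes it suitable for a fast path.
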
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/f57ae83e0592418ba269866bb7ade570fc8632e0.1560325897.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0f133021
    • Daniel Bristot de Oliveira's avatar
      x86/jump_label: Add a __jump_label_set_jump_code() helper · 4cc6620b
      Daniel Bristot de Oliveira authored
      Move the definition of the code to be written from
      __jump_label_transform() to a specialized function. No functional
      change.
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/d2f52a0010ecd399cf9b02a65bcf5836571b9e52.1560325897.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4cc6620b
    • Daniel Bristot de Oliveira's avatar
      jump_label: Add a jump_label_can_update() helper · e1aacb3f
      Daniel Bristot de Oliveira authored
      Move the check if a jump_entry is valid to a function. No functional
      change.
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/56b69bd3f8e644ed64f2dbde7c088030b8cbe76b.1560325897.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e1aacb3f
    • Ingo Molnar's avatar
      410df0c5
  2. 16 Jun, 2019 4 commits
    • Linus Torvalds's avatar
      Linux 5.2-rc5 · 9e0babf2
      Linus Torvalds authored
      9e0babf2
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 963172d9
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "The accumulated fixes from this and last week:
      
         - Fix vmalloc TLB flush and map range calculations which lead to
           stale TLBs, spurious faults and other hard to diagnose issues.
      
         - Use fault_in_pages_writeable() for prefaulting the user stack in the
           FPU code as it's less fragile than the current solution.
      
         - Use the PF_KTHREAD flag when checking for a kernel thread instead
           of current->mm as the latter can give the wrong answer due to
           use_mm()
      
         - Compute the vmemmap size correctly for KASLR and 5-Level paging.
           Otherwise this can end up with a way too small vmemmap area.
      
         - Make KASAN and 5-level paging work again by making sure that all
           invalid bits are masked out when computing the P4D offset. This
           worked before but got broken recently when the LDT remap area was
           moved.
      
         - Prevent a NULL pointer dereference in the resource control code
           which can be triggered with certain mount options when the
           requested resource is not available.
      
         - Enforce ordering of microcode loading vs. perf initialization on
           secondary CPUs. Otherwise perf tries to access a non-existing MSR
           as the boot CPU marked it as available.
      
         - Don't stop the resource control group walk early otherwise the
           control bitmaps are not updated correctly and become inconsistent.
      
         - Unbreak kgdb by returning 0 on success from
           kgdb_arch_set_breakpoint() instead of an error code.
      
         - Add more Icelake CPU model defines so depending changes can be
           queued in other trees"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
        x86/kasan: Fix boot with 5-level paging and KASAN
        x86/fpu: Don't use current->mm to check for a kthread
        x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
        x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
        x86/resctrl: Don't stop walking closids when a locksetup group is found
        x86/fpu: Update kernel's FPU state before using for the fsave header
        x86/mm/KASLR: Compute the size of the vmemmap section properly
        x86/fpu: Use fault_in_pages_writeable() for pre-faulting
        x86/CPU: Add more Icelake model numbers
        mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
        mm/vmalloc: Fix calculation of direct map addr range
      963172d9
    • Linus Torvalds's avatar
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · efba92d5
      Linus Torvalds authored
      Pull timer fixes from Thomas Gleixner:
       "A set of small fixes:
      
         - Repair the ktime_get_coarse() functions so they actually deliver
           what they are supposed to: tick granular time stamps. The current
           code failed to add the accumulated nanoseconds part of the
           timekeeper so the resulting granularity was 1 second.
      
         - Prevent the tracer from infinitely recursing into time getter
           functions in the ARM architected timer by marking these functions
           notrace
      
         - Fix a trivial compiler warning caused by wrong qualifier ordering"
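
      The granularity problem in the first item can be illustrated with plain
      arithmetic. The sketch below is userspace Python, not the timekeeping
      code: the function names are hypothetical and the 4 ms tick (250 Hz) is
      an assumed value chosen only for the example.

```python
NSEC_PER_SEC = 1_000_000_000
TICK_NS = 4_000_000  # assumed tick length (250 Hz), for illustration only


def coarse_without_ns(base_sec, ticks_since_sec):
    """Broken variant: the nanoseconds accumulated since base_sec are dropped."""
    return base_sec * NSEC_PER_SEC  # ticks_since_sec is ignored on purpose


def coarse_with_ns(base_sec, ticks_since_sec):
    """Repaired variant: the accumulated nanoseconds are added back."""
    return base_sec * NSEC_PER_SEC + ticks_since_sec * TICK_NS


if __name__ == "__main__":
    # Two readings taken 100 ticks apart within the same second.
    print(coarse_without_ns(7, 100) - coarse_without_ns(7, 0))
    print(coarse_with_ns(7, 100) - coarse_with_ns(7, 0))
```

      In the first variant both readings are identical until the seconds value
      changes, which is the 1 second granularity described above. In the second
      variant consecutive ticks differ by one tick length.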
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timekeeping: Repair ktime_get_coarse*() granularity
        clocksource/drivers/arm_arch_timer: Don't trace count reader functions
        clocksource/drivers/timer-ti-dm: Change to new style declaration
      efba92d5
    • Linus Torvalds's avatar
      Merge branch 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f763cf8e
      Linus Torvalds authored
      Pull RAS fixes from Thomas Gleixner:
       "Two small fixes for RAS:
      
         - Use a proper search algorithm to find the correct element in the
           CEC array. The replacement was a better choice than fixing the
           crash caused by the original search function with horrible duct
           tape.
      
         - Move the timer based decay function into thread context so it can
           actually acquire the mutex which protects the CEC array to prevent
           corruption"
      
      * 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        RAS/CEC: Convert the timer callback to a workqueue
        RAS/CEC: Fix binary search function
      f763cf8e
  3. 15 Jun, 2019 6 commits
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86 · e01e060f
      Linus Torvalds authored
      Pull x86 platform driver fixes from Andy Shevchenko:
      
       - fix a couple of Mellanox driver enumeration issues
      
       - fix ASUS laptop regression with backlight
      
       - fix Dell computers that got a wrong mode (tablet versus laptop) after
         resume
      
      * tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86:
        platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow
        platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration
        platform/x86: intel-vbtn: Report switch events when event wakes device
        platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
      e01e060f
    • Linus Torvalds's avatar
      Merge tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · ff39074b
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB driver fixes for 5.2-rc5
      
        Nothing major, just some small gadget fixes, usb-serial new device
        ids, a few new quirks, and some small fixes for some regressions that
        have been found after the big 5.2-rc1 merge.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: Make sure an alt mode exist before getting its partner
        usb: gadget: udc: lpc32xx: fix return value check in lpc32xx_udc_probe()
        usb: gadget: dwc2: fix zlp handling
        usb: dwc2: Set actual frame number for completed ISOC transfer for none DDMA
        usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC
        usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]
        usb: phy: mxs: Disable external charger detect in mxs_phy_hw_init()
        usb: dwc2: Fix DMA cache alignment issues
        usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)
        USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
        USB: usb-storage: Add new ID to ums-realtek
        usb: typec: ucsi: ccg: fix memory leak in do_flash
        USB: serial: option: add Telit 0x1260 and 0x1261 compositions
        USB: serial: pl2303: add Allied Telesis VT-Kit3
        USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode
      ff39074b
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · fa1827d7
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "One fix for a regression introduced by our 32-bit KASAN support, which
        broke booting on machines with "bootx" early debugging enabled.
      
        A fix for a bug which broke kexec on 32-bit, introduced by changes to
        the 32-bit STRICT_KERNEL_RWX support in v5.1.
      
        Finally two fixes going to stable for our THP split/collapse handling,
        discovered by Nick. The first fixes random crashes and/or corruption
        in guests under sufficient load.
      
        Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
        Malaterre"
      
      * tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
        powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
        powerpc/64s: Fix THP PMD collapse serialisation
        powerpc: Fix kexec failure on book3s/32
      fa1827d7
    • Linus Torvalds's avatar
      Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 6a71398c
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Out of range read of stack trace output
      
       - Fix for NULL pointer dereference in trace_uprobe_create()
      
       - Fix to a livepatching / ftrace permission race in the module code
      
       - Fix for NULL pointer dereference in free_ftrace_func_mapper()
      
       - A couple of build warning clean ups
      
      * tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
        module: Fix livepatch/ftrace module text permissions race
        tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
        tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
        tracing: Make two symbols static
        tracing: avoid build warning with HAVE_NOP_MCOUNT
        tracing: Fix out-of-range read in trace_stack_print()
      6a71398c
    • Borislav Petkov's avatar
      x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback · 78f4e932
      Borislav Petkov authored
      Adric Blake reported the following warning during suspend-resume:
      
        Enabling non-boot CPUs ...
        x86: Booting SMP configuration:
        smpboot: Booting Node 0 Processor 1 APIC 0x2
        unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
         at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
        Call Trace:
         intel_set_tfa
         intel_pmu_cpu_starting
         ? x86_pmu_dead_cpu
         x86_pmu_starting_cpu
         cpuhp_invoke_callback
         ? _raw_spin_lock_irqsave
         notify_cpu_starting
         start_secondary
         secondary_startup_64
        microcode: sig=0x806ea, pf=0x80, revision=0x96
        microcode: updated to revision 0xb4, date = 2019-04-01
        CPU1 is up
      
      The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
      by microcode. The log above shows that the microcode loader callback
      happens after the PMU restoration, leading to the conjecture that
      because the microcode hasn't been updated yet, that MSR is not present
      yet, leading to the #GP.
      
      Add a microcode loader-specific hotplug vector which comes before
      the PERF vectors and thus executes earlier and makes sure the MSR is
      present.
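
      The ordering argument can be sketched with a toy model of an ordered
      startup state machine. This is a userspace Python illustration under
      stated assumptions: the state numbers and the callback names
      "microcode_loader" and "perf_starting" are invented for the example and
      are not the kernel's hotplug state identifiers.

```python
def bring_up_cpu(callbacks):
    """Run startup callbacks in ascending state order and log what happens.

    callbacks is a list of (state_number, name) pairs; lower numbers run first.
    """
    msr_present = False  # the MSR only exists once the microcode is updated
    log = []
    for _state, name in sorted(callbacks):
        if name == "microcode_loader":
            msr_present = True
            log.append("microcode updated")
        elif name == "perf_starting":
            log.append("MSR write ok" if msr_present else "unchecked MSR access")
    return log


if __name__ == "__main__":
    before_fix = [(10, "perf_starting"), (20, "microcode_loader")]
    after_fix = [(5, "microcode_loader"), (10, "perf_starting")]
    print(bring_up_cpu(before_fix))
    print(bring_up_cpu(after_fix))
```

      In the first ordering the perf callback touches the MSR before the
      microcode update has made it available. Giving the loader a lower state
      number moves it ahead of the perf callback, so the write succeeds.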
      
      Fixes: 400816f6 ("perf/x86/intel: Implement support for TSX Force Abort")
      Reported-by: Adric Blake <promarbler14@gmail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: x86@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637
      78f4e932
    • Linus Torvalds's avatar
      Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 0011572c
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "This has an unusually high density of tricky fixes:
      
         - task_get_css() could deadlock when it races against a dying cgroup.
      
         - cgroup.procs didn't list thread group leaders with live threads.
      
           This could mislead readers to think that a cgroup is empty when
           it's not. Fixed by making PROCS iterator include dead tasks. I made
           a couple mistakes making this change and this pull request contains
           a couple follow-up patches.
      
         - When cpusets run out of online cpus, it updates cpumasks of member
           tasks in bizarre ways. Joel improved the behavior significantly"
      
      * 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cpuset: restore sanity to cpuset_cpus_allowed_fallback()
        cgroup: Fix css_task_iter_advance_css_set() cset skip condition
        cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
        cgroup: Include dying leaders with live threads in PROCS iterations
        cgroup: Implement css_task_iter_skip()
        cgroup: Call cgroup_release() before __exit_signal()
        docs cgroups: add another example size for hugetlb
        cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
      0011572c