1. 01 Aug, 2023 9 commits
    • Merge tag 'xfs-async-dio.6-2023-08-01' of git://git.kernel.dk/linux into iomap-6.6-mergeA · 377698d4
      Darrick J. Wong authored
      Improve iomap/xfs async dio write performance
      
      iomap always punts async dio write completions to a workqueue, which has
      a cost in terms of efficiency (now you need an unrelated worker to
      process it) and latency (now you're bouncing a completion through an
      async worker, which is a classic slowdown scenario).
      
      io_uring handles IRQ completions via task_work, and for writes that
      don't need to do extra IO at completion time, we can safely complete
      them inline from that. This patchset adds IOCB_DIO_CALLER_COMP, which an
      IO issuer can set to inform the completion side that any extra work that
      needs doing for that completion can be punted to a safe task context.
      
      The iomap dio completion will happen in hard/soft irq context, and we
      need a saner context to process these completions. IOCB_DIO_CALLER_COMP
      is added, which can be set in a struct kiocb->ki_flags by the issuer. If
      the completion side of the iocb handling understands this flag, it can
      choose to set a kiocb->dio_complete() handler and just call ki_complete
      from IRQ context. The issuer must then ensure that this callback is
      processed from a task. io_uring punts IRQ completions to task_work
      already, so it's trivial to wire it up to run more of the completion before
      posting a CQE. This is good for up to a 37% improvement in
      throughput/latency for low queue depth IO; patch 5 has the details.
      
      If we need to do real work at completion time, iomap will clear the
      IOMAP_DIO_CALLER_COMP flag.
      
      This work came about when Andres tested low queue depth dio writes for
      postgres and compared it to doing sync dio writes, showing that the
      async processing slows us down a lot.
      
      * tag 'xfs-async-dio.6-2023-08-01' of git://git.kernel.dk/linux:
        iomap: support IOCB_DIO_CALLER_COMP
        io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
        fs: add IOCB flags related to passing back dio completions
        iomap: add IOMAP_DIO_INLINE_COMP
        iomap: only set iocb->private for polled bio
        iomap: treat a write through cache the same as FUA
        iomap: use an unsigned type for IOMAP_DIO_* defines
        iomap: cleanup up iomap_dio_bio_end_io()
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      377698d4
    • iomap: support IOCB_DIO_CALLER_COMP · 8c052fb3
      Jens Axboe authored
      If IOCB_DIO_CALLER_COMP is set, utilize that to set kiocb->dio_complete
      handler and data for that callback. Rather than punt the completion to a
      workqueue, we pass back the handler and data to the issuer and will get
      a callback from a safe task context.
      
      Using the following fio job to randomly dio write 4k blocks at
      queue depths of 1..16:
      
      fio --name=dio-write --filename=/data1/file --time_based=1 \
      --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
      --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth
      
      shows the following results before and after this patch:
      
      	Stock	Patched		Diff
      =======================================
      QD1	155K	162K		+ 4.5%
      QD2	290K	313K		+ 7.9%
      QD4	533K	597K		+12.0%
      QD8	604K	827K		+36.9%
      QD16	615K	845K		+37.4%
      
      which shows nice wins all around. If we factored in per-IOP efficiency,
      the wins look even nicer. This becomes apparent as queue depth rises,
      as the offloaded workqueue completions run out of steam.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8c052fb3
    • io_uring/rw: add write support for IOCB_DIO_CALLER_COMP · 099ada2c
      Jens Axboe authored
      If the filesystem dio handler understands IOCB_DIO_CALLER_COMP, we'll
      get a kiocb->ki_complete() callback with kiocb->dio_complete set. In
      that case, rather than complete the IO directly through task_work, queue
      up an intermediate task_work handler that first processes this callback
      and then immediately completes the request.
      
      For XFS, this avoids a punt through a workqueue, which is a lot less
      efficient and adds latency to lower queue depth (or sync) O_DIRECT
      writes.
      
      Only do this for non-polled IO, as polled IO doesn't need this kind
      of deferral as it always completes within the task itself. This then
      avoids a check for deferral in the polled IO completion handler.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      099ada2c
    • fs: add IOCB flags related to passing back dio completions · 9cf3516c
      Jens Axboe authored
      Async dio completions generally happen from hard/soft IRQ context, which
      means that users like iomap may need to defer some of the completion
      handling to a workqueue. This is less efficient than having the original
      issuer handle it, like we do for sync IO, and it adds latency to the
      completions.
      
      Add IOCB_DIO_CALLER_COMP, which the issuer can set if it is able to
      safely punt these completions to a safe context. If the dio handler is
      aware of this flag, assign a callback handler in kiocb->dio_complete and
      associated data in kiocb->private. The issuer will then call this
      handler with that data from task context.
      
      No functional changes in this patch.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9cf3516c
    • iomap: add IOMAP_DIO_INLINE_COMP · 7b3c14d1
      Jens Axboe authored
      Rather than gate whether or not we need to punt a dio completion to a
      workqueue on whether the IO is a write or not, add an explicit flag for
      it. For now we treat them the same: reads always set the flag and async
      writes do not.
      
      No functional changes in this patch.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7b3c14d1
    • iomap: only set iocb->private for polled bio · daa99c5a
      Jens Axboe authored
      iocb->private is only used for polled IO, where the completer will
      find the bio to poll through that field.
      
      Assign it when we're submitting a polled bio, and get rid of the
      dio->poll_bio indirection.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      daa99c5a
    • iomap: treat a write through cache the same as FUA · 3a0be38c
      Jens Axboe authored
      Whether we have a write back cache and are using FUA or don't have
      a write back cache at all is the same situation. Treat them the same.
      
      This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
      two cases that provide stable writes:
      
      1) Volatile write cache with FUA writes
      2) Normal write without a volatile write cache
      
      Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
      update some of the FUA comments as well.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a0be38c
    • iomap: use an unsigned type for IOMAP_DIO_* defines · 44842f64
      Jens Axboe authored
      IOMAP_DIO_DIRTY shifts by 31 bits, which makes UBSAN unhappy. Clean up
      all the defines by making the shifted value an unsigned value.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      44842f64
    • iomap: cleanup up iomap_dio_bio_end_io() · 3486237c
      Jens Axboe authored
      Make the logic a bit easier to follow:
      
      1) Add a release_bio out path, as everybody needs to touch that, and
         have our bio ref check jump there if it's non-zero.
      2) Add a kiocb local variable.
      3) Add comments for each of the three conditions (sync, inline, or
         async workqueue punt).
      
      No functional changes in this patch.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3486237c
  2. 25 Jul, 2023 9 commits
  3. 24 Jul, 2023 11 commits
  4. 23 Jul, 2023 11 commits
    • Linux 6.5-rc3 · 6eaae198
      Linus Torvalds authored
      6eaae198
    • Merge tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 3b4e48b8
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Swapping the ring buffer for snapshotting (for things like irqsoff)
         can crash if the ring buffer is being resized. Disable swapping when
         this happens. The missed swap will be reported to the tracer
      
       - Report error if the histogram fails to be created due to an error in
         adding a histogram variable, in event_hist_trigger_parse()
      
       - Remove unused declaration of tracing_map_set_field_descr()
      
      * tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing/histograms: Return an error if we fail to add histogram to hist_vars list
        ring-buffer: Do not swap cpu_buffer during resize process
        tracing: Remove unused extern declaration tracing_map_set_field_descr()
      3b4e48b8
    • Merge tag 'kbuild-fixes-v6.5' of... · 12a5336c
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Fix stale help text in gconfig
      
       - Support *.S files in compile_commands.json
      
       - Flatten KBUILD_CFLAGS
      
       - Fix external module builds with Rust so that temporary files are
         created in the modules directories instead of the kernel tree
      
      * tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: rust: avoid creating temporary files
        kbuild: flatten KBUILD_CFLAGS
        gen_compile_commands: add assembly files to compilation database
        kconfig: gconfig: correct program name in help text
        kconfig: gconfig: drop the Show Debug Info help text
      12a5336c
    • kbuild: rust: avoid creating temporary files · df01b7cf
      Miguel Ojeda authored
      `rustc` outputs by default the temporary files (i.e. the ones saved
      by `-Csave-temps`, such as `*.rcgu*` files) in the current working
      directory when `-o` and `--out-dir` are not given (even if
      `--emit=x=path` is given, i.e. it does not use those for temporaries).
      
      Since out-of-tree modules are compiled from the `linux` tree,
      `rustc` then tries to create them there, which may not be accessible.
      
      Thus pass `--out-dir` explicitly, even if it is just for the temporary
      files.
      
      Similarly, do so for Rust host programs too.
      Reported-by: Raphael Nestler <raphael.nestler@gmail.com>
      Closes: https://github.com/Rust-for-Linux/linux/issues/1015
      Reported-by: Andrea Righi <andrea.righi@canonical.com>
      Tested-by: Raphael Nestler <raphael.nestler@gmail.com> # non-hostprogs
      Tested-by: Andrea Righi <andrea.righi@canonical.com> # non-hostprogs
      Fixes: 295d8398 ("kbuild: specify output names separately for each emission type from rustc")
      Cc: stable@vger.kernel.org
      Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
      Tested-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      df01b7cf
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 269f4a4b
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "ARM:
      
         - Avoid pKVM finalization if KVM initialization fails
      
         - Add missing BTI instructions in the hypervisor, fixing an early
           boot failure on BTI systems
      
         - Handle MMU notifiers correctly for non hugepage-aligned memslots
      
         - Work around a bug in the architecture where hypervisor timer
           controls have UNKNOWN behavior under nested virt
      
         - Disable preemption in kvm_arch_hardware_enable(), fixing a kernel
           BUG in cpu hotplug resulting from per-CPU accessor sanity checking
      
         - Make WFI emulation on GICv4 systems robust w.r.t. preemption,
           consistently requesting a doorbell interrupt on vcpu_put()
      
         - Uphold RES0 sysreg behavior when emulating older PMU versions
      
         - Avoid macro expansion when initializing PMU register names,
           ensuring the tracepoints pretty-print the sysreg
      
        s390:
      
         - Two fixes for asynchronous destroy
      
        x86 fixes will come early next week"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: s390: pv: fix index value of replaced ASCE
        KVM: s390: pv: simplify shutdown and fix race
        KVM: arm64: Fix the name of sys_reg_desc related to PMU
        KVM: arm64: Correctly handle RES0 bits PMEVTYPER<n>_EL0.evtCount
        KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
        KVM: arm64: Add missing BTI instructions
        KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
        KVM: arm64: Disable preemption in kvm_arch_hardware_enable()
        KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm
        KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits
      269f4a4b
    • Merge tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 15b593ba
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's
        checkpoint code"
      
      * tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix rbtree traversal bug in ext4_mb_use_preallocated
        ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()
        ext4: correct inline offset when handling xattrs in inode body
        jbd2: remove __journal_try_to_free_buffer()
        jbd2: fix a race when checking checkpoint buffer busy
        jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
        jbd2: remove journal_clean_one_cp_list()
        jbd2: remove t_checkpoint_io_list
        jbd2: recheck chechpointing non-dirty buffer
      15b593ba
    • Merge tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6 · 8266f53b
      Linus Torvalds authored
      Pull smb client fix from Steve French:
       "Add minor debugging improvement.
      
        The change improves ability to read a network trace to debug problems
        on encrypted connections which are very common (e.g. using wireshark
        or tcpdump).
      
        That works today with tools like 'smbinfo keys /mnt/file' but requires
        passing in a filename on the mount (see e.g. [1]), but it often makes
        more sense to just pass in the mount point path (ie a directory not a
        filename).
      
        So this fix was needed to debug some types of problems (an obvious
        example is on an encrypted connection failing operations on an empty
        share or with no files in the root of the directory) - so you can
        simply pass in the 'smbinfo keys <mntpoint>' and get the information
        that wireshark needs"
      
      Link: https://wiki.samba.org/index.php/Wireshark_Decryption [1]
      
      * tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: update internal module version number for cifs.ko
        cifs: allow dumping keys for directories too
      8266f53b
    • Merge tag 'kvm-s390-master-6.5-1' of... · 0c189708
      Paolo Bonzini authored
      Merge tag 'kvm-s390-master-6.5-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      Two fixes for asynchronous destroy
      0c189708
    • Merge tag 'kvmarm-fixes-6.5-1' of... · 675a15f4
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 6.5, part #1
      
       - Avoid pKVM finalization if KVM initialization fails
      
       - Add missing BTI instructions in the hypervisor, fixing an early boot
         failure on BTI systems
      
       - Handle MMU notifiers correctly for non hugepage-aligned memslots
      
       - Work around a bug in the architecture where hypervisor timer controls
         have UNKNOWN behavior under nested virt.
      
       - Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG
         in cpu hotplug resulting from per-CPU accessor sanity checking.
      
       - Make WFI emulation on GICv4 systems robust w.r.t. preemption,
         consistently requesting a doorbell interrupt on vcpu_put()
      
       - Uphold RES0 sysreg behavior when emulating older PMU versions
      
       - Avoid macro expansion when initializing PMU register names, ensuring
         the tracepoints pretty-print the sysreg.
      675a15f4
    • tracing/histograms: Return an error if we fail to add histogram to hist_vars list · 4b8b3905
      Mohamed Khalfella authored
      Commit 6018b585 ("tracing/histograms: Add histograms to hist_vars if
      they have referenced variables") added a check to fail histogram creation
      if save_hist_vars() failed to add the histogram to the hist_vars list. But
      that commit did not set ret to a failure code before jumping to the
      histogram unregister path. Fix it.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com
      
      Cc: stable@vger.kernel.org
      Fixes: 6018b585 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
      Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      4b8b3905
    • ring-buffer: Do not swap cpu_buffer during resize process · 8a96c028
      Chen Lin authored
      When ring_buffer_swap_cpu was called during the resize process,
      the cpu buffer was swapped in the middle, resulting in an incorrect state.
      Continuing to run in the wrong state will result in an oops.
      
      This issue can be easily reproduced using the following two scripts:
      /tmp # cat test1.sh
      //#! /bin/sh
      for i in `seq 0 100000`
      do
               echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
               sleep 0.5
               echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
               sleep 0.5
      done
      /tmp # cat test2.sh
      //#! /bin/sh
      for i in `seq 0 100000`
      do
              echo irqsoff > /sys/kernel/debug/tracing/current_tracer
              sleep 1
              echo nop > /sys/kernel/debug/tracing/current_tracer
              sleep 1
      done
      /tmp # ./test1.sh &
      /tmp # ./test2.sh &
      
      A typical oops log is shown below; other oops logs sometimes appear as well.
      
      [  231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
      [  231.713375] Modules linked in:
      [  231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec2 #15
      [  231.716750] Hardware name: linux,dummy-virt (DT)
      [  231.718152] Workqueue: events update_pages_handler
      [  231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  231.721171] pc : rb_update_pages+0x378/0x3f8
      [  231.722212] lr : rb_update_pages+0x25c/0x3f8
      [  231.723248] sp : ffff800082b9bd50
      [  231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
      [  231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
      [  231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
      [  231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
      [  231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
      [  231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
      [  231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
      [  231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
      [  231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
      [  231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
      [  231.744196] Call trace:
      [  231.744892]  rb_update_pages+0x378/0x3f8
      [  231.745893]  update_pages_handler+0x1c/0x38
      [  231.746893]  process_one_work+0x1f0/0x468
      [  231.747852]  worker_thread+0x54/0x410
      [  231.748737]  kthread+0x124/0x138
      [  231.749549]  ret_from_fork+0x10/0x20
      [  231.750434] ---[ end trace 0000000000000000 ]---
      [  233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  233.721696] Mem abort info:
      [  233.721935]   ESR = 0x0000000096000004
      [  233.722283]   EC = 0x25: DABT (current EL), IL = 32 bits
      [  233.722596]   SET = 0, FnV = 0
      [  233.722805]   EA = 0, S1PTW = 0
      [  233.723026]   FSC = 0x04: level 0 translation fault
      [  233.723458] Data abort info:
      [  233.723734]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
      [  233.724176]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
      [  233.724589]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
      [  233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
      [  233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
      [  233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
      [  233.726720] Modules linked in:
      [  233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G        W          6.5.0-rc1-00276-g20edcec2 #15
      [  233.727777] Hardware name: linux,dummy-virt (DT)
      [  233.728225] Workqueue: events update_pages_handler
      [  233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  233.729054] pc : rb_update_pages+0x1a8/0x3f8
      [  233.729334] lr : rb_update_pages+0x154/0x3f8
      [  233.729592] sp : ffff800082b9bd50
      [  233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
      [  233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
      [  233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
      [  233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
      [  233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
      [  233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
      [  233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
      [  233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
      [  233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
      [  233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
      [  233.734418] Call trace:
      [  233.734593]  rb_update_pages+0x1a8/0x3f8
      [  233.734853]  update_pages_handler+0x1c/0x38
      [  233.735148]  process_one_work+0x1f0/0x468
      [  233.735525]  worker_thread+0x54/0x410
      [  233.735852]  kthread+0x124/0x138
      [  233.736064]  ret_from_fork+0x10/0x20
      [  233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
      [  233.736959] ---[ end trace 0000000000000000 ]---
      
      After analysis, the sequence leading to the error is as follows (steps 1-5):
      
      int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
      			int cpu_id)
      {
      	for_each_buffer_cpu(buffer, cpu) {
      		cpu_buffer = buffer->buffers[cpu];
      		//1. get cpu_buffer, aka cpu_buffer(A)
      		...
      		...
      		schedule_work_on(cpu,
      		 &cpu_buffer->update_pages_work);
      		//2. 'update_pages_work' is queued on 'cpu'; cpu_buffer(A) is passed to
      		// update_pages_handler, which does the update and then calls
      		// complete(&cpu_buffer->update_done) to wake up the resize process.
      	//---->
      		//3. Just at this moment, ring_buffer_swap_cpu is triggered and
      		//cpu_buffer(A) is swapped with cpu_buffer(B), the max_buffer.
      		//ring_buffer_swap_cpu is called as in the 'Call trace' below.
      
      		Call trace:
      		 dump_backtrace+0x0/0x2f8
      		 show_stack+0x18/0x28
      		 dump_stack+0x12c/0x188
      		 ring_buffer_swap_cpu+0x2f8/0x328
      		 update_max_tr_single+0x180/0x210
      		 check_critical_timing+0x2b4/0x2c8
      		 tracer_hardirqs_on+0x1c0/0x200
      		 trace_hardirqs_on+0xec/0x378
      		 el0_svc_common+0x64/0x260
      		 do_el0_svc+0x90/0xf8
      		 el0_svc+0x20/0x30
      		 el0_sync_handler+0xb0/0xb8
      		 el0_sync+0x180/0x1c0
      	//<----
      
      	/* wait for all the updates to complete */
      	for_each_buffer_cpu(buffer, cpu) {
      		cpu_buffer = buffer->buffers[cpu];
      		//4. get cpu_buffer again; cpu_buffer(B) is now used in the following
      		//steps, so the state of cpu_buffer(A) and cpu_buffer(B) is inconsistent.
      		//For example, cpu_buffer(A)->update_done is left completed, so the
      		//next resize round will not block in 'wait_for_completion'.
      		if (!cpu_buffer->nr_pages_to_update)
      			continue;
      
      		if (cpu_online(cpu))
      			wait_for_completion(&cpu_buffer->update_done);
      		cpu_buffer->nr_pages_to_update = 0;
      	}
      	...
      }
      	//5. The state of cpu_buffer(A) and cpu_buffer(B) is now wrong;
      	//continuing to run in that state leads to the oops.
      
      Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn
      Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      8a96c028