1. 24 Feb, 2024 4 commits
    • Ankit Agrawal's avatar
      vfio: Convey kvm that the vfio-pci device is wc safe · a39d3a96
      Ankit Agrawal authored
      The VM_ALLOW_ANY_UNCACHED flag is implemented for ARM64,
      allowing KVM stage 2 device mapping attributes to use Normal-NC
      rather than DEVICE_nGnRE, which allows guest mappings supporting
      write-combining attributes (WC). ARM does not architecturally
      guarantee this is safe, and indeed some MMIO regions like the GICv2
      VCPU interface can trigger uncontained faults if Normal-NC is used.
      
      To safely use VFIO in KVM the platform must guarantee full safety
      in the guest where no action taken against an MMIO mapping can
      trigger an uncontained failure. The expectation is that most VFIO PCI
      platforms support this for both mapping types, at least in common
      flows, based on some expectations of how PCI IP is integrated. So
      make vfio-pci set the VM_ALLOW_ANY_UNCACHED flag.
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
      Link: https://lore.kernel.org/r/20240224150546.368-5-ankita@nvidia.com
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
      a39d3a96
    • Ankit Agrawal's avatar
      KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device · 8c47ce3e
      Ankit Agrawal authored
      To give a VM the ability to map device IO memory with the NormalNC
      property, map device MMIO in KVM for ARM64 at stage 2 as NormalNC.
      Having NormalNC S2 default puts guests in control (based on [1],
      "Combining stage 1 and stage 2 memory type attributes") of device
      MMIO regions memory mappings. The rules are summarized below:
      ([(S1) - stage1], [(S2) - stage 2])
      
      S1           |  S2           | Result
      NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
      NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
      NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
      DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>
      
      Still, this cannot be generalized to non-PCI devices such as GICv2.
      There is insufficient information about, and uncertainty in, the
      behavior of non-PCI drivers. A driver must indicate support using the
      new flag VM_ALLOW_ANY_UNCACHED.
      
      Adapt KVM to use the flag VM_ALLOW_ANY_UNCACHED as the indicator to
      activate the S2 setting to NormalNC.
      
      [1] section D8.5.5 of DDI0487J_a_a-profile_architecture_reference_manual.pdf
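
The selection described above can be sketched as a small userspace model. This is only an illustration: the `demo_` names, the flag value, and the enum are hypothetical stand-ins, not the kernel's definitions, and the real decision is made inside KVM's stage 2 fault handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bit for illustration only; not the kernel's value. */
#define DEMO_VM_ALLOW_ANY_UNCACHED (1UL << 0)

/* Hypothetical model of the stage 2 memory types discussed above. */
enum demo_s2_memtype {
	DEMO_S2_NORMAL,       /* RAM-backed memory */
	DEMO_S2_DEVICE_nGnRE, /* conservative default for MMIO */
	DEMO_S2_NORMAL_NC,    /* MMIO whose driver set the flag */
};

/* Hypothetical stand-in for the VMA backing the faulting address. */
struct demo_vma {
	unsigned long vm_flags;
};

/*
 * Pick the stage 2 memory type: device memory is only relaxed to
 * Normal-NC when the driver that owns the VMA opted in via the flag.
 */
static enum demo_s2_memtype
demo_pick_s2_memtype(const struct demo_vma *vma, bool is_device)
{
	if (!is_device)
		return DEMO_S2_NORMAL;
	if (vma->vm_flags & DEMO_VM_ALLOW_ANY_UNCACHED)
		return DEMO_S2_NORMAL_NC;
	return DEMO_S2_DEVICE_nGnRE;
}
```

In this model a VMA without the flag keeps the DEVICE_nGnRE default, which matches the requirement that non-PCI drivers must opt in explicitly.
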
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
      Link: https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
      8c47ce3e
    • Ankit Agrawal's avatar
      mm: Introduce new flag to indicate wc safe · 5c656fcd
      Ankit Agrawal authored
      The VM_ALLOW_ANY_UNCACHED flag is implemented for ARM64, allowing KVM
      stage 2 device mapping attributes to use NormalNC rather than
      DEVICE_nGnRE, which allows guest mappings supporting write-combining
      attributes (WC). ARM does not architecturally guarantee this is safe,
      and indeed some MMIO regions like the GICv2 VCPU interface can trigger
      uncontained faults if NormalNC is used.
      
      Even worse, the expectation is that there are platforms where even
      DEVICE_nGnRE can allow uncontained faults in corner cases. Unfortunately
      existing ARM IP requires platform integration to take responsibility to
      prevent this.
      
      To safely use VFIO in KVM the platform must guarantee full safety in the
      guest where no action taken against an MMIO mapping can trigger an
      uncontained failure. The assumption is that most VFIO PCI platforms
      support this for both mapping types, at least in common flows, based
      on some expectations of how PCI IP is integrated. This can be enabled
      more broadly, for instance into vfio-platform drivers, but only after
      the platform vendor completes auditing for safety.
      
      The VMA flag VM_ALLOW_ANY_UNCACHED was found to be the simplest and
      cleanest way to communicate the information from VFIO to KVM that
      mapping the region in S2 as NormalNC is safe. KVM consumes it to
      activate the code that does the S2 mapping as NormalNC.
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
      Link: https://lore.kernel.org/r/20240224150546.368-3-ankita@nvidia.com
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
      5c656fcd
    • Ankit Agrawal's avatar
      KVM: arm64: Introduce new flag for non-cacheable IO memory · c034ec84
      Ankit Agrawal authored
      Currently, KVM for ARM64 maps at stage 2 memory that is considered device
      (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
      overrides (as per the ARM architecture [1]) any device MMIO mapping
      present at stage 1, resulting in a set-up whereby a guest operating
      system cannot determine device MMIO mapping memory attributes on its
      own; they are always overridden by the KVM stage 2 default.
      
      This set-up does not allow guest operating systems to select device
      memory attributes independently from KVM stage-2 mappings
      (refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
      which turns out to be an issue in that guest operating systems
      (e.g. Linux) may request to map device MMIO regions with memory
      attributes, such as the NormalNC memory type, that guarantee better
      performance (e.g. the gathering attribute, which for some devices can
      generate larger PCIe memory write TLPs) and allow specific operations
      (e.g. unaligned transactions).
      
      The default device stage 2 mapping was chosen in KVM for ARM64 since
      it was considered safer (i.e. it would not allow guests to trigger
      uncontained failures ultimately crashing the machine) but this
      turned out to be asynchronous (SError) defeating the purpose.
      
      Failure containability is a property of the platform and is independent
      of the memory type used for MMIO device memory mappings.
      
      Actually, the DEVICE_nGnRE memory type is even more problematic than
      the Normal-NC memory type in terms of fault containability in that e.g.
      aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
      synchronous (i.e. that would imply that the processor should issue at
      most 1 load transaction at a time - it cannot pipeline them - otherwise
      the synchronous abort semantics would break the no-speculation attribute
      attached to DEVICE_XXX memory).
      
      This means that regardless of the combined stage1+stage2 mappings a
      platform is safe if and only if device transactions cannot trigger
      uncontained failures and that in turn relies on platform capabilities
      and the device type being assigned (i.e. PCIe AER/DPC error containment
      and RAS architecture[3]); therefore the default KVM device stage 2
      memory attributes play no role in making device assignment safer
      for a given platform (if the platform design adheres to design
      guidelines outlined in [3]) and therefore can be relaxed.
      
      For all these reasons, relax the KVM stage 2 device memory attributes
      from DEVICE_nGnRE to Normal-NC.
      
      The NormalNC was chosen over a different Normal memory type default
      at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
      
      Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
      trigger any issue on guest device reclaim use cases either (i.e. device
      MMIO unmap followed by a device reset) at least for PCIe devices, in that
      in PCIe a device reset is architected and carried out through PCI config
      space transactions that are naturally ordered with respect to MMIO
      transactions according to the PCI ordering rules.
      
      Having Normal-NC S2 default puts guests in control (thanks to
      stage1+stage2 combined memory attributes rules [1]) of device MMIO
      regions memory mappings, according to the rules described in [1]
      and summarized here ([(S1) - stage1], [(S2) - stage 2]):
      
      S1           |  S2           | Result
      NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
      NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
      NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
      DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>
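
The table can be reproduced with a simplified model in which the more restrictive of the two stages wins. This is a sketch under assumptions: the `demo_` names are hypothetical, all DEVICE<attr> variants are collapsed into one value, and features such as FWB are ignored.

```c
#include <assert.h>

/* Hypothetical model, ordered from most to least restrictive. */
enum demo_memtype {
	DEMO_DEVICE,    /* any DEVICE<attr>, collapsed for simplicity */
	DEMO_NORMAL_NC,
	DEMO_NORMAL_WT,
	DEMO_NORMAL_WB,
};

/*
 * Combine stage 1 and stage 2 memory types: the more restrictive
 * type (the smaller enum value in this model) is the result.
 */
static enum demo_memtype
demo_combine(enum demo_memtype s1, enum demo_memtype s2)
{
	return s1 < s2 ? s1 : s2;
}
```

With `s2 == DEMO_NORMAL_NC` the guest's stage 1 choice decides between Device and Normal-NC, whereas with `s2 == DEMO_DEVICE` the result is always Device, which is the override that the previous default imposed.
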
      
      It is worth noting that currently, to map device MMIO space to user
      space in a device pass-through use case the VFIO framework applies memory
      attributes derived from pgprot_noncached() settings applied to VMAs, which
      result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
      
      This means that a userspace mapping for device MMIO space carried
      out with the current VFIO framework and a guest OS mapping for the same
      MMIO space may result in a mismatched alias as described in [2].
      
      Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
      change anything in this respect, in that the mismatched aliases would
      only affect (refer to [2] for a detailed explanation) ordering between
      the streams of transactions resulting from the userspace and guest OS
      mappings (i.e. it does not cause loss of property for either stream of
      transactions on its own), which is harmless given that the userspace
      and guest OS accesses to the device are carried out through independent
      transaction streams.
      
      A Normal-NC flag is not present today. So add a new kvm_pgtable_prot
      (KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its
      corresponding PTE value 0x5 (0b101) determined from [1].
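
As a rough illustration of how such an attribute value ends up in a descriptor, the sketch below places a 4-bit MemAttr value into a stage 2 PTE. The field position (bits [5:2]), the Device-nGnRE value 0x1, and all `demo_` names are assumptions made for this example; only the Normal-NC value 0x5 comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the stage 2 MemAttr field occupies bits [5:2]. */
#define DEMO_S2_MEMATTR_SHIFT 2
#define DEMO_S2_MEMATTR_MASK  (UINT64_C(0xf) << DEMO_S2_MEMATTR_SHIFT)

/* Assumed encoding for the previous default. */
#define DEMO_MT_S2_DEVICE_nGnRE UINT64_C(0x1)
/* Value given in the commit message: 0x5 (0b101). */
#define DEMO_MT_S2_NORMAL_NC    UINT64_C(0x5)

/* Replace the MemAttr field of a PTE, leaving all other bits intact. */
static uint64_t demo_set_s2_memattr(uint64_t pte, uint64_t memattr)
{
	pte &= ~DEMO_S2_MEMATTR_MASK;
	pte |= (memattr << DEMO_S2_MEMATTR_SHIFT) & DEMO_S2_MEMATTR_MASK;
	return pte;
}
```

Under these assumptions, Normal-NC shifted into the field yields 0x14, and bits outside the field are preserved.
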
      
      Lastly, adapt the stage2 PTE property setter function
      (stage2_set_prot_attr) to handle the NormalNC attribute.
      
      The entire discussion leading to this patch series may be followed through
      the following links.
      Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com
      Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com
      
      [1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
      [2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
      [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
      Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Will Deacon <will@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
      Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com
      Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
      c034ec84
  2. 21 Jan, 2024 36 commits
    • Linus Torvalds's avatar
      Linux 6.8-rc1 · 6613476e
      Linus Torvalds authored
      6613476e
    • Linus Torvalds's avatar
      Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs · 35a4474b
      Linus Torvalds authored
      Pull more bcachefs updates from Kent Overstreet:
       "Some fixes, some refactoring, some minor features:
      
         - Assorted prep work for disk space accounting rewrite
      
         - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
           makes our trigger context more explicit
      
         - A few fixes to avoid excessive transaction restarts on
           multithreaded workloads: fstests (in addition to ktest tests) are
           now checking slowpath counters, and that's shaking out a few bugs
      
         - Assorted tracepoint improvements
      
         - Starting to break up bcachefs_format.h and move on disk types so
           they're with the code they belong to; this will make room to start
           documenting the on disk format better.
      
         - A few minor fixes"
      
      * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
        bcachefs: Improve inode_to_text()
        bcachefs: logged_ops_format.h
        bcachefs: reflink_format.h
        bcachefs; extents_format.h
        bcachefs: ec_format.h
        bcachefs: subvolume_format.h
        bcachefs: snapshot_format.h
        bcachefs: alloc_background_format.h
        bcachefs: xattr_format.h
        bcachefs: dirent_format.h
        bcachefs: inode_format.h
        bcachefs; quota_format.h
        bcachefs: sb-counters_format.h
        bcachefs: counters.c -> sb-counters.c
        bcachefs: comment bch_subvolume
        bcachefs: bch_snapshot::btime
        bcachefs: add missing __GFP_NOWARN
        bcachefs: opts->compression can now also be applied in the background
        bcachefs: Prep work for variable size btree node buffers
        bcachefs: grab s_umount only if snapshotting
        ...
      35a4474b
    • Linus Torvalds's avatar
      Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4fbbed78
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Updates for time and clocksources:
      
         - A fix for the idle and iowait time accounting vs CPU hotplug.
      
           The time is reset on CPU hotplug which makes the accumulated
           systemwide time jump backwards.
      
         - Assorted fixes and improvements for clocksource/event drivers"
      
      * tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
        clocksource/drivers/ep93xx: Fix error handling during probe
        clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
        clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
        clocksource/timer-riscv: Add riscv_clock_shutdown callback
        dt-bindings: timer: Add StarFive JH8100 clint
        dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
      4fbbed78
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 7b297a5c
      Linus Torvalds authored
      Pull powerpc fixes from Aneesh Kumar:
      
       - Increase default stack size to 32KB for Book3S
      
      Thanks to Michael Ellerman.
      
      * tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Increase default stack size to 32KB
      7b297a5c
    • Kent Overstreet's avatar
      bcachefs: Improve inode_to_text() · 249f441f
      Kent Overstreet authored
      Add line breaks - inode_to_text() is now much easier to read.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      249f441f
    • Kent Overstreet's avatar
      bcachefs: logged_ops_format.h · d826cc57
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d826cc57
    • Kent Overstreet's avatar
      bcachefs: reflink_format.h · 8d52ba60
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8d52ba60
    • Kent Overstreet's avatar
      bcachefs; extents_format.h · b2fa1b63
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b2fa1b63
    • Kent Overstreet's avatar
      bcachefs: ec_format.h · 0560eb9a
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      0560eb9a
    • Kent Overstreet's avatar
      bcachefs: subvolume_format.h · c6c4ff65
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      c6c4ff65
    • Kent Overstreet's avatar
      bcachefs: snapshot_format.h · 8fed323b
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8fed323b
    • Kent Overstreet's avatar
      d455179f
    • Kent Overstreet's avatar
      bcachefs: xattr_format.h · 72e08010
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      72e08010
    • Kent Overstreet's avatar
      bcachefs: dirent_format.h · 7ffc4daa
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      7ffc4daa
    • Kent Overstreet's avatar
      bcachefs: inode_format.h · b36425da
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b36425da
    • Kent Overstreet's avatar
      bcachefs; quota_format.h · 82de6207
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      82de6207
    • Kent Overstreet's avatar
      bcachefs: sb-counters_format.h · 43314801
      Kent Overstreet authored
      bcachefs_format.h has gotten too big; let's do some organizing.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      43314801
    • Kent Overstreet's avatar
      3a58dfbc
    • Kent Overstreet's avatar
      12207f49
    • Kent Overstreet's avatar
      bcachefs: bch_snapshot::btime · d32088f2
      Kent Overstreet authored
      Add a field to bch_snapshot for creation time; this will be important
      when we start exposing the snapshot tree to userspace.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d32088f2
    • Kent Overstreet's avatar
      7be0208f
    • Kent Overstreet's avatar
      bcachefs: opts->compression can now also be applied in the background · d7e77f53
      Kent Overstreet authored
      The "apply this compression method in the background" paths now use the
      compression option if background_compression is not set; this means that
      setting or changing the compression option will cause existing data to
      be compressed accordingly in the background.
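
The fallback described above can be sketched as follows. This is a minimal illustration, assuming that a value of 0 means "not set"; the struct and function names are hypothetical and not taken from the bcachefs sources.

```c
#include <assert.h>

/* Hypothetical option set; 0 is assumed to mean "not set". */
struct demo_io_opts {
	unsigned compression;
	unsigned background_compression;
};

/*
 * Effective method for background compression: the dedicated option
 * if it is set, otherwise the regular compression option.
 */
static unsigned demo_background_compression(const struct demo_io_opts *opts)
{
	return opts->background_compression
		? opts->background_compression
		: opts->compression;
}
```
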
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d7e77f53
    • Kent Overstreet's avatar
      bcachefs: Prep work for variable size btree node buffers · ec4edd7b
      Kent Overstreet authored
      bcachefs btree nodes are big - typically 256k - and btree roots are
      pinned in memory. As we're now up to 18 btrees, we now have significant
      memory overhead in mostly empty btree roots.
      
      And in the future we're going to start enforcing that certain btree node
      boundaries exist, to solve lock contention issues - analogous to XFS's
      AGIs.
      
      Thus, we need to start allocating smaller btree node buffers when we
      can. This patch changes code that refers to the filesystem constant
      c->opts.btree_node_size to refer to the btree node buffer size -
      btree_buf_bytes() - where appropriate.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      ec4edd7b
    • Su Yue's avatar
      bcachefs: grab s_umount only if snapshotting · 2acc59dd
      Su Yue authored
      When I was testing mongodb over bcachefs with compression,
      there was a lockdep warning when snapshotting the mongodb data volume.
      
      $ cat test.sh
      prog=bcachefs
      
      $prog subvolume create /mnt/data
      $prog subvolume create /mnt/data/snapshots
      
      while true;do
          $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
          sleep 1s
      done
      
      $ cat /etc/mongodb.conf
      systemLog:
        destination: file
        logAppend: true
        path: /mnt/data/mongod.log
      
      storage:
        dbPath: /mnt/data/
      
      lockdep reports:
      [ 3437.452330] ======================================================
      [ 3437.452750] WARNING: possible circular locking dependency detected
      [ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G            E
      [ 3437.453562] ------------------------------------------------------
      [ 3437.453981] bcachefs/35533 is trying to acquire lock:
      [ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
      [ 3437.454875]
                     but task is already holding lock:
      [ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.456009]
                     which lock already depends on the new lock.
      
      [ 3437.456553]
                     the existing dependency chain (in reverse order) is:
      [ 3437.457054]
                     -> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
      [ 3437.457507]        down_read+0x3e/0x170
      [ 3437.457772]        bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.458206]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.458498]        do_syscall_64+0x42/0xf0
      [ 3437.458779]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.459155]
                     -> #2 (&c->snapshot_create_lock){++++}-{3:3}:
      [ 3437.459615]        down_read+0x3e/0x170
      [ 3437.459878]        bch2_truncate+0x82/0x110 [bcachefs]
      [ 3437.460276]        bchfs_truncate+0x254/0x3c0 [bcachefs]
      [ 3437.460686]        notify_change+0x1f1/0x4a0
      [ 3437.461283]        do_truncate+0x7f/0xd0
      [ 3437.461555]        path_openat+0xa57/0xce0
      [ 3437.461836]        do_filp_open+0xb4/0x160
      [ 3437.462116]        do_sys_openat2+0x91/0xc0
      [ 3437.462402]        __x64_sys_openat+0x53/0xa0
      [ 3437.462701]        do_syscall_64+0x42/0xf0
      [ 3437.462982]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.463359]
                     -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
      [ 3437.463843]        down_write+0x3b/0xc0
      [ 3437.464223]        bch2_write_iter+0x5b/0xcc0 [bcachefs]
      [ 3437.464493]        vfs_write+0x21b/0x4c0
      [ 3437.464653]        ksys_write+0x69/0xf0
      [ 3437.464839]        do_syscall_64+0x42/0xf0
      [ 3437.465009]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.465231]
                     -> #0 (sb_writers#10){.+.+}-{0:0}:
      [ 3437.465471]        __lock_acquire+0x1455/0x21b0
      [ 3437.465656]        lock_acquire+0xc6/0x2b0
      [ 3437.465822]        mnt_want_write+0x46/0x1a0
      [ 3437.465996]        filename_create+0x62/0x190
      [ 3437.466175]        user_path_create+0x2d/0x50
      [ 3437.466352]        bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.466617]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.466791]        do_syscall_64+0x42/0xf0
      [ 3437.466957]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.467180]
                     other info that might help us debug this:
      
      [ 3437.467507] Chain exists of:
                       sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48
      
      [ 3437.467979]  Possible unsafe locking scenario:
      
      [ 3437.468223]        CPU0                    CPU1
      [ 3437.468405]        ----                    ----
      [ 3437.468585]   rlock(&type->s_umount_key#48);
      [ 3437.468758]                                lock(&c->snapshot_create_lock);
      [ 3437.469030]                                lock(&type->s_umount_key#48);
      [ 3437.469291]   rlock(sb_writers#10);
      [ 3437.469434]
                      *** DEADLOCK ***
      
      [ 3437.469670] 2 locks held by bcachefs/35533:
      [ 3437.469838]  #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
      [ 3437.470294]  #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.470744]
                     stack backtrace:
      [ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G            E      6.7.0-rc7-custom+ #85
      [ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [ 3437.471694] Call Trace:
      [ 3437.471795]  <TASK>
      [ 3437.471884]  dump_stack_lvl+0x57/0x90
      [ 3437.472035]  check_noncircular+0x132/0x150
      [ 3437.472202]  __lock_acquire+0x1455/0x21b0
      [ 3437.472369]  lock_acquire+0xc6/0x2b0
      [ 3437.472518]  ? filename_create+0x62/0x190
      [ 3437.472683]  ? lock_is_held_type+0x97/0x110
      [ 3437.472856]  mnt_want_write+0x46/0x1a0
      [ 3437.473025]  ? filename_create+0x62/0x190
      [ 3437.473204]  filename_create+0x62/0x190
      [ 3437.473380]  user_path_create+0x2d/0x50
      [ 3437.473555]  bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.473819]  ? lock_acquire+0xc6/0x2b0
      [ 3437.474002]  ? __fget_files+0x2a/0x190
      [ 3437.474195]  ? __fget_files+0xbc/0x190
      [ 3437.474380]  ? lock_release+0xc5/0x270
      [ 3437.474567]  ? __x64_sys_ioctl+0x93/0xd0
      [ 3437.474764]  ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
      [ 3437.475090]  __x64_sys_ioctl+0x93/0xd0
      [ 3437.475277]  do_syscall_64+0x42/0xf0
      [ 3437.475454]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.475691] RIP: 0033:0x7f2743c313af
      ======================================================
      
      In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
      and unlock it at the end of the function. There is a comment
      "why do we need this lock?" about the lock, coming from
      commit 42d23732 ("bcachefs: Snapshot creation, deletion").
      The reason is that __bch2_ioctl_subvolume_create() calls
      sync_inodes_sb(), which requires s_umount to be held, to write back
      all dirty inodes before doing the snapshot work.
      
      Fix it by read locking s_umount for snapshotting only and unlocking
      s_umount after sync_inodes_sb().
      Signed-off-by: Su Yue <glass.su@suse.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      2acc59dd
    • Su Yue's avatar
      bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit · 369acf97
      Su Yue authored
      bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut.
      It should be freed by kvfree, not kfree.
      Otherwise umount will trigger:
      
      [  406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
      [  406.830676 ] #PF: supervisor read access in kernel mode
      [  406.831643 ] #PF: error_code(0x0000) - not-present page
      [  406.832487 ] PGD 0 P4D 0
      [  406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI
      [  406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G           OE      6.7.0-rc7-custom+ #90
      [  406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [  406.835796 ] RIP: 0010:kfree+0x62/0x140
      [  406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
      [  406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
      [  406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
      [  406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
      [  406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
      [  406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
      [  406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
      [  406.840451 ] FS:  00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
      [  406.840851 ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
      [  406.841464 ] Call Trace:
      [  406.841583 ]  <TASK>
      [  406.841682 ]  ? __die+0x1f/0x70
      [  406.841828 ]  ? page_fault_oops+0x159/0x470
      [  406.842014 ]  ? fixup_exception+0x22/0x310
      [  406.842198 ]  ? exc_page_fault+0x1ed/0x200
      [  406.842382 ]  ? asm_exc_page_fault+0x22/0x30
      [  406.842574 ]  ? bch2_fs_release+0x54/0x280 [bcachefs]
      [  406.842842 ]  ? kfree+0x62/0x140
      [  406.842988 ]  ? kfree+0x104/0x140
      [  406.843138 ]  bch2_fs_release+0x54/0x280 [bcachefs]
      [  406.843390 ]  kobject_put+0xb7/0x170
      [  406.843552 ]  deactivate_locked_super+0x2f/0xa0
      [  406.843756 ]  cleanup_mnt+0xba/0x150
      [  406.843917 ]  task_work_run+0x59/0xa0
      [  406.844083 ]  exit_to_user_mode_prepare+0x197/0x1a0
      [  406.844302 ]  syscall_exit_to_user_mode+0x16/0x40
      [  406.844510 ]  do_syscall_64+0x4e/0xf0
      [  406.844675 ]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [  406.844907 ] RIP: 0033:0x7f0a2664e4fb
      Signed-off-by: Su Yue <glass.su@suse.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      369acf97
    • Kent Overstreet's avatar
      bcachefs: bios must be 512 byte algined · 00fff4dd
      Kent Overstreet authored
      Fixes: 023f9ac9 bcachefs: Delete dio read alignment check
      Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      00fff4dd
    • Colin Ian King's avatar
      bcachefs: remove redundant variable tmp · aead3428
      Colin Ian King authored
      The variable tmp is being assigned a value but it isn't being
      read afterwards. The assignment is redundant and so tmp can be
      removed.
      
      Cleans up clang scan build warning:
      warning: Although the value stored to 'ret' is used in the enclosing
      expression, the value is never actually read from 'ret'
      [deadcode.DeadStores]
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      aead3428
    • Kent Overstreet's avatar
    • Kent Overstreet's avatar
      bcachefs: Fix excess transaction restarts in __bchfs_fallocate() · 46bf2e9c
      Kent Overstreet authored
      drop_locks_do() should not be used in a fastpath without first trying
      the operation in nonblocking mode - the unlock and relock will cause excessive
      transaction restarts and potentially livelocking with other threads that
      are contending for the same locks.
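
The "try nonblocking first" pattern can be sketched with a small userspace model. Everything here is hypothetical: the `demo_` types stand in for a transaction and an operation, and the counter only records how often the slow path (unlock, block, relock) would have been taken.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transaction: counts how often locks were dropped. */
struct demo_trans {
	unsigned relocks;
};

/* Hypothetical operation: returns -EAGAIN if it would have to block. */
typedef int (*demo_op_fn)(void *ctx, bool nonblock);

static int demo_try_then_drop_locks(struct demo_trans *trans,
				    demo_op_fn op, void *ctx)
{
	/* Fast path: attempt the operation while keeping all locks. */
	int ret = op(ctx, true);
	if (ret != -EAGAIN)
		return ret;

	/* Slow path: a real implementation would unlock here ... */
	ret = op(ctx, false);
	/* ... and relock here, which may force a transaction restart. */
	trans->relocks++;
	return ret;
}

/* Example operation: ctx points to a bool saying whether it would block. */
static int demo_op(void *ctx, bool nonblock)
{
	bool would_block = *(bool *)ctx;
	return (nonblock && would_block) ? -EAGAIN : 0;
}
```

In this model the relock counter only grows when the nonblocking attempt fails, which is the behavior the commit message asks for in a fastpath.
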
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      46bf2e9c
    • Kent Overstreet's avatar
      bcachefs: extents_to_bp_state · 1a503904
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      1a503904
    • Kent Overstreet's avatar
      bcachefs: bkey_and_val_eq() · ba96d36c
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      ba96d36c
    • Kent Overstreet's avatar
      bcachefs: Better journal tracepoints · e6a2566f
      Kent Overstreet authored
      Factor out bch2_journal_bufs_to_text(), and use it in the
      journal_entry_full() tracepoint; when we can't get a journal reservation
      we need to know the outstanding journal entry sizes to know if the
      problem is due to excessive flushing.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      e6a2566f
    • Kent Overstreet's avatar
    • Kent Overstreet's avatar
      bcachefs: Avoid flushing the journal in the discard path · a6548c8b
      Kent Overstreet authored
      When issuing discards, we may need to flush the journal if there are too
      many buckets that can't be discarded until a journal flush.
      
      But the heuristic was bad; we should be comparing the number of buckets
      that need flushes against the number of free buckets, not the number
      of buckets we saw.
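
A rough sketch of the corrected comparison follows. The function name, parameter names, and the exact threshold (a plain "greater than") are assumptions for illustration; the commit message only states which two quantities should be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: flush the journal only when the buckets
 * waiting on a journal flush outnumber the buckets that are free.
 * The exact threshold is an assumption made for this example.
 */
static bool demo_discard_should_flush_journal(uint64_t need_journal_flush,
					      uint64_t nr_free)
{
	return need_journal_flush > nr_free;
}
```

Comparing against free buckets ties the decision to actual space pressure rather than to how many buckets the discard pass happened to scan.
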
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      a6548c8b
    • Kent Overstreet's avatar
      bcachefs: Improve move_extent tracepoint · 189c176c
      Kent Overstreet authored
      Also print out the data_opts, so that we can see what specifically is
      being done to an extent.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      189c176c
    • Kent Overstreet's avatar
      bcachefs: Add missing bch2_moving_ctxt_flush_all() · ef740a1e
      Kent Overstreet authored
      This fixes a bug with rebalance IOs getting stuck with reads completed,
      but writes never being issued.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      ef740a1e