1. 06 Dec, 2019 1 commit
    • dm thin: Flush data device before committing metadata · 694cfe7f
      Nikos Tsironis authored
      The thin provisioning target maintains per thin device mappings that map
      virtual blocks to data blocks in the data device.
      
      When we write to a shared block, in case of internal snapshots, or
      provision a new block, in case of external snapshots, we copy the shared
      block to a new data block (COW), update the mapping for the relevant
      virtual block and then issue the write to the new data block.
      
      Suppose the data device has a volatile write-back cache and the
      following sequence of events occurs:
      
      1. We write to a shared block
      2. A new data block is allocated
      3. We copy the shared block to the new data block using kcopyd (COW)
      4. We insert the new mapping for the virtual block in the btree for that
         thin device.
      5. The commit timeout expires and we commit the metadata, that now
         includes the new mapping from step (4).
      6. The system crashes and the data device's cache has not been flushed,
         meaning that the COWed data are lost.
      
      The next time we read that virtual block of the thin device we read it
      from the data block allocated in step (2), since the metadata have been
      successfully committed. The data are lost due to the crash, so we read
      garbage instead of the old, shared data.
      
      This has the following implications:
      
      1. In case of writes to shared blocks, with size smaller than the pool's
         block size (which means we first copy the whole block and then issue
         the smaller write), we corrupt data that the user never touched.
      
      2. In case of writes to shared blocks, with size equal to the device's
         logical block size, we fail to provide atomic sector writes. When the
         system recovers the user will read garbage from that sector instead
         of the old data or the new data.
      
      3. Even for writes to shared blocks, with size equal to the pool's block
         size (overwrites), after the system recovers, the written sectors
         will contain garbage instead of a random mix of sectors containing
         either old data or new data, thus we fail again to provide atomic
         sector writes.
      
      4. Even when the user flushes the thin device, because we first commit
         the metadata and then pass down the flush, the same risk for
         corruption exists (if the system crashes after the metadata have been
         committed but before the flush is passed down to the data device.)
      
      The only case which is unaffected is that of writes with size equal to
      the pool's block size and with the FUA flag set. But, because FUA writes
      trigger metadata commits, this case can trigger the corruption
      indirectly.
      
      Moreover, apart from internal and external snapshots, the same issue
      exists for newly provisioned blocks, when block zeroing is enabled.
      After the system recovers the provisioned blocks might contain garbage
      instead of zeroes.
      
      To solve this and avoid the potential data corruption we flush the
      pool's data device **before** committing its metadata.
      
      This ensures that the data blocks of any newly inserted mappings are
      properly written to non-volatile storage and won't be lost in case of a
      crash.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      694cfe7f
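The crash sequence in the commit message above can be replayed with a small toy model. This is only a sketch under stated assumptions: `struct cache_dev` and all `cache_*` names are hypothetical, the "device" is four one-byte blocks, and `'?'` stands in for garbage. It is not the dm-thin implementation; it only shows why flushing the data device before committing the mapping changes what a read returns after a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_NR_BLOCKS 4

/* Hypothetical toy data device with a volatile write-back cache. */
struct cache_dev {
    char durable[CACHE_NR_BLOCKS]; /* contents that survive a crash */
    char visible[CACHE_NR_BLOCKS]; /* contents as read while running */
    bool dirty[CACHE_NR_BLOCKS];   /* block currently lives only in the cache */
};

static void cache_write(struct cache_dev *d, int blk, char val)
{
    assert(blk >= 0 && blk < CACHE_NR_BLOCKS);
    d->visible[blk] = val;
    d->dirty[blk] = true;
}

/* Flush: make every cached block durable. */
static void cache_flush(struct cache_dev *d)
{
    for (int i = 0; i < CACHE_NR_BLOCKS; i++) {
        if (d->dirty[i]) {
            d->durable[i] = d->visible[i];
            d->dirty[i] = false;
        }
    }
}

/* Crash: anything that was only in the cache is lost. */
static void cache_crash(struct cache_dev *d)
{
    memcpy(d->visible, d->durable, sizeof(d->visible));
    memset(d->dirty, 0, sizeof(d->dirty));
}

/*
 * Replay steps 1-6 and return what a read of the virtual block sees after
 * recovery. The mapping is modelled as a plain block index.
 */
static char cache_cow_then_crash(bool flush_before_commit)
{
    struct cache_dev d;
    int live_mapping = 0;      /* virtual block -> shared data block 0 */
    int committed_mapping = 0; /* mapping that survives a crash */

    memset(d.durable, '?', sizeof(d.durable)); /* '?' stands for garbage */
    d.durable[0] = 'A';                        /* old shared data, durable */
    memcpy(d.visible, d.durable, sizeof(d.visible));
    memset(d.dirty, 0, sizeof(d.dirty));

    cache_write(&d, 1, 'A');          /* steps 2-3: COW into new block 1 */
    live_mapping = 1;                 /* step 4: insert the new mapping */
    if (flush_before_commit)
        cache_flush(&d);              /* the fix: flush data first */
    committed_mapping = live_mapping; /* step 5: commit the metadata */
    cache_crash(&d);                  /* step 6: crash */

    return d.visible[committed_mapping];
}
```

In this model, `cache_cow_then_crash(false)` reads garbage from the newly mapped block, while `cache_cow_then_crash(true)` reads the old data.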
  2. 05 Dec, 2019 5 commits
    • dm thin metadata: Add support for a pre-commit callback · ecda7c02
      Nikos Tsironis authored
      Add support for one pre-commit callback which is run right before the
      metadata are committed.
      
      This allows the thin provisioning target to run a callback before the
      metadata are committed and is required by the next commit.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ecda7c02
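A single pre-commit hook of the kind described above might look roughly like the following. All names (`struct cb_metadata`, `cb_register_pre_commit`, `cb_commit`, `cb_flush_data_dev`) are hypothetical and are not the dm-thin metadata API. That a failing callback aborts the commit is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, non-zero on failure. */
typedef int (*cb_pre_commit_fn)(void *context);

struct cb_metadata {
    cb_pre_commit_fn pre_commit; /* at most one callback, may be NULL */
    void *pre_commit_context;
    unsigned int nr_commits;     /* stands in for the on-disk commit */
};

static void cb_register_pre_commit(struct cb_metadata *md,
                                   cb_pre_commit_fn fn, void *context)
{
    md->pre_commit = fn;
    md->pre_commit_context = context;
}

/* Run the callback right before committing; assumed to abort on failure. */
static int cb_commit(struct cb_metadata *md)
{
    if (md->pre_commit) {
        int r = md->pre_commit(md->pre_commit_context);
        if (r)
            return r;
    }
    md->nr_commits++;
    return 0;
}

/* Example callback: pretends to flush a data device. */
struct cb_flush_ctx {
    unsigned int nr_flushes;
    int fail; /* set to simulate a flush error */
};

static int cb_flush_data_dev(void *context)
{
    struct cb_flush_ctx *ctx = context;

    assert(ctx != NULL);
    if (ctx->fail)
        return -1;
    ctx->nr_flushes++;
    return 0;
}
```

Registering `cb_flush_data_dev` and then calling `cb_commit()` runs the flush before the commit counter advances.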
    • dm clone: Flush destination device before committing metadata · 8b3fd1f5
      Nikos Tsironis authored
      dm-clone maintains an on-disk bitmap which records which regions are
      valid in the destination device, i.e., which regions have already been
      hydrated, or have been written to directly, via user I/O.
      
      Setting a bit in the on-disk bitmap means the corresponding region is
      valid in the destination device and we redirect all I/O regarding it to
      the destination device.
      
      Suppose the destination device has a volatile write-back cache and the
      following sequence of events occurs:
      
      1. A region gets hydrated, either through the background hydration or
         because it was written to directly, via user I/O.
      
      2. The commit timeout expires and we commit the metadata, marking that
         region as valid in the destination device.
      
      3. The system crashes and the destination device's cache has not been
         flushed, meaning the region's data are lost.
      
      The next time we read that region we read it from the destination
      device, since the metadata have been successfully committed, but the
      data are lost due to the crash, so we read garbage instead of the old
      data.
      
      This has several implications:
      
      1. In case of background hydration or of writes with size smaller than
         the region size (which means we first copy the whole region and then
         issue the smaller write), we corrupt data that the user never
         touched.
      
      2. In case of writes with size equal to the device's logical block size,
         we fail to provide atomic sector writes. When the system recovers the
         user will read garbage from the sector instead of the old data or the
         new data.
      
      3. In case of writes without the FUA flag set, after the system
         recovers, the written sectors will contain garbage instead of a
         random mix of sectors containing either old data or new data, thus we
         fail again to provide atomic sector writes.
      
      4. Even when the user flushes the dm-clone device, because we first
         commit the metadata and then pass down the flush, the same risk for
         corruption exists (if the system crashes after the metadata have been
         committed but before the flush is passed down).
      
      The only case which is unaffected is that of writes with size equal to
      the region size and with the FUA flag set. But, because FUA writes
      trigger metadata commits, this case can trigger the corruption
      indirectly.
      
      To solve this and avoid the potential data corruption we flush the
      destination device **before** committing the metadata.
      
      This ensures that any freshly hydrated regions, for which we commit the
      metadata, are properly written to non-volatile storage and won't be lost
      in case of a crash.
      
      Fixes: 7431b783 ("dm: add clone target")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      8b3fd1f5
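The region bitmap described at the top of this commit message can be pictured as one bit per region that decides where I/O for that region is sent. This is a hypothetical sketch (`struct bm_clone` and the `bm_*` helpers are invented for illustration). Routing reads of unhydrated regions to the source device is this sketch's assumption about the target's behaviour.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BM_NR_REGIONS 64
#define BM_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BM_NR_WORDS ((BM_NR_REGIONS + BM_BITS_PER_WORD - 1) / BM_BITS_PER_WORD)

enum bm_target { BM_SOURCE_DEV, BM_DEST_DEV };

/* One bit per region: set means the region is valid in the destination. */
struct bm_clone {
    unsigned long valid[BM_NR_WORDS];
};

static void bm_set_region_hydrated(struct bm_clone *c, unsigned int region)
{
    assert(region < BM_NR_REGIONS);
    c->valid[region / BM_BITS_PER_WORD] |= 1UL << (region % BM_BITS_PER_WORD);
}

static bool bm_region_hydrated(const struct bm_clone *c, unsigned int region)
{
    assert(region < BM_NR_REGIONS);
    return (c->valid[region / BM_BITS_PER_WORD] >>
            (region % BM_BITS_PER_WORD)) & 1UL;
}

/* Hydrated regions are served by the destination, the rest by the source. */
static enum bm_target bm_map_read(const struct bm_clone *c, unsigned int region)
{
    return bm_region_hydrated(c, region) ? BM_DEST_DEV : BM_SOURCE_DEV;
}
```

Once a bit is committed, reads for that region no longer consult the source device, which is why the destination data must be durable before the bit is.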
    • dm clone metadata: Use a two phase commit · 8fdbfe8d
      Nikos Tsironis authored
      Split the metadata commit in two parts:
      
      1. dm_clone_metadata_pre_commit(): Prepare the current transaction for
         committing. After this is called, all subsequent metadata updates,
         done through either dm_clone_set_region_hydrated() or
         dm_clone_cond_set_range(), will be part of the next transaction.
      
      2. dm_clone_metadata_commit(): Actually commit the current transaction
         to disk and start a new transaction.
      
      This is required by the following commit. It allows dm-clone to flush
      the destination device after step (1) to ensure that all freshly
      hydrated regions, for which we are updating the metadata, are properly
      written to non-volatile storage and won't be lost in case of a crash.
      
      Fixes: 7431b783 ("dm: add clone target")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      8fdbfe8d
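The two-phase split described above can be sketched as follows. The names (`struct tp_md`, `tp_pre_commit`, `tp_commit`) are hypothetical and each transaction is modelled as a 32-bit set of region numbers. The point is only that updates made after the pre-commit step belong to the next transaction, which leaves a window in which the destination device can be flushed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a transaction is a set of region numbers 0..31. */
struct tp_md {
    unsigned int current;    /* regions updated in the open transaction */
    unsigned int committing; /* regions frozen by the pre-commit step */
    unsigned int on_disk;    /* regions recorded as committed */
    bool prepared;           /* pre-commit done, commit still pending */
};

static void tp_set_region_hydrated(struct tp_md *md, unsigned int region)
{
    assert(region < 32);
    md->current |= 1u << region;
}

/* Phase 1: freeze the current transaction; later updates go to the next. */
static void tp_pre_commit(struct tp_md *md)
{
    assert(!md->prepared);
    md->committing = md->current;
    md->current = 0;
    md->prepared = true;
}

/* Phase 2: write the frozen transaction out and start a new one. */
static void tp_commit(struct tp_md *md)
{
    assert(md->prepared);
    md->on_disk |= md->committing;
    md->committing = 0;
    md->prepared = false;
}
```

A caller would run `tp_pre_commit()`, flush the destination device, and only then call `tp_commit()`. A region hydrated between the two calls stays in `current` and is not committed.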
    • dm clone metadata: Track exact changes per transaction · e6a505f3
      Nikos Tsironis authored
      Extend struct dirty_map with a second bitmap which tracks the exact
      regions that were hydrated during the current metadata transaction.
      
      Moreover, fix __flush_dmap() to only commit the metadata of the regions
      that were hydrated during the current transaction.
      
      This is required by the following commits to fix a data corruption bug.
      
      Fixes: 7431b783 ("dm: add clone target")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      e6a505f3
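The idea of a second bitmap that records exactly which regions changed in a transaction can be illustrated like this. The structure and the `dmap_*` functions are hypothetical, and a single `unsigned long` stands in for each bitmap. The "whole word" flush is an assumed contrast case included for comparison; it is not a quote of the previous kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model with one word per bitmap (regions 0..31). */
struct dmap_toy {
    unsigned long region_map;  /* in-core: every region hydrated so far */
    unsigned long txn_regions; /* exact regions hydrated in the transaction
                                  that is about to be flushed */
    unsigned long on_disk;     /* committed bitmap */
};

static void dmap_hydrate(struct dmap_toy *m, unsigned int region,
                         bool in_flushed_txn)
{
    assert(region < 32);
    m->region_map |= 1UL << region;
    if (in_flushed_txn)
        m->txn_regions |= 1UL << region;
}

/* Assumed contrast case: write back the whole in-core word. */
static void dmap_flush_whole_word(struct dmap_toy *m)
{
    m->on_disk = m->region_map;
    m->txn_regions = 0;
}

/* Exact tracking: commit only the regions of this transaction. */
static void dmap_flush_exact(struct dmap_toy *m)
{
    m->on_disk |= m->txn_regions;
    m->txn_regions = 0;
}
```

With exact tracking, a region hydrated after the transaction was frozen is not written to the on-disk bitmap until its own transaction is flushed.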
    • dm btree: increase rebalance threshold in __rebalance2() · 474e5595
      Hou Tao authored
      We got the following warnings from thin_check during thin-pool setup:
      
        $ thin_check /dev/vdb
        examining superblock
        examining devices tree
          missing devices: [1, 84]
            too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
        examining mapping tree
      
      The problem is that the number of entries in one node of the details_info
      tree is less than (max_entries / 3). It can be easily reproduced with the
      following procedure:
      
        1. Create a new thin pool.
        2. Assume the max entries of the details_info tree is 126.
        3. Create 127 thin devices (e.g. 1~127) to fill the root node and make
           it split.
        4. Remove the first 43 thin devices (e.g. 1~43) to make the children
           rebalance repeatedly.
        5. Stop the thin pool.
        6. Run thin_check.
      
      The root cause is that the B-tree removal procedure in __rebalance2()
      doesn't guarantee the invariant that the minimal number of entries in
      a non-root node is >= (max_entries / 3).
      
      Simply fix the problem by increasing the rebalance threshold to
      make sure the number of entries in each child will be greater
      than or equal to (max_entries / 3 + 1), so no matter which
      child is used for removal, the number will still be valid.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      474e5595
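The threshold arithmetic in the commit message above can be checked with a few lines of C. This is a simplified model, not the `__rebalance2()` code: two siblings holding `total` entries are either merged (when `total` is below the threshold) or split evenly, and one entry is then removed from the smaller child. The "old" threshold formula `2 * (max_entries / 3) + 1` is an assumption made for contrast; the new one follows the commit message's requirement that each child keeps at least `max_entries / 3 + 1` entries.

```c
#include <assert.h>

/* Assumed previous threshold: redistribute once total >= 2 * min + 1. */
static unsigned int rb_old_threshold(unsigned int max_entries)
{
    return 2 * (max_entries / 3) + 1;
}

/* Raised threshold: each child must end up with >= min + 1 entries. */
static unsigned int rb_new_threshold(unsigned int max_entries)
{
    return 2 * (max_entries / 3 + 1);
}

/* Siblings with fewer than `threshold` entries in total are merged. */
static int rb_should_merge(unsigned int total, unsigned int threshold)
{
    return total < threshold;
}

/* Smaller child after an even split, then one removal from that child. */
static unsigned int rb_min_after_removal(unsigned int total)
{
    assert(total >= 2);
    return total / 2 - 1;
}

/*
 * Return 1 if every total that is redistributed (not merged) still leaves
 * at least max_entries / 3 entries in the child after one removal.
 */
static int rb_invariant_holds(unsigned int max_entries, unsigned int threshold)
{
    unsigned int min = max_entries / 3;

    for (unsigned int total = threshold; total <= 2 * max_entries; total++) {
        if (!rb_should_merge(total, threshold) &&
            rb_min_after_removal(total) < min)
            return 0;
    }
    return 1;
}
```

For `max_entries = 126` the minimum is 42. Under the assumed old threshold of 85, a total of 85 entries is split 42/43 and a removal leaves 41, matching the thin_check warning. Under the new threshold of 86, the smallest redistributed total gives 43/43 and a removal leaves 42.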
  3. 26 Nov, 2019 2 commits
  4. 20 Nov, 2019 2 commits
  5. 18 Nov, 2019 1 commit
    • dm thin: wakeup worker only when deferred bios exist · d256d796
      Jeffle Xu authored
      A single-threaded fio test (read, bs=4k, ioengine=libaio, iodepth=128,
      numjobs=1) over a dm-thin device performs poorly compared with a bare
      NVMe device.
      
      Further investigation with perf indicates that queue_work_on() consumes
      over 20% CPU time when doing IO over dm-thin device. The call stack is
      as follows.
      
      - 40.57% thin_map
          + 22.07% queue_work_on
          + 9.95% dm_thin_find_block
          + 2.80% cell_defer_no_holder
            1.91% inc_all_io_entry.isra.33.part.34
          + 1.78% bio_detain.isra.35
      
      In cell_defer_no_holder(), wake_worker() is always called, no matter
      whether the tc->deferred_bio_list list is empty or not. In the
      single-threaded IO model this list is most likely empty, so skip waking
      up the worker thread if tc->deferred_bio_list is empty.
      
      Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
      once the needless wake_worker() calls are properly skipped.
      
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      d256d796
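The optimisation described above amounts to checking whether any work was queued before waking the worker. A minimal sketch, with hypothetical `wk_*` names and a counter standing in for the deferred bio list (this is not the dm-thin code):

```c
#include <assert.h>
#include <stddef.h>

struct wk_pool {
    unsigned int nr_wakeups; /* how often the worker was woken */
};

struct wk_thin {
    struct wk_pool *pool;
    unsigned int nr_deferred; /* stands in for the deferred bio list */
};

static void wk_wake_worker(struct wk_pool *pool)
{
    pool->nr_wakeups++;
}

/*
 * Release a cell holding `nr_cell_bios` bios: move them to the deferred
 * list and wake the worker only if that list is not empty afterwards.
 */
static void wk_cell_defer(struct wk_thin *tc, unsigned int nr_cell_bios)
{
    assert(tc != NULL && tc->pool != NULL);
    tc->nr_deferred += nr_cell_bios;
    if (tc->nr_deferred)
        wk_wake_worker(tc->pool);
}
```

Releasing an empty cell leaves `nr_wakeups` untouched, which is the common case in the single-threaded workload from the commit message.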
  6. 15 Nov, 2019 1 commit
    • dm integrity: fix excessive alignment of metadata runs · d537858a
      Mikulas Patocka authored
      Metadata runs are supposed to be aligned on 4k boundary (so that they work
      efficiently with disks with 4k sectors). However, there was a programming
      bug that makes them aligned on 128k boundary instead. The unused space is
      wasted.
      
      Fix this bug by providing a proper 4k alignment. In order to keep
      existing volumes working, we introduce a new flag SB_FLAG_FIXED_PADDING:
      when the flag is clear, we calculate the padding the old way. In order
      to make sure that the old version cannot mount the volume created by the
      new version, we increase superblock version to 4.
      
      Also in order to not break with old integritysetup, we fix alignment
      only if the parameter "fix_padding" is present when formatting the
      device.
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      d537858a
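The cost of aligning to 128k instead of 4k can be seen with a simple round-up calculation in 512-byte sectors. The helpers below are hypothetical and do not reproduce the dm-integrity layout code. They only show how much padding each alignment adds for a run of a given length.

```c
#include <assert.h>
#include <stdint.h>

#define AL_SECTOR_SIZE 512u

/*
 * Round `sectors` up to a multiple of `align_bytes`, which must be a
 * power-of-two multiple of the sector size.
 */
static uint64_t al_round_up_sectors(uint64_t sectors, uint32_t align_bytes)
{
    uint64_t align = align_bytes / AL_SECTOR_SIZE;

    assert(align != 0 && (align & (align - 1)) == 0);
    return (sectors + align - 1) & ~(align - 1);
}

/* Sectors of padding added by the alignment. */
static uint64_t al_padding(uint64_t sectors, uint32_t align_bytes)
{
    return al_round_up_sectors(sectors, align_bytes) - sectors;
}
```

For a 9-sector run, 4k alignment rounds up to 16 sectors (7 sectors of padding), while 128k alignment rounds up to 256 sectors (247 sectors of padding).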
  7. 07 Nov, 2019 2 commits
  8. 05 Nov, 2019 16 commits
  9. 03 Nov, 2019 2 commits
    • Linux 5.4-rc6 · a99d8080
      Linus Torvalds authored
      a99d8080
    • Merge tag 'usb-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 3a69c9e5
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "The USB sub-maintainers woke up this past week and sent a bunch of
        tiny fixes. Here are a lot of small patches that resolve a bunch
        of reported issues in the USB core, drivers, serial drivers, gadget
        drivers, and of course, xhci :)
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (31 commits)
        usb: dwc3: gadget: fix race when disabling ep with cancelled xfers
        usb: cdns3: gadget: Fix g_audio use case when connected to Super-Speed host
        usb: cdns3: gadget: reset EP_CLAIMED flag while unloading
        USB: serial: whiteheat: fix line-speed endianness
        USB: serial: whiteheat: fix potential slab corruption
        USB: gadget: Reject endpoints with 0 maxpacket value
        UAS: Revert commit 3ae62a42 ("UAS: fix alignment of scatter/gather segments")
        usb-storage: Revert commit 747668db ("usb-storage: Set virt_boundary_mask to avoid SG overflows")
        usbip: Fix free of unallocated memory in vhci tx
        usbip: tools: Fix read_usb_vudc_device() error path handling
        usb: xhci: fix __le32/__le64 accessors in debugfs code
        usb: xhci: fix Immediate Data Transfer endianness
        xhci: Fix use-after-free regression in xhci clear hub TT implementation
        USB: ldusb: fix control-message timeout
        USB: ldusb: use unsigned size format specifiers
        USB: ldusb: fix ring-buffer locking
        USB: Skip endpoints with 0 maxpacket length
        usb: cdns3: gadget: Don't manage pullups
        usb: dwc3: remove the call trace of USBx_GFLADJ
        usb: gadget: configfs: fix concurrent issue between composite APIs
        ...
      3a69c9e5
  10. 02 Nov, 2019 8 commits
    • Merge tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6 · 56cfd250
      Linus Torvalds authored
      Pull cifs fix from Steve French:
       "A small smb3 memleak fix"
      
      * tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
        fix memory leak in large read decrypt offload
      56cfd250
    • Merge tag 'hwmon-for-v5.4-rc6' of... · 9d234505
      Linus Torvalds authored
      Merge tag 'hwmon-for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
      
       - Fix read timeout problem in ina3221 driver
      
       - Fix wrong bitmask in nct7904 driver
      
      * tag 'hwmon-for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (ina3221) Fix read timeout issue
        hwmon: (nct7904) Fix the incorrect value of vsen_mask & tcpu_mask & temp_mode in nct7904_data struct.
      9d234505
    • Merge tag 'pwm/for-5.4-rc6' of... · e935842a
      Linus Torvalds authored
      Merge tag 'pwm/for-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
      
      Pull pwm fixes from Thierry Reding:
       "It turned out that relying solely on drivers storing all the PWM state
        in hardware was a little premature and causes a number of subtle (and
        some not so subtle) regressions. Revert the offending patch for now"
      
      * tag 'pwm/for-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
        Revert "pwm: Let pwm_get_state() return the last implemented state"
      e935842a
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · f83e148a
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Nine changes, eight in drivers [ufs, target, lpfc x 2, qla2xxx x 4]
        and one core change in sd that fixes an I/O failure on DIF type 3
        devices"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: stop timer in shutdown path
        scsi: sd: define variable dif as unsigned int instead of bool
        scsi: target: cxgbit: Fix cxgbit_fw4_ack()
        scsi: qla2xxx: Fix partial flash write of MBI
        scsi: qla2xxx: Initialized mailbox to prevent driver load failure
        scsi: lpfc: Honor module parameter lpfc_use_adisc
        scsi: ufs-bsg: Wake the device before sending raw upiu commands
        scsi: lpfc: Check queue pointer before use
        scsi: qla2xxx: fixup incorrect usage of host_byte
      f83e148a
    • Merge tag 'powerpc-5.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 8194c28e
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Our recent cleanup of EEH led to an oops on bare metal machines when
        the cxl (CAPI) driver creates virtual devices for an attached FPGA
        accelerator.
      
        The "secure virtual machine" support we added in v5.4 had a bug if the
        kernel was relocated (moved during boot), in those cases the signature
        of the kernel text wouldn't verify and the Ultravisor would refuse to
        run the VM.
      
        A recent change to disable interrupts before calling
        arch_cpu_idle_dead() caused a WARN_ON() in our bare metal CPU offline
        code to always trigger.
      
        The KUAP (SMAP) support we added for 32-bit Book3S had a bug if the
        address range crossed a segment (256MB) boundary which could lead to
        spurious faults.
      
        Thanks to: Christophe Leroy, Frederic Barrat, Michael Anderson,
        Nicholas Piggin, Sam Bobroff, Thiago Jung Bauermann"
      
      * tag 'powerpc-5.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Fix CPU idle to be called with IRQs disabled
        powerpc/prom_init: Undo relocation before entering secure mode
        powerpc/powernv/eeh: Fix oops when probing cxl devices
        powerpc/32s: fix allow/prevent_user_access() when crossing segment boundaries.
      8194c28e
    • Merge tag 's390-5.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 969a5197
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix cpu idle time accounting
      
       - Fix stack unwinder case when both pt_regs and sp are specified
      
       - Fix information leak via cmm timeout proc handler
      
      * tag 's390-5.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/idle: fix cpu idle time calculation
        s390/unwind: fix mixing regs and sp
        s390/cmm: fix information leak in cmm_timeout_handler()
      969a5197
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 1204c70d
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix free/alloc races in batmanadv, from Sven Eckelmann.
      
       2) Several leaks and other fixes in kTLS support of mlx5 driver, from
          Tariq Toukan.
      
       3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke
          Høiland-Jørgensen.
      
       4) Add an r8152 device ID, from Kazutoshi Noguchi.
      
       5) Missing include in ipv6's addrconf.c, from Ben Dooks.
      
       6) Use siphash in flow dissector, from Eric Dumazet. Attackers can
          easily infer the 32-bit secret otherwise etc.
      
       7) Several netdevice nesting depth fixes from Taehee Yoo.
      
       8) Fix several KCSAN reported errors, from Eric Dumazet. For example,
          when doing lockless skb_queue_empty() checks, and accessing
          sk_napi_id/sk_incoming_cpu lockless as well.
      
       9) Fix jumbo packet handling in RXRPC, from David Howells.
      
      10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet.
      
      11) Fix DMA synchronization in gve driver, from Yangchun Fu.
      
      12) Several bpf offload fixes, from Jakub Kicinski.
      
      13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo.
      
      14) Fix ping latency during high traffic rates in hisilicon driver, from
          Jiangfent Xiao.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
        net: fix installing orphaned programs
        net: cls_bpf: fix NULL deref on offload filter removal
        selftests: bpf: Skip write only files in debugfs
        selftests: net: reuseport_dualstack: fix uninitalized parameter
        r8169: fix wrong PHY ID issue with RTL8168dp
        net: dsa: bcm_sf2: Fix IMP setup for port different than 8
        net: phylink: Fix phylink_dbg() macro
        gve: Fixes DMA synchronization.
        inet: stop leaking jiffies on the wire
        ixgbe: Remove duplicate clear_bit() call
        Documentation: networking: device drivers: Remove stray asterisks
        e1000: fix memory leaks
        i40e: Fix receive buffer starvation for AF_XDP
        igb: Fix constant media auto sense switching when no cable is connected
        net: ethernet: arc: add the missed clk_disable_unprepare
        igb: Enable media autosense for the i350.
        igb/igc: Don't warn on fatal read failures when the device is removed
        tcp: increase tcp_max_syn_backlog max value
        net: increase SOMAXCONN to 4096
        netdevsim: Fix use-after-free during device dismantle
        ...
      1204c70d
    • Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs · 372bf6c1
      Linus Torvalds authored
      Pull NFS client bugfixes from Anna Schumaker:
       "This contains two delegation fixes (with the RCU lock leak fix marked
        for stable), and three patches to fix destroying the sunrpc back
        channel.
      
        Stable bugfixes:
      
         - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
      
        Other fixes:
      
         - The TCP back channel mustn't disappear while requests are
           outstanding
      
         - The RDMA back channel mustn't disappear while requests are
           outstanding
      
         - Destroy the back channel when we destroy the host transport
      
         - Don't allow a cached open with a revoked delegation"
      
      * tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
        NFSv4: Don't allow a cached open with a revoked delegation
        SUNRPC: Destroy the back channel when we destroy the host transport
        SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
        SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
      372bf6c1