1. 27 Jul, 2017 40 commits
    • Devin Heitmueller's avatar
      mxl111sf: Fix driver to use heap allocate buffers for USB messages · 2f110d39
      Devin Heitmueller authored
      commit d90b336f upstream.
      
      The recent changes in 4.9 to mandate USB buffers be heap allocated
      broke this driver, which was allocating the buffers on the stack.
      This resulted in the device failing at initialization.
      
      Introduce dedicated send/receive buffers as part of the state
      structure, and add a mutex to protect access to them.
      
      Note: we also had to tweak the API to mxl111sf_ctrl_msg to pass
      the pointer to the state struct rather than the device, since
      we need it inside the function to access the buffers and the
      mutex.  This patch adjusts the callers to match the API change.
      Signed-off-by: default avatarDevin Heitmueller <dheitmueller@kernellabs.com>
      Reported-by: default avatarDoug Lung <dlung0@gmail.com>
      Cc: Michael Ira Krufky <mkrufky@linuxtv.org>
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@s-opensource.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f110d39
    • Jiahau Chang's avatar
      xhci: Bad Ethernet performance plugged in ASM1042A host · 5cc9b698
      Jiahau Chang authored
      commit 9da5a109 upstream.
      
      When USB Ethernet is plugged in ASMEDIA ASM1042A xHCI host, bad
      performance was manifesting in Web browser use (like download
      large file such as ISO image). It is known limitation of
      ASM1042A that is not compatible with driver scheduling,
      As a workaround we can modify flow control handling of ASM1042A.
      The register we modify is changes the behavior
      
      [use quirk bit 28, usleep_range 40-60us, empty non-pci function -Mathias]
      Signed-off-by: default avatarJiahau Chang <Lars_chang@asmedia.com.tw>
      Signed-off-by: default avatarIan Pilcher <arequipeno@gmail.com>
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5cc9b698
    • Mathias Nyman's avatar
      xhci: Fix NULL pointer dereference when cleaning up streams for removed host · 7b7a1f02
      Mathias Nyman authored
      commit 4b895868 upstream.
      
      This off by one in stream_id indexing caused NULL pointer dereference and
      soft lockup on machines with USB attached SCSI devices connected to a
      hotpluggable xhci controller.
      
      The code that cleans up pending URBs for dead hosts tried to dereference
      a stream ring at the invalid stream_id 0.
      ep->stream_info->stream_rings[0] doesn't point to a ring.
      
      Start looping stream_id from 1 like in all the other places in the driver,
      and check that the ring exists before trying to kill URBs on it.
      Reported-by: default avatarrocko r <rockorequin@gmail.com>
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b7a1f02
    • Mathias Nyman's avatar
      xhci: fix 20000ms port resume timeout · be67d4af
      Mathias Nyman authored
      commit a54408d0 upstream.
      
      A uncleared PLC (port link change) bit will prevent furuther port event
      interrupts for that port. Leaving it uncleared caused get_port_status()
      to timeout after 20000ms while waiting to get the final port event
      interrupt for resume -> U0 state change.
      
      This is a targeted fix for a specific case where we get a port resume event
      racing with xhci resume. The port event interrupt handler notices xHC is
      not yet running and bails out early, leaving PLC uncleared.
      
      The whole xhci port resuming needs more attention, but while working on it
      it anyways makes sense to always ensure PLC is cleared in get_port_status
      before setting a new link state and waiting for its completion.
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be67d4af
    • Shu Wang's avatar
      xhci: fix memleak in xhci_run() · 01c5b393
      Shu Wang authored
      commit d6f5f071 upstream.
      
      Found this issue by kmemleak.
      xhci_run() did not check return val and free command for
      xhci_queue_vendor_command()
      
      unreferenced object 0xffff88011c0be500 (size 64):
        comm "kworker/0:1", pid 58, jiffies 4294670908 (age 50.420s)
        hex dump (first 32 bytes):
        backtrace:
          [<ffffffff8176166a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff8121801a>] kmem_cache_alloc_trace+0xca/0x1d0
          [<ffffffff81576bf4>] xhci_alloc_command+0x44/0x130
          [<ffffffff8156f1cc>] xhci_run+0x4cc/0x630
          [<ffffffff8153b84b>] usb_add_hcd+0x3bb/0x950
          [<ffffffff8154eac8>] usb_hcd_pci_probe+0x188/0x500
          [<ffffffff815851ac>] xhci_pci_probe+0x2c/0x220
          [<ffffffff813d2ca5>] local_pci_probe+0x45/0xa0
          [<ffffffff810a54e4>] work_for_cpu_fn+0x14/0x20
          [<ffffffff810a8409>] process_one_work+0x149/0x360
          [<ffffffff810a8d08>] worker_thread+0x1d8/0x3c0
          [<ffffffff810ae7d9>] kthread+0x109/0x140
          [<ffffffff8176d585>] ret_from_fork+0x25/0x30
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: default avatarShu Wang <shuwang@redhat.com>
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01c5b393
    • Peter Chen's avatar
      usb: xhci: fix spinlock recursion for USB2 test mode · 9c97237c
      Peter Chen authored
      commit 576d5546 upstream.
      
      Both xhci_hub_control and xhci_disable_slot tries to hold spinlock, the
      spinlock recursion occurs when enters USB2 test mode. Fix it by unlock
      spinlock before calling xhci_disable_slot.
      
      Fixes: 0f1d832e ("usb: xhci: Add port test modes support for usb2")
      Signed-off-by: default avatarPeter Chen <peter.chen@nxp.com>
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c97237c
    • Michael Hernandez's avatar
      PCI/MSI: Ignore affinity if pre/post vector count is more than min_vecs · 07c79fd9
      Michael Hernandez authored
      commit 6f9a22bc upstream.
      
      min_vecs is the minimum amount of vectors needed to operate in MSI-X mode
      which may just include the vectors that don't need affinity.
      
      Disabling affinity settings causes the qla2xxx driver scsi_add_host() to fail
      when blk_mq is enabled as the blk_mq_pci_map_queues() expects affinity masks
      on each vector.
      
      Fixes: dfef358b ("PCI/MSI: Don't apply affinity if there aren't enough vectors left")
      Signed-off-by: default avatarMichael Hernandez <michael.hernandez@cavium.com>
      Signed-off-by: default avatarHimanshu Madhani <himanshu.madhani@cavium.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07c79fd9
    • Chen Yu's avatar
      PCI/PM: Restore the status of PCI devices across hibernation · c1ead164
      Chen Yu authored
      commit e60514bd upstream.
      
      Currently we saw a lot of "No irq handler" errors during hibernation, which
      caused the system hang finally:
      
        ata4.00: qc timeout (cmd 0xec)
        ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
        ata4.00: revalidation failed (errno=-5)
        ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
        do_IRQ: 31.151 No irq handler for vector
      
      According to above logs, there is an interrupt triggered and it is
      dispatched to CPU31 with a vector number 151, but there is no handler for
      it, thus this IRQ will not get acked and will cause an IRQ flood which
      kills the system.  To be more specific, the 31.151 is an interrupt from the
      AHCI host controller.
      
      After some investigation, the reason why this issue is triggered is because
      the thaw_noirq() function does not restore the MSI/MSI-X settings across
      hibernation.
      
      The scenario is illustrated below:
      
        1. Before hibernation, IRQ 34 is the handler for the AHCI device, which
           is bound to CPU31.
      
        2. Hibernation starts, the AHCI device is put into low power state.
      
        3. All the nonboot CPUs are put offline, so IRQ 34 has to be migrated to
           the last alive one - CPU0.
      
        4. After the snapshot has been created, all the nonboot CPUs are brought
           up again; IRQ 34 remains bound to CPU0.
      
        5. AHCI devices are put into D0.
      
        6. The snapshot is written to the disk.
      
      The issue is triggered in step 6.  The AHCI interrupt should be delivered
      to CPU0, however it is delivered to the original CPU31 instead, which
      causes the "No irq handler" issue.
      
      Ying Huang has provided a clue that, in step 3 it is possible that writing
      to the register might not take effect as the PCI devices have been
      suspended.
      
      In step 3, the IRQ 34 affinity should be modified from CPU31 to CPU0, but
      in fact it is not.  In __pci_write_msi_msg(), if the device is already in
      low power state, the low level MSI message entry will not be updated but
      cached.  During the device restore process after a normal suspend/resume,
      pci_restore_msi_state() writes the cached MSI back to the hardware.
      
      But this is not the case for hibernation.  pci_restore_msi_state() is not
      currently called in pci_pm_thaw_noirq(), although pci_save_state() has
      saved the necessary PCI cached information in pci_pm_freeze_noirq().
      
      Restore the PCI status for the device during hibernation.  Otherwise the
      status might be lost across hibernation (for example, settings for MSI,
      MSI-X, ATS, ACS, IOV, etc.), which might cause problems during hibernation.
      Suggested-by: default avatarYing Huang <ying.huang@intel.com>
      Suggested-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
      [bhelgaas: changelog]
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Rui Zhang <rui.zhang@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c1ead164
    • Shawn Lin's avatar
      PCI: rockchip: Use normal register bank for config accessors · 18fc66a8
      Shawn Lin authored
      commit dc8cca5e upstream.
      
      Rockchip's RC has two banks of registers for the root port: a normal bank
      that is strictly compatible with the PCIe spec, and a privileged bank that
      can be used to change RO bits of root port registers.
      
      When probing the RC driver, we use the privileged bank to do some basic
      setup work as some RO bits are hw-inited to wrong value.  But we didn't
      change to the normal bank after probing the driver.
      
      This leads to a serious problem when the PME code tries to clear the PME
      status by writing PCI_EXP_RTSTA_PME to the register of PCI_EXP_RTSTA.  Per
      PCIe 3.0 spec, section 7.8.14, the PME status bit is RW1C.  So the PME code
      is doing the right thing to clear the PME status but we find the RC doesn't
      clear it but actually setting it to one.  So finally the system trap in
      pcie_pme_work_fn() as PCI_EXP_RTSTA_PME is true now forever.  This issue
      can be reproduced by booting kernel with pci=nomsi.
      
      Use the normal register bank for the PCI config accessors.  The privileged
      bank is used only internally by this driver.
      
      Fixes: e77f847d ("PCI: rockchip: Add Rockchip PCIe controller support")
      Signed-off-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Cc: Jeffy Chen <jeffy.chen@rock-chips.com>
      Cc: Brian Norris <briannorris@chromium.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18fc66a8
    • Bjorn Helgaas's avatar
      PCI: Work around poweroff & suspend-to-RAM issue on Macbook Pro 11 · 31f8d306
      Bjorn Helgaas authored
      commit 13cfc732 upstream.
      
      Neither soft poweroff (transition to ACPI power state S5) nor
      suspend-to-RAM (transition to state S3) works on the Macbook Pro 11,4 and
      11,5.
      
      The problem is related to the [mem 0x7fa00000-0x7fbfffff] space.  When we
      use that space, e.g., by assigning it to the 00:1c.0 Root Port, the ACPI
      Power Management 1 Control Register (PM1_CNT) at [io 0x1804] doesn't work
      anymore.
      
      Linux does a soft poweroff (transition to S5) by writing to PM1_CNT.  The
      theory about why this doesn't work is:
      
        - The write to PM1_CNT causes an SMI
        - The BIOS SMI handler depends on something in
          [mem 0x7fa00000-0x7fbfffff]
        - When Linux assigns [mem 0x7fa00000-0x7fbfffff] to the 00:1c.0 Port, it
          covers up whatever the SMI handler uses, so the SMI handler no longer
          works correctly
      
      Reserve the [mem 0x7fa00000-0x7fbfffff] space so we don't assign it to
      anything.
      
      This is voodoo programming, since we don't know what the real conflict is,
      but we've failed to find the root cause.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=103211
      Tested-by: thejoe@gmail.com
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Lukas Wunner <lukas@wunner.de>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      31f8d306
    • Jon Derrick's avatar
      PCI: vmd: Move SRCU cleanup after bus, child device removal · 2e09bcd1
      Jon Derrick authored
      commit 0cb259c4 upstream.
      
      Recent __call_srcu() changes have exposed that we need to cleanup SRCU
      structures after pci_stop_root_bus() calls into vmd_msi_free().
      
      Fixes: 3906b918 ("PCI: vmd: Use SRCU as a local RCU to prevent delaying global RCU")
      Signed-off-by: default avatarJon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: default avatarKeith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e09bcd1
    • Juergen Gross's avatar
      xen/x86: fix cpu hotplug · 2614198e
      Juergen Gross authored
      commit c185ddec upstream.
      
      Commit dc6416f1 ("xen/x86: Call
      cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead()")
      introduced an error leading to a stack overflow of the idle task when
      a cpu was brought offline/online many times: by calling
      cpu_startup_entry() instead of returning at the end of xen_play_dead()
      do_idle() would be entered again and again.
      
      Don't use cpu_startup_entry(), but cpuhp_online_idle() instead allowing
      to return from xen_play_dead().
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2614198e
    • Madhavan Srinivasan's avatar
      powerpc/perf: Fix SDAR_MODE value for continous sampling on Power9 · 4fa26aab
      Madhavan Srinivasan authored
      commit 20dd4c62 upstream.
      
      In case of continous sampling (non-marked), the code currently
      sets MMCRA[SDAR_MODE] to 0b01 (Update on TLB miss) for Power9 DD1.
      
      On DD2 and later it copies the sdar_mode value from the event code,
      which for most events is 0b00 (No updates).
      
      However we must set a non-zero value for SDAR_MODE when doing
      continuous sampling, so honor the event code, unless it's zero, in
      which case we use use 0b01 (Update on TLB miss).
      
      Fixes: 78b4416a ("powerpc/perf: Handle sdar_mode for marked event in power9")
      Signed-off-by: default avatarMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fa26aab
    • Benjamin Herrenschmidt's avatar
      powerpc/mm/radix: Properly clear process table entry · 27444f65
      Benjamin Herrenschmidt authored
      commit c6bb0b8d upstream.
      
      On radix, the process table entry we want to clear when destroying a
      context is entry 0, not entry 1. This has no *immediate* consequence
      on Power9, but it can cause other bugs to become worse.
      
      Fixes: 7e381c0f ("powerpc/mm/radix: Add mmu context handling callback for radix")
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      27444f65
    • Oliver O'Halloran's avatar
      powerpc/asm: Mark cr0 as clobbered in mftb() · e6c8bf1d
      Oliver O'Halloran authored
      commit 2400fd82 upstream.
      
      The workaround for the CELL timebase bug does not correctly mark cr0 as
      being clobbered. This means GCC doesn't know that the asm block changes cr0 and
      might leave the result of an unrelated comparison in cr0 across the block, which
      we then trash, leading to basically random behaviour.
      
      Fixes: 859deea9 ("[POWERPC] Cell timebase bug workaround")
      Signed-off-by: default avatarOliver O'Halloran <oohall@gmail.com>
      [mpe: Tweak change log and flag for stable]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6c8bf1d
    • Anton Blanchard's avatar
      powerpc: Fix emulation of mfocrf in emulate_step() · 0b422679
      Anton Blanchard authored
      commit 64e756c5 upstream.
      
      From POWER4 onwards, mfocrf() only places the specified CR field into
      the destination GPR, and the rest of it is set to 0. The PowerPC AS
      from version 3.0 now requires this behaviour.
      
      The emulation code currently puts the entire CR into the destination GPR.
      Fix it.
      
      Fixes: 6888199f ("[POWERPC] Emulate more instructions in software")
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Acked-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0b422679
    • Anton Blanchard's avatar
      powerpc: Fix emulation of mcrf in emulate_step() · 9c4614c4
      Anton Blanchard authored
      commit 87c4b83e upstream.
      
      The mcrf emulation code was using the CR field number directly as the shift
      value, without taking into account that CR fields are numbered from 0-7 starting
      at the high bits. That meant it was looking at the CR fields in the reverse
      order.
      
      Fixes: cf87c3f6 ("powerpc: Emulate icbi, mcrf and conditional-trap instructions")
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Acked-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c4614c4
    • Michael Ellerman's avatar
      powerpc/64: Fix atomic64_inc_not_zero() to return an int · 1e605f20
      Michael Ellerman authored
      commit 01e6a61a upstream.
      
      Although it's not documented anywhere, there is an expectation that
      atomic64_inc_not_zero() returns a result which fits in an int. This is
      the behaviour implemented on all arches except powerpc.
      
      This has caused at least one bug in practice, in the percpu-refcount
      code, where the long result from our atomic64_inc_not_zero() was
      truncated to an int leading to lost references and stuck systems. That
      was worked around in that code in commit 966d2b04 ("percpu-refcount:
      fix reference leak during percpu-atomic transition").
      
      To the best of my grepping abilities there are no other callers
      in-tree which truncate the value, but we should fix it anyway. Because
      the breakage is subtle and potentially very harmful I'm also tagging
      it for stable.
      
      Code generation is largely unaffected because in most cases the
      callers are just using the result for a test anyway. In particular the
      case of fget() that was mentioned in commit a6cf7ed5
      ("powerpc/atomic: Implement atomic*_inc_not_zero") generates exactly
      the same code.
      
      Fixes: a6cf7ed5 ("powerpc/atomic: Implement atomic*_inc_not_zero")
      Noticed-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e605f20
    • Balbir Singh's avatar
      powerpc/mm/radix: Fix execute permissions for interrupt_vectors · 13f46943
      Balbir Singh authored
      commit 7f6d498e upstream.
      
      Commit 9abcc981 ("powerpc/mm/radix: Only add X for pages
      overlapping kernel text") changed the linear mapping on Radix to only
      mark the kernel text executable.
      
      However if the kernel is run relocated, for example as a kdump kernel,
      then the exception vectors are split from the kernel text, ie. they
      remain at real address 0.
      
      We tend to get away with it, because the kernel itself will usually be
      below 1G, which means the 1G page at 0-1G is marked executable and
      everything works OK. However if the kernel is loaded above 1G, or the
      system has less than 1G in total (meaning we can't use a 1G page),
      then the exception vectors will not be marked executable and the
      kernel will fail to boot.
      
      Fix it by also checking if the address range overlaps the exception
      vectors when deciding if we should add PAGE_KERNEL_X.
      
      Fixes: 9abcc981 ("powerpc/mm/radix: Only add X for pages overlapping kernel text")
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      [mpe: Combine with the existing check, rewrite change log]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      13f46943
    • Balbir Singh's avatar
      powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp() · f349607b
      Balbir Singh authored
      commit e71ff982 upstream.
      
      Once upon a time there were only two PP (page protection) bits. In ISA
      2.03 an additional PP bit was added, but because of the layout of the
      HPTE it could not be made contiguous with the existing PP bits.
      
      The result is that we now have three PP bits, named pp0, pp1, pp2,
      where pp0 occupies bit 63 of dword 1 of the HPTE and pp1 and pp2
      occupy bits 1 and 0 respectively. Until recently Linux hasn't used
      pp0, however with the addition of _PAGE_KERNEL_RO we started using it.
      
      The problem arises in the LPAR code, where we need to translate the PP
      bits into the argument for the H_PROTECT hypercall. Currently the code
      only passes bits 0-2 of newpp, which covers pp1, pp2 and N (no
      execute), meaning pp0 is not passed to the hypervisor at all.
      
      We can't simply pass it through in bit 63, as that would collide with a
      different field in the flags argument, as defined in PAPR. Instead we
      have to shift it down to bit 8 (IBM bit 55).
      
      Fixes: e58e87ad ("powerpc/mm: Update _PAGE_KERNEL_RO")
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      [mpe: Simplify the test, rework change log]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f349607b
    • Michael Ellerman's avatar
      powerpc/mm/radix: Only add X for pages overlapping kernel text · cab0b22b
      Michael Ellerman authored
      commit 9abcc981 upstream.
      
      Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we
      should check if the page overlaps the kernel text and only then add
      PAGE_KERNEL_X.
      
      Note that we still use 1G pages if they're available, so this will
      typically still result in a 1G executable page at KERNELBASE. So this fix is
      primarily useful for catching stray branches to high linear mapping addresses.
      
      Without this patch, we can execute at 1G in xmon using:
      
        0:mon> m c000000040000000
        c000000040000000  00 l
        c000000040000000  00000000 01006038
        c000000040000004  00000000 2000804e
        c000000040000008  00000000 x
        0:mon> di c000000040000000
        c000000040000000  38600001      li      r3,1
        c000000040000004  4e800020      blr
        0:mon> p c000000040000000
        return value is 0x1
      
      After we get a 400 as expected:
      
        0:mon> p c000000040000000
        *** 400 exception occurred
      
      Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cab0b22b
    • Paolo Bonzini's avatar
      scsi: virtio_scsi: always read VPD pages for multiqueue too · db367439
      Paolo Bonzini authored
      commit a680f1d4 upstream.
      
      Multi-queue virtio-scsi uses a different scsi_host_template struct.  Add
      the .device_alloc field there, too.
      
      Fixes: 25d1d50e
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: default avatarFam Zheng <famz@redhat.com>
      Reviewed-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      db367439
    • Bart Van Assche's avatar
      xen/scsiback: Fix a TMR related use-after-free · 26e21de7
      Bart Van Assche authored
      commit 9f4ab18a upstream.
      
      scsiback_release_cmd() must not dereference se_cmd->se_tmr_req
      because that memory is freed by target_free_cmd_mem() before
      scsiback_release_cmd() is called. Fix this use-after-free by
      inlining struct scsiback_tmr into struct vscsibk_pend.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: xen-devel@lists.xenproject.org
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      26e21de7
    • Nicholas Bellinger's avatar
      iscsi-target: Add login_keys_workaround attribute for non RFC initiators · 430c78c7
      Nicholas Bellinger authored
      commit 138d351e upstream.
      
      This patch re-introduces part of a long standing login workaround that
      was recently dropped by:
      
        commit 1c99de98
        Author: Nicholas Bellinger <nab@linux-iscsi.org>
        Date:   Sun Apr 2 13:36:44 2017 -0700
      
            iscsi-target: Drop work-around for legacy GlobalSAN initiator
      
      Namely, the workaround for FirstBurstLength ended up being required by
      Mellanox Flexboot PXE boot ROMs as reported by Robert.
      
      So this patch re-adds the work-around for FirstBurstLength within
      iscsi_check_proposer_for_optional_reply(), and makes the key optional
      to respond when the initiator does not propose, nor respond to it.
      
      Also as requested by Arun, this patch introduces a new TPG attribute
      named 'login_keys_workaround' that controls the use of both the
      FirstBurstLength workaround, as well as the two other existing
      workarounds for gPXE iSCSI boot client.
      
      By default, the workaround is enabled with login_keys_workaround=1,
      since Mellanox FlexBoot requires it, and Arun has verified the Qlogic
      MSFT initiator already proposes FirstBurstLength, so it's uneffected
      by this re-adding this part of the original work-around.
      Reported-by: default avatarRobert LeBlanc <robert@leblancnet.us>
      Cc: Robert LeBlanc <robert@leblancnet.us>
      Reviewed-by: default avatarArun Easi <arun.easi@cavium.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      430c78c7
    • Bart Van Assche's avatar
      scsi: Avoid that scsi_exit_rq() triggers a use-after-free · bbfbcfa3
      Bart Van Assche authored
      commit 8e688254 upstream.
      
      Dereferencing shost from scsi_exit_rq() is not safe because the SCSI
      host may already have been freed when scsi_exit_rq() is called.
      Increasing the shost reference count in scsi_init_rq() and dropping that
      reference in scsi_exit_rq() is nontrivial since scsi_host_dev_release()
      may sleep and since scsi_exit_rq() may be called from interrupt
      context. Since scsi_exit_rq() only needs a single bit from shost, copy
      that bit into struct scsi_cmnd.
      Reported-by: default avatarScott Bauer <scott.bauer@intel.com>
      Fixes: e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Scott Bauer <scott.bauer@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bbfbcfa3
    • Ewan D. Milne's avatar
      scsi: Add STARGET_CREATED_REMOVE state to scsi_target_state · d5ec2793
      Ewan D. Milne authored
      commit f9279c96 upstream.
      
      The addition of the STARGET_REMOVE state had the side effect of
      introducing a race condition that can cause a crash.
      
      scsi_target_reap_ref_release() checks the starget->state to
      see if it still in STARGET_CREATED, and if so, skips calling
      transport_remove_device() and device_del(), because the starget->state
      is only set to STARGET_RUNNING after scsi_target_add() has called
      device_add() and transport_add_device().
      
      However, if an rport loss occurs while a target is being scanned,
      it can happen that scsi_remove_target() will be called while the
      starget is still in the STARGET_CREATED state.  In this case, the
      starget->state will be set to STARGET_REMOVE, and as a result,
      scsi_target_reap_ref_release() will take the wrong path.  The end
      result is a panic:
      
      [ 1255.356653] Oops: 0000 [#1] SMP
      [ 1255.360154] Modules linked in: x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel ghash_clmulni_i
      [ 1255.393234] CPU: 5 PID: 149 Comm: kworker/u96:4 Tainted: G        W       4.11.0+ #8
      [ 1255.401879] Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.0.22 11/19/2013
      [ 1255.410327] Workqueue: scsi_wq_6 fc_scsi_scan_rport [scsi_transport_fc]
      [ 1255.417720] task: ffff88060ca8c8c0 task.stack: ffffc900048a8000
      [ 1255.424331] RIP: 0010:kernfs_find_ns+0x13/0xc0
      [ 1255.429287] RSP: 0018:ffffc900048abbf0 EFLAGS: 00010246
      [ 1255.435123] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [ 1255.443083] RDX: 0000000000000000 RSI: ffffffff8188d659 RDI: 0000000000000000
      [ 1255.451043] RBP: ffffc900048abc10 R08: 0000000000000000 R09: 0000012433fe0025
      [ 1255.459005] R10: 0000000025e5a4b5 R11: 0000000025e5a4b5 R12: ffffffff8188d659
      [ 1255.466972] R13: 0000000000000000 R14: ffff8805f55e5088 R15: 0000000000000000
      [ 1255.474931] FS:  0000000000000000(0000) GS:ffff880616b40000(0000) knlGS:0000000000000000
      [ 1255.483959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1255.490370] CR2: 0000000000000068 CR3: 0000000001c09000 CR4: 00000000000406e0
      [ 1255.498332] Call Trace:
      [ 1255.501058]  kernfs_find_and_get_ns+0x31/0x60
      [ 1255.505916]  sysfs_unmerge_group+0x1d/0x60
      [ 1255.510498]  dpm_sysfs_remove+0x22/0x60
      [ 1255.514783]  device_del+0xf4/0x2e0
      [ 1255.518577]  ? device_remove_file+0x19/0x20
      [ 1255.523241]  attribute_container_class_device_del+0x1a/0x20
      [ 1255.529457]  transport_remove_classdev+0x4e/0x60
      [ 1255.534607]  ? transport_add_class_device+0x40/0x40
      [ 1255.540046]  attribute_container_device_trigger+0xb0/0xc0
      [ 1255.546069]  transport_remove_device+0x15/0x20
      [ 1255.551025]  scsi_target_reap_ref_release+0x25/0x40
      [ 1255.556467]  scsi_target_reap+0x2e/0x40
      [ 1255.560744]  __scsi_scan_target+0xaa/0x5b0
      [ 1255.565312]  scsi_scan_target+0xec/0x100
      [ 1255.569689]  fc_scsi_scan_rport+0xb1/0xc0 [scsi_transport_fc]
      [ 1255.576099]  process_one_work+0x14b/0x390
      [ 1255.580569]  worker_thread+0x4b/0x390
      [ 1255.584651]  kthread+0x109/0x140
      [ 1255.588251]  ? rescuer_thread+0x330/0x330
      [ 1255.592730]  ? kthread_park+0x60/0x60
      [ 1255.596815]  ret_from_fork+0x29/0x40
      [ 1255.600801] Code: 24 08 48 83 42 40 01 5b 41 5c 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
      [ 1255.621876] RIP: kernfs_find_ns+0x13/0xc0 RSP: ffffc900048abbf0
      [ 1255.628479] CR2: 0000000000000068
      [ 1255.632756] ---[ end trace 34a69ba0477d036f ]---
      
      Fix this by adding another scsi_target state STARGET_CREATED_REMOVE
      to distinguish this case.
      
      Fixes: f05795d3 ("scsi: Add intermediate STARGET_REMOVE state to scsi_target_state")
      Reported-by: default avatarDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: default avatarEwan D. Milne <emilne@redhat.com>
      Reviewed-by: default avatarLaurence Oberman <loberman@redhat.com>
      Tested-by: default avatarLaurence Oberman <loberman@redhat.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5ec2793
    • Quinn Tran's avatar
      scsi: qla2xxx: Allow ABTS, PURX, RIDA on ATIOQ for ISP83XX/27XX · 1dfabed3
      Quinn Tran authored
      commit 3c4810ff upstream.
      
      Driver added mechanism to move ABTS/PUREX/RIDA mailbox to
      ATIO queue as part of commit id 41dc529a
      ("qla2xxx: Improve RSCN handling in driver").
      
      This patch adds a check to only allow ABTS/PURX/RIDA
      to be moved to ATIO Queue for ISP83XX and ISP27XX.
      Signed-off-by: default avatarQuinn Tran <quinn.tran@cavium.com>
      Signed-off-by: default avatarHimanshu Madhani <himanshu.madhani@cavium.com>
      Reviewed-by: default avatarBart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1dfabed3
    • Paolo Bonzini's avatar
      scsi: virtio_scsi: let host do exception handling · 2ced4c4c
      Paolo Bonzini authored
      commit e72c9a2a upstream.
      
      virtio_scsi tries to do exception handling after the default 30 seconds
      timeout expires.  However, it's better to let the host control the
      timeout, otherwise with a heavy I/O load it is likely that an abort will
      also timeout.  This leads to fatal errors like filesystems going
      offline.
      
      Disable the 'sd' timeout and allow the host to do exception handling,
      following the precedent of the storvsc driver.
      
      Hannes has a proposal to introduce timeouts in virtio, but this provides
      an immediate solution for stable kernels too.
      
      [mkp: fixed typo]
      Reported-by: default avatarDouglas Miller <dougmill@linux.vnet.ibm.com>
      Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ced4c4c
    • Maurizio Lombardi's avatar
      scsi: ses: do not add a device to an enclosure if enclosure_add_links() fails. · ad8e43cf
      Maurizio Lombardi authored
      commit 62e62ffd upstream.
      
      The enclosure_add_device() function should fail if it can't create the
      relevant sysfs links.
      Signed-off-by: default avatarMaurizio Lombardi <mlombard@redhat.com>
      Tested-by: default avatarDouglas Miller <dougmill@linux.vnet.ibm.com>
      Acked-by: default avatarJames Bottomley <jejb@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ad8e43cf
    • Krzysztof Kozlowski's avatar
      PM / Domains: Fix unsafe iteration over modified list of domains · adc01577
      Krzysztof Kozlowski authored
      commit a7e2d1bc upstream.
      
      of_genpd_remove_last() iterates over list of domains and removes
      matching element thus it has to use safe version of list iteration.
      
      Fixes: 17926551 (PM / Domains: Add support for removing nested PM domains by provider)
      Signed-off-by: default avatarKrzysztof Kozlowski <krzk@kernel.org>
      Acked-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      adc01577
    • Krzysztof Kozlowski's avatar
      PM / Domains: Fix unsafe iteration over modified list of domain providers · 01ee0ee7
      Krzysztof Kozlowski authored
      commit b556b15d upstream.
      
      of_genpd_del_provider() iterates over list of domain provides and
      removes matching element thus it has to use safe version of list
      iteration.
      
      Fixes: aa42240a (PM / Domains: Add generic OF-based PM domain look-up)
      Signed-off-by: default avatarKrzysztof Kozlowski <krzk@kernel.org>
      Acked-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01ee0ee7
    • Krzysztof Kozlowski's avatar
      PM / Domains: Fix unsafe iteration over modified list of device links · 61461312
      Krzysztof Kozlowski authored
      commit c6e83cac upstream.
      
      pm_genpd_remove_subdomain() iterates over domain's master_links list and
      removes matching element thus it has to use safe version of list
      iteration.
      
      Fixes: f721889f ("PM / Domains: Support for generic I/O PM domains (v8)")
      Signed-off-by: default avatarKrzysztof Kozlowski <krzk@kernel.org>
      Acked-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61461312
    • Peter Rosin's avatar
      ASoC: atmel: tse850: fix off-by-one in the "ANA" enumeration count · 5a18ee01
      Peter Rosin authored
      commit a00cebf5 upstream.
      
      At some point I added the "Low" entry at the beginning of the array
      without bumping the enumeration count from 9 to 10. Fix this. While at
      it, fix the anti-pattern for the other enumeration (used by MUX{1,2}).
      
      Fixes: aa431124 ("ASoC: atmel: tse850: add ASoC driver for the Axentia TSE-850")
      Signed-off-by: default avatarPeter Rosin <peda@axentia.se>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a18ee01
    • Satish Babu Patakokila's avatar
      ASoC: compress: Derive substream from stream based on direction · 8024384e
      Satish Babu Patakokila authored
      commit 01b8cedf upstream.
      
      Currently compress driver hardcodes direction as playback to get
      substream from the stream. This results in getting the incorrect
      substream for compressed capture usecase.
      To fix this, remove the hardcoding and derive substream based on
      the stream direction.
      Signed-off-by: default avatarSatish Babu Patakokila <sbpata@codeaurora.org>
      Signed-off-by: default avatarBanajit Goswami <bgoswami@codeaurora.org>
      Acked-By: default avatarVinod Koul <vinod.koul@intel.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8024384e
    • Shawn Guo's avatar
      ASoC: zx-i2s: flip I2S master/slave mode · 1f3fb267
      Shawn Guo authored
      commit a205c159 upstream.
      
      The SND_SOC_DAIFMT_MASTER bits are defined to specify the master/slave
      mode for Codec, not I2S.  So the I2S master/slave mode should be flipped
      according to SND_SOC_DAIFMT_MASTER bits.
      Signed-off-by: default avatarShawn Guo <shawn.guo@linaro.org>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f3fb267
    • Cyrille Pitchen's avatar
      spi: atmel: fix corrupted data issue on SAM9 family SoCs · 2e74a4f5
      Cyrille Pitchen authored
      commit 7094576c upstream.
      
      This patch disables the use of the DMA for data transfer and forces the
      use of PIO transfers instead as a quick fixup to solve the cache aliasing
      issue on ARM9 based cores, which embeds a VIVT data cache.
      
      Indeed in the case of VIVT data caches, it is not safe to call dma_map_*()
      functions to map buffers for DMA transfers when those buffers have been
      allocated by vmalloc() or from any DMA-unsafe area.
      
      Further patches may propose a better solution based on the use of a bounce
      buffer at the SPI sub-system level but such solution needs more time to be
      discussed. Then the use of DMA transfers could be enabled again to improve
      the performances but before that, this patch already solves the issue.
      Signed-off-by: default avatarCyrille Pitchen <cyrille.pitchen@microchip.com>
      Acked-by: default avatarNicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e74a4f5
    • Matwey V Kornilov's avatar
      igb: Explicitly select page 0 at initialization · 9317c1d0
      Matwey V Kornilov authored
      commit 440aeca4 upstream.
      
      The functions igb_read_phy_reg_gs40g/igb_write_phy_reg_gs40g (which were
      removed in 2a3cdead) explicitly selected the required page at every phy_reg
      access. Currently, igb_get_phy_id_82575 relays on the fact that page 0 is
      already selected. The assumption is not fulfilled for my Lex 3I380CW
      motherboard with integrated dual i211 based gigabit ethernet. This leads to igb
      initialization failure and network interfaces are not working:
      
          igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
          igb: Copyright (c) 2007-2014 Intel Corporation.
          igb: probe of 0000:01:00.0 failed with error -2
          igb: probe of 0000:02:00.0 failed with error -2
      
      In order to fix it, we explicitly select page 0 before first access to phy
      registers.
      
      See also: https://bugzilla.suse.com/show_bug.cgi?id=1009911
      See also: http://www.lex.com.tw/products/pdf/3I380A&3I380CW.pdf
      
      Fixes: 2a3cdead ("igb: Remove GS40G specific defines/functions")
      Signed-off-by: default avatarMatwey V Kornilov <matwey@sai.msu.ru>
      Tested-by: default avatarAaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9317c1d0
    • Filipe Manana's avatar
      Btrfs: incremental send, fix invalid memory access · 114571a8
      Filipe Manana authored
      commit 24e52b11 upstream.
      
      When doing an incremental send, while processing an extent that changed
      between the parent and send snapshots and that extent was an inline extent
      in the parent snapshot, it's possible to access a memory region beyond
      the end of leaf if the inline extent is very small and it is the first
      item in a leaf.
      
      An example scenario is described below.
      
      The send snapshot has the following leaf:
      
       leaf 33865728 items 33 free space 773 generation 46 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              (...)
              item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 12791808 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 138170368 nr 225280
                      extent data offset 0 nr 225280 ram 225280
                      extent compression 0 (none)
              (...)
      
      And the parent snapshot has the following leaf:
      
       leaf 31272960 items 17 free space 17 generation 31 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                      generation 31 type 0 (inline)
                      inline extent data size 23 ram_bytes 613 compression 1 (zlib)
              (...)
      
      When computing the send stream, it is detected that the extent of inode
      335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we
      grab the leaf from the parent snapshot and access the inline extent item.
      However, before jumping to the 'out' label, we access the 'offset' and
      'disk_bytenr' fields of the extent item, which should not be done for
      inline extents since the inlined data starts at the offset of the
      'disk_bytenr' field and can be very small. For example accessing the
      'offset' field of the file extent item results in the following trace:
      
      [  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
      [  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
      [  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
      [  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
      [  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
      [  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
      [  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
      [  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
      [  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
      [  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
      [  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
      [  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
      [  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
      [  599.709340] Call Trace:
      [  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
      [  599.709340]  ? printk+0x48/0x50
      [  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
      [  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
      [  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
      [  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
      [  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
      [  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
      [  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
      [  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
      [  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
      [  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
      [  599.709340]  vfs_ioctl+0x21/0x38
      [  599.709340]  ? vfs_ioctl+0x21/0x38
      [  599.709340]  do_vfs_ioctl+0x611/0x645
      [  599.709340]  ? rcu_read_unlock+0x5b/0x5d
      [  599.709340]  ? __fget+0x6d/0x79
      [  599.709340]  SyS_ioctl+0x57/0x7b
      [  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  599.709340] RIP: 0033:0x7f51945eec47
      [  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
      [  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
      [  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
      [  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
      [  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
      [  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
      [  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
      [  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
      [  599.762057] ---[ end trace fe00d7af61b9f49e ]---
      
      This is because the 'offset' field starts at an offset of 37 bytes
      (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
      bytes and therefore attemping to read it causes a 1 byte access beyond
      the end of the leaf, as the first item's content in a leaf is located
      at the tail of the leaf, the item size is 44 bytes and the offset of
      that field plus its length (37 + 8 = 45) goes beyond the item's size
      by 1 byte.
      
      So fix this by accessing the 'offset' and 'disk_bytenr' fields after
      jumping to the 'out' label if we are processing an inline extent. We
      move the reading operation of the 'disk_bytenr' field too because we
      have the same problem as for the 'offset' field explained above when
      the inline data is less then 8 bytes. The access to the 'generation'
      field is also moved but just for the sake of grouping access to all
      the fields.
      
      Fixes: e1cbfd7b ("Btrfs: send, fix file hole not being preserved due to inline extent")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      114571a8
    • Jan Kara's avatar
      btrfs: Don't clear SGID when inheriting ACLs · 88b4b815
      Jan Kara authored
      commit b7f8a09f upstream.
      
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
      
      Fixes: 07393101
      CC: linux-btrfs@vger.kernel.org
      CC: David Sterba <dsterba@suse.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88b4b815
    • Filipe Manana's avatar
      Btrfs: fix invalid extent maps due to hole punching · dad8a6e1
      Filipe Manana authored
      commit 609805d8 upstream.
      
      While punching a hole in a range that is not aligned with the sector size
      (currently the same as the page size) we can end up leaving an extent map
      in memory with a length that is smaller then the sector size or with a
      start offset that is not aligned to the sector size. Both cases are not
      expected and can lead to problems. This issue is easily detected
      after the patch from commit a7e3b975 ("Btrfs: fix reported number of
      inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
      following for example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
        $ xfs_io -c "fpunch 60K 90K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
        $ umount /mnt
      
      After the unmount operation we can see several warnings emmitted due to
      underflows related to space reservation counters:
      
      [ 2837.443299] ------------[ cut here ]------------
      [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
      rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
      ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.462379] Call Trace:
      [ 2837.462379]  dump_stack+0x68/0x92
      [ 2837.462379]  __warn+0xc2/0xdd
      [ 2837.462379]  warn_slowpath_null+0x1d/0x1f
      [ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.462379]  destroy_inode+0x3d/0x55
      [ 2837.462379]  evict+0x177/0x17e
      [ 2837.462379]  dispose_list+0x50/0x71
      [ 2837.462379]  evict_inodes+0x132/0x141
      [ 2837.462379]  generic_shutdown_super+0x3f/0xeb
      [ 2837.462379]  kill_anon_super+0x12/0x1c
      [ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.462379]  deactivate_locked_super+0x30/0x68
      [ 2837.462379]  deactivate_super+0x36/0x39
      [ 2837.462379]  cleanup_mnt+0x58/0x76
      [ 2837.462379]  __cleanup_mnt+0x12/0x14
      [ 2837.462379]  task_work_run+0x77/0x9b
      [ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
      [ 2837.596256] ------------[ cut here ]------------
      [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.663359] Call Trace:
      [ 2837.663359]  dump_stack+0x68/0x92
      [ 2837.663359]  __warn+0xc2/0xdd
      [ 2837.663359]  warn_slowpath_null+0x1d/0x1f
      [ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.663359]  ? evict_inodes+0x132/0x141
      [ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.663359]  generic_shutdown_super+0x6a/0xeb
      [ 2837.663359]  kill_anon_super+0x12/0x1c
      [ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.663359]  deactivate_locked_super+0x30/0x68
      [ 2837.663359]  deactivate_super+0x36/0x39
      [ 2837.663359]  cleanup_mnt+0x58/0x76
      [ 2837.663359]  __cleanup_mnt+0x12/0x14
      [ 2837.663359]  task_work_run+0x77/0x9b
      [ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
      [ 2837.745595] ------------[ cut here ]------------
      [ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.758526] Call Trace:
      [ 2837.758925]  dump_stack+0x68/0x92
      [ 2837.759383]  __warn+0xc2/0xdd
      [ 2837.759383]  warn_slowpath_null+0x1d/0x1f
      [ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.759383]  ? evict_inodes+0x132/0x141
      [ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.759383]  generic_shutdown_super+0x6a/0xeb
      [ 2837.759383]  kill_anon_super+0x12/0x1c
      [ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.759383]  deactivate_locked_super+0x30/0x68
      [ 2837.759383]  deactivate_super+0x36/0x39
      [ 2837.759383]  cleanup_mnt+0x58/0x76
      [ 2837.759383]  __cleanup_mnt+0x12/0x14
      [ 2837.759383]  task_work_run+0x77/0x9b
      [ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
      [ 2837.778235] ------------[ cut here ]------------
      [ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.800118] Call Trace:
      [ 2837.800515]  dump_stack+0x68/0x92
      [ 2837.801015]  __warn+0xc2/0xdd
      [ 2837.801471]  warn_slowpath_null+0x1d/0x1f
      [ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.801698]  ? evict_inodes+0x132/0x141
      [ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.801698]  generic_shutdown_super+0x6a/0xeb
      [ 2837.801698]  kill_anon_super+0x12/0x1c
      [ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.801698]  deactivate_locked_super+0x30/0x68
      [ 2837.801698]  deactivate_super+0x36/0x39
      [ 2837.801698]  cleanup_mnt+0x58/0x76
      [ 2837.801698]  __cleanup_mnt+0x12/0x14
      [ 2837.801698]  task_work_run+0x77/0x9b
      [ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
      [ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
      [ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
      
      What happens in the above example is the following:
      
      1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
         is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
         This results in the creation of an extent map with a length of 2Kb
         starting at file offset 148Kb, through find_first_non_hole() ->
         btrfs_get_extent().
      
      2) The second write (first write after the hole punch operation), sets
         the range [50Kb, 152Kb[ to delalloc.
      
      3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
         map covering the range [148Kb, 150Kb[ and ends up calling
         set_extent_bit() for the same range, which results in splitting an
         existing extent state record, covering the range [148Kb, 152Kb[ into
         two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
         [150Kb, 152Kb[.
      
      4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
         btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
         range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
         callback being invoked against the two 2Kb extent state records that
         cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
         the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
         with a length argument of 2048 bytes. That function rounds up the length
         to a sector size aligned length, so it ends up considering a length of
         4096 bytes, and then calls calc_csum_metadata_size() which results in
         decrementing the inode's csum_bytes counter by 4096 bytes, so after
         it stays a value of 0 bytes. Then the same happens when
         btrfs_clear_bit_hook() is called against the second extent state that
         has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
         rounded up to 4096 and calc_csum_metadata_size() ends up being called
         to decrement 4096 bytes from the inode's csum_bytes counter, which
         at that time has a value of 0, leading to an underflow, which is
         exactly what triggers the first warning, at btrfs_destroy_inode().
         All the other warnings relate to several space accounting counters
         that underflow as well due to similar reasons.
      
      A similar case but where the hole punching operation creates an extent map
      with a start offset not aligned to the sector size is the following:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
        $ umount /mnt
      
      During the unmount operation we get similar traces for the same reasons as
      in the first example.
      
      So fix the hole punching operation to make sure it never creates extent
      maps with a length that is not aligned to the sector size nor with a start
      offset that is not aligned to the sector size, as this breaks all
      assumptions and it's a land mine.
      
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dad8a6e1