1. 06 Jan, 2016 8 commits
    • Daniel Borkmann's avatar
      netlink, mmap: fix edge-case leakages in nf queue zero-copy · 6a838edc
      Daniel Borkmann authored
      commit 6bb0fef4 upstream.
      
      When netlink mmap on receive side is the consumer of nf queue data,
      it can happen that in some edge cases, we write skb shared info into
      the user space mmap buffer:
      
      Assume a possible rx ring frame size of only 4096, and the network skb,
      which is being zero-copied into the netlink skb, contains page frags
      with an overall skb->len larger than the linear part of the netlink
      skb.
      
      skb_zerocopy(), which is generic and thus not aware of the fact that
      shared info cannot be accessed for such skbs then tries to write and
      fill frags, thus leaking kernel data/pointers and in some corner cases
      possibly writing out of bounds of the mmap area (when filling the
      last slot in the ring buffer this way).
      
      I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
      an advertised length larger than 4096, where the linear part is visible
      at the slot beginning, and the leaked sizeof(struct skb_shared_info)
      has been written to the beginning of the next slot (also corrupting
      the struct nl_mmap_hdr slot header incl. status etc), since skb->end
      points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.
      
      The fix adds and lets __netlink_alloc_skb() take the actual needed
      linear room for the network skb + meta data into account. It's completely
      irrelevant for non-mmaped netlink sockets, but in case mmap sockets
      are used, it can be decided whether the available skb_tailroom() is
      really large enough for the buffer, or whether it needs to internally
      fallback to a normal alloc_skb().
      
      >From nf queue side, the information whether the destination port is
      an mmap RX ring is not really available without extra port-to-socket
      lookup, thus it can only be determined in lower layers i.e. when
      __netlink_alloc_skb() is called that checks internally for this. I
      chose to add the extra ldiff parameter as mmap will then still work:
      We have data_len and hlen in nfqnl_build_packet_message(), data_len
      is the full length (capped at queue->copy_range) for skb_zerocopy()
      and hlen some possible part of data_len that needs to be copied; the
      rem_len variable indicates the needed remaining linear mmap space.
      
      The only other workaround in nf queue internally would be after
      allocation time by f.e. cap'ing the data_len to the skb_tailroom()
      iff we deal with an mmap skb, but that would 1) expose the fact that
      we use a mmap skb to upper layers, and 2) trim the skb where we
      otherwise could just have moved the full skb into the normal receive
      queue.
      
      After the patch, in my test case the ring slot doesn't fit and therefore
      shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
      thus needs to be picked up via recv().
      
      Fixes: 3ab1f683 ("nfnetlink: add support for memory mapped netlink")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      6a838edc
    • Daniel Borkmann's avatar
      ebpf: fix fd refcount leaks related to maps in bpf syscall · a7090f77
      Daniel Borkmann authored
      commit 592867bf upstream.
      
      We may already have gotten a proper fd struct through fdget(), so
      whenever we return at the end of an map operation, we need to call
      fdput(). However, each map operation from syscall side first probes
      CHECK_ATTR() to verify that unused fields in the bpf_attr union are
      zero.
      
      In case of malformed input, we return with error, but the lookup to
      the map_fd was already performed at that time, so that we return
      without an corresponding fdput(). Fix it by performing an fdget()
      only right before bpf_map_get(). The fdget() invocation on maps in
      the verifier is not affected.
      
      Fixes: db20fd2b ("bpf: add lookup/update/delete/iterate methods to BPF maps")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      a7090f77
    • Eric Dumazet's avatar
      task_work: remove fifo ordering guarantee · 3fbf4c73
      Eric Dumazet authored
      commit c8219906 upstream.
      
      In commit f341861f ("task_work: add a scheduling point in
      task_work_run()") I fixed a latency problem adding a cond_resched()
      call.
      
      Later, commit ac3d0da8 added yet another loop to reverse a list,
      bringing back the latency spike :
      
      I've seen in some cases this loop taking 275 ms, if for example a
      process with 2,000,000 files is killed.
      
      We could add yet another cond_resched() in the reverse loop, or we
      can simply remove the reversal, as I do not think anything
      would depend on order of task_work_add() submitted works.
      
      Fixes: ac3d0da8 ("task_work: Make task_work_add() lockless")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      3fbf4c73
    • Axel Lin's avatar
      ASoC: spear_pcm: Use devm_snd_dmaengine_pcm_register to fix resource leak · 2a11e1d5
      Axel Lin authored
      commit 2c3f4b97 upstream.
      
      All the callers assume devm_spear_pcm_platform_register is a devm_ API, so
      use devm_snd_dmaengine_pcm_register in devm_spear_pcm_platform_register.
      
      Fixes: e1771bcf ("ASoC: SPEAr: remove custom DMA alloc compat function")
      Signed-off-by: default avatarAxel Lin <axel.lin@ingics.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      2a11e1d5
    • Scott Feldman's avatar
      bridge: fix netlink max attr size · a515cb6e
      Scott Feldman authored
      commit eb4cb851 upstream.
      
      .maxtype should match .policy.  Probably just been getting lucky here
      because IFLA_BRPORT_MAX > IFLA_BR_MAX.
      
      Fixes: 13323516 ("bridge: implement rtnl_link_ops->changelink")
      Signed-off-by: default avatarScott Feldman <sfeldma@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      a515cb6e
    • Florian Fainelli's avatar
      net: bcmgenet: Delay PHY initialization to bcmgenet_open() · 47f3d72e
      Florian Fainelli authored
      commit 6cc8e6d4 upstream.
      
      We are currently doing a full PHY initialization and even starting the
      pHY state machine during bcmgenet_mii_init() which is executed in the
      driver's probe function. This is convenient to determine whether we can
      attach to a proper PHY device but comes at the expense of spending up to
      10ms per MDIO transactions (to reach the waitqueue timeout), which slows
      things down.
      
      This also creates a sitaution where we end-up attaching twice to the
      PHY, which is not quite correct either.
      
      Fix this by moving bcmgenet_mii_probe() into bcmgenet_open() and update
      its error path accordingly.
      
      Avoid printing the message "attached PHY at address 1 [...]" every time
      we bring up/down the interface and remove this print since it duplicates
      what the PHY driver already does for us.
      
      Fixes: 1c1008c7 ("net: bcmgenet: add main driver file")
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      47f3d72e
    • Florian Fainelli's avatar
      net: bcmgenet: Use correct dev_id for free_irq · 6629cd48
      Florian Fainelli authored
      commit 978ffac4 upstream.
      
      bcmgenet_open()'s error path call free_irq() with a dev_id argument
      different from the one we used to call request_irq() with, this will
      make us trip over the warning in kernel/irq/manage.c:__free_irq()
      
      Fixes: 1c1008c7 ("net: bcmgenet: add main driver file")
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      6629cd48
    • Grygorii Strashko's avatar
      pinctrl: single: dra7: remove PCS_QUIRK_SHARED_IRQ · d2e21c80
      Grygorii Strashko authored
      commit 6417049f upstream.
      
      On DRA7 there is one pinctrl domain (dra7_pmx_core) and
      PRCM wake-up IRQ is not shared, so remove quirk.
      
      Cc: Nishanth Menon <nm@ti.com>
      Fixes: 31320bea ('pinctrl: single: Add DRA7 pinctrl compatibility')
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Acked-by: default avatarTero Kristo <t-kristo@ti.com>
      Acked-by: default avatarTony Lindgren <tony@atomide.com>
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      d2e21c80
  2. 04 Jan, 2016 8 commits
    • Christophe Lombard's avatar
      cxl: Fix number of allocated pages in SPA · 32190d02
      Christophe Lombard authored
      commit 4108efb0 upstream.
      
      The scheduled process area is currently allocated before assigning the
      correct maximum processes to the AFU, which will mean we only ever
      allocate a fixed number of pages for the scheduled process area. This
      will limit us to 958 processes with 2 x 64K pages. If we try to use more
      processes than that we'd probably overrun the buffer and corrupt memory
      or crash.
      
      AFUs that require three or more interrupts per process will not be
      affected as they are already limited to less processes than that, but we
      could hit it on an AFU that requires 0, 1 or 2 interrupts per process,
      or when using 4K pages.
      
      This patch moves the initialisation of the num_procs to before the SPA
      allocation so that enough pages will be allocated for the number of
      processes that the AFU supports.
      Signed-off-by: default avatarChristophe Lombard <clombard@linux.vnet.ibm.com>
      Signed-off-by: default avatarIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      [ kamal: backport to 4.2-stable: context ]
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      32190d02
    • Maxim Sheviakov's avatar
      drm/radeon: fix quirk for MSI R7 370 Armor 2X · cd629882
      Maxim Sheviakov authored
      commit 515c752d upstream.
      
      There was a typo in the original.
      
      bug:
      https://bugs.freedesktop.org/show_bug.cgi?id=92865Signed-off-by: default avatarMaxim Sheviakov <mrader3940@yandex.ru>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      cd629882
    • Alex Deucher's avatar
    • Maxim Sheviakov's avatar
      drm/radeon: add quirk for MSI R7 370 · 8217dd18
      Maxim Sheviakov authored
      commit e7865479 upstream.
      
      Just adds the quirk for MSI R7 370 Armor 2X
      Bug:
      https://bugs.freedesktop.org/show_bug.cgi?id=91294Signed-off-by: default avatarMaxim Sheviakov <mrader3940@yandex.ru>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      8217dd18
    • Malcolm Crossley's avatar
      x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when sanitizing map · fe82c1fe
      Malcolm Crossley authored
      commit 64c98e7f upstream.
      
      Sanitizing the e820 map may produce extra E820 entries which would result in
      the topmost E820 entries being removed. The removed entries would typically
      include the top E820 usable RAM region and thus result in the domain having
      signicantly less RAM available to it.
      
      Fix by allowing sanitize_e820_map to use the full size of the allocated E820
      array.
      Signed-off-by: default avatarMalcolm Crossley <malcolm.crossley@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      [ kamal: backport to 3.19-stable: context ]
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      fe82c1fe
    • Nikhil Badola's avatar
      drivers: usb: fsl: Workaround for USB erratum-A005275 · 6039455f
      Nikhil Badola authored
      commit f8786a91 upstream.
      
      Incoming packets in high speed are randomly corrupted by h/w
      resulting in multiple errors. This workaround makes FS as
      default mode in all affected socs by disabling HS chirp
      signalling.This errata does not affect FS and LS mode.
      
      Forces all HS devices to connect in FS mode for all socs
      affected by this erratum:
      P3041 and P2041 rev 1.0 and 1.1
      P5020 and P5010 rev 1.0 and 2.0
      P5040, P1010 and T4240 rev 1.0
      Signed-off-by: default avatarRamneek Mehresh <ramneek.mehresh@freescale.com>
      Signed-off-by: default avatarNikhil Badola <nikhil.badola@freescale.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      6039455f
    • Nikhil Badola's avatar
      drivers: usb :fsl: Implement Workaround for USB Erratum A007792 · ae14b033
      Nikhil Badola authored
      commit 523f1dec upstream.
      
      USB controller version-2.5 requires to enable internal UTMI
      phy and program PTS field in PORTSC register before asserting
      controller reset. This is must for successful resetting of the
      controller and subsequent enumeration of usb devices
      Signed-off-by: default avatarNikhil Badola <nikhil.badola@freescale.com>
      Signed-off-by: default avatarSuresh Gupta <suresh.gupta@freescale.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      ae14b033
    • Eric Benard's avatar
      mxc_nand: fix copy_spare · 45ba1919
      Eric Benard authored
      commit e5a5d92d upstream.
      
      it was broken by 35d5d20e
      "mtd: mxc_nand: cleanup copy_spare function"
      
      else we get the following error :
      [   22.709507] ubi0: attaching mtd3
      [   23.613470] ubi0: scanning is finished
      [   23.617278] ubi0: empty MTD device detected
      [   23.623219] Unhandled fault: imprecise external abort (0x1c06) at 0x9e62f0ec
      [   23.630291] pgd = 9df80000
      [   23.633005] [9e62f0ec] *pgd=8e60041e(bad)
      [   23.637064] Internal error: : 1c06 [#1] SMP ARM
      [   23.641605] Modules linked in:
      [   23.644687] CPU: 0 PID: 99 Comm: ubiattach Not tainted 4.2.0-dirty #22
      [   23.651222] Hardware name: Freescale i.MX53 (Device Tree Support)
      [   23.657322] task: 9e687300 ti: 9dcfc000 task.ti: 9dcfc000
      [   23.662744] PC is at memcpy16_toio+0x4c/0x74
      [   23.667026] LR is at mxc_nand_command+0x484/0x640
      [   23.671739] pc : [<803f9c08>]    lr : [<803faeb0>]    psr: 60000013
      [   23.671739] sp : 9dcfdb10  ip : 9e62f0ea  fp : 9dcfdb1c
      [   23.683222] r10: a09c1000  r9 : 0000001a  r8 : ffffffff
      [   23.688453] r7 : ffffffff  r6 : 9e674810  r5 : 9e674810  r4 : 000000b6
      [   23.694985] r3 : a09c16a4  r2 : a09c16a4  r1 : a09c16a4  r0 : 0000ffff
      [   23.701521] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
      [   23.708662] Control: 10c5387d  Table: 8df80019  DAC: 00000015
      [   23.714413] Process ubiattach (pid: 99, stack limit = 0x9dcfc210)
      [   23.720514] Stack: (0x9dcfdb10 to 0x9dcfe000)
      [   23.724881] db00:                                     9dcfdb6c 9dcfdb20 803faeb0 803f9bc8
      [   23.733069] db20: 803f227c 803f9b74 ffffffff 9e674810 9e674810 9e674810 00000040 9e62f010
      [   23.741255] db40: 803faa2c 9e674b40 9e674810 803faa2c 00000400 803faa2c 00000000 9df42800
      [   23.749441] db60: 9dcfdb9c 9dcfdb70 803f2024 803faa38 9e4201cc 00000000 803f0a78 9e674b40
      [   23.757627] db80: 803f1f80 9e674810 00000400 00000400 9dcfdc14 9dcfdba0 803f3bd8 803f1f8c
      [   23.765814] dba0: 9e4201cc 00000000 00000580 00000000 00000000 800718c0 0000007f 00001000
      [   23.774000] dbc0: 9df42800 000000e0 00000000 00000000 9e4201cc 00000000 00000000 00000000
      [   23.782186] dbe0: 00000580 00000580 00000000 9e674810 9dcfdc20 9dcfdce8 9df42800 00580000
      [   23.790372] dc00: 00000000 00000400 9dcfdc6c 9dcfdc18 803f3f94 803f39a4 9dcfdc20 00000000
      [   23.798558] dc20: 00000000 00000400 00000000 00000000 00000000 00000000 9df42800 00000000
      [   23.806744] dc40: 9dcfdd0c 00580000 00000000 00000400 00000000 9df42800 9dee1000 9d802000
      [   23.814930] dc60: 9dcfdc94 9dcfdc70 803eb63c 803f3f38 00000400 9dcfdce8 9df42800 dead4ead
      [   23.823116] dc80: 803eb5f4 00000000 9dcfdcc4 9dcfdc98 803e82ac 803eb600 00000400 9dcfdce8
      [   23.831301] dca0: 9df42800 00000400 9dee0000 00000000 00000400 00000000 9dcfdd1c 9dcfdcc8
      [   23.839488] dcc0: 80406048 803e8230 00000400 9dcfdce8 9df42800 9dcfdc78 00000008 00000000
      [   23.847673] dce0: 00000000 00000000 00000000 00000004 00000000 9df42800 9dee0000 00000000
      [   23.855859] dd00: 9d802030 00000000 9dc8b214 9d802000 9dcfdd44 9dcfdd20 804066cc 80405f50
      [   23.864047] dd20: 00000400 9dc8b200 9d802030 9df42800 9dee0000 9dc8b200 9dcfdd84 9dcfdd48
      [   23.872233] dd40: 8040a544 804065ac 9e401c80 000080d0 9dcfdd84 00000001 800fc828 9df42400
      [   23.880418] dd60: 00000000 00000080 9dc8b200 9dc8b200 9dc8b200 9dee0000 9dcfdddc 9dcfdd88
      [   23.888605] dd80: 803fb560 8040a440 9dcfddc4 9dcfdd98 800f1428 9dee1000 a0acf000 00000000
      [   23.896792] dda0: 00000000 ffffffff 00000006 00000000 9dee0000 9dee0000 00005600 00000080
      [   23.904979] ddc0: 9dc8b200 a0acf000 9dc8b200 8112514c 9dcfde24 9dcfdde0 803fc08c 803fb4f0
      [   23.913165] dde0: 9e401c80 00000013 9dcfde04 9dcfddf8 8006bbf8 8006ba00 9dcfde24 00000000
      [   23.921351] de00: 9dee0000 00000065 9dee0000 00000001 9dc8b200 8112514c 9dcfde84 9dcfde28
      [   23.929538] de20: 8040afa0 803fb948 ffffffff 00000000 9dc8b214 9dcfde40 800f1428 800f11dc
      [   23.937724] de40: 9dc8b21c 9dc8b20c 9dc8b204 9dee1000 9dc8b214 8069bb60 fffff000 fffff000
      [   23.945911] de60: 9e7b5400 00000000 9dee0000 9dee1000 00001000 9e7b5400 9dcfdecc 9dcfde88
      [   23.954097] de80: 803ff1bc 8040a630 9dcfdea4 9dcfde98 00000800 00000800 9dcfdecc 9dcfdea8
      [   23.962284] dea0: 803e8f6c 00000000 7e87ab70 9e7b5400 80113e30 00000003 9dcfc000 00000000
      [   23.970470] dec0: 9dcfdf04 9dcfded0 804008cc 803feb98 ffffffff 00000003 00000000 00000000
      [   23.978656] dee0: 00000000 00000000 9e7cb000 9dc193e0 7e87ab70 9dd92140 9dcfdf7c 9dcfdf08
      [   23.986842] df00: 80113b5c 8040080c 800fbed8 8006bbf0 9e7cb000 00000003 9e7cb000 9dd92140
      [   23.995029] df20: 9dc193e0 9dd92148 9dcfdf4c 9dcfdf38 8011022c 800fbe78 8000f9cc 9e687300
      [   24.003216] df40: 9dcfdf6c 9dcfdf50 8011f798 8007ffe8 7e87ab70 9dd92140 00000003 9dd92140
      [   24.011402] df60: 40186f40 7e87ab70 9dcfc000 00000000 9dcfdfa4 9dcfdf80 80113e30 8011373c
      [   24.019588] df80: 7e87ab70 7e87ab70 7e87aea9 00000036 8000fb84 9dcfc000 00000000 9dcfdfa8
      [   24.027775] dfa0: 8000f9a0 80113e00 7e87ab70 7e87ab70 00000003 40186f40 7e87ab70 00000000
      [   24.035962] dfc0: 7e87ab70 7e87ab70 7e87aea9 00000036 00000000 00000000 76fd1f70 00000000
      [   24.044148] dfe0: 76f80f8c 7e87ab28 00009810 76f80fc4 60000010 00000003 00000000 00000000
      [   24.052328] Backtrace:
      [   24.054806] [<803f9bbc>] (memcpy16_toio) from [<803faeb0>] (mxc_nand_command+0x484/0x640)
      [   24.062996] [<803faa2c>] (mxc_nand_command) from [<803f2024>] (nand_write_page+0xa4/0x154)
      [   24.071264]  r10:9df42800 r9:00000000 r8:803faa2c r7:00000400 r6:803faa2c r5:9e674810
      [   24.079180]  r4:9e674b40
      [   24.081738] [<803f1f80>] (nand_write_page) from [<803f3bd8>] (nand_do_write_ops+0x240/0x444)
      [   24.090180]  r8:00000400 r7:00000400 r6:9e674810 r5:803f1f80 r4:9e674b40
      [   24.096970] [<803f3998>] (nand_do_write_ops) from [<803f3f94>] (nand_write+0x68/0x88)
      [   24.104804]  r10:00000400 r9:00000000 r8:00580000 r7:9df42800 r6:9dcfdce8 r5:9dcfdc20
      [   24.112719]  r4:9e674810
      [   24.115287] [<803f3f2c>] (nand_write) from [<803eb63c>] (part_write+0x48/0x50)
      [   24.122514]  r10:9d802000 r9:9dee1000 r8:9df42800 r7:00000000 r6:00000400 r5:00000000
      [   24.130429]  r4:00580000
      [   24.132989] [<803eb5f4>] (part_write) from [<803e82ac>] (mtd_write+0x88/0xa0)
      [   24.140129]  r5:00000000 r4:803eb5f4
      [   24.143748] [<803e8224>] (mtd_write) from [<80406048>] (ubi_io_write+0x104/0x65c)
      [   24.151235]  r7:00000000 r6:00000400 r5:00000000 r4:9dee0000
      [   24.156968] [<80405f44>] (ubi_io_write) from [<804066cc>] (ubi_io_write_ec_hdr+0x12c/0x190)
      [   24.165323]  r10:9d802000 r9:9dc8b214 r8:00000000 r7:9d802030 r6:00000000 r5:9dee0000
      [   24.173239]  r4:9df42800
      [   24.175798] [<804065a0>] (ubi_io_write_ec_hdr) from [<8040a544>] (ubi_early_get_peb+0x110/0x1f0)
      [   24.184587]  r6:9dc8b200 r5:9dee0000 r4:9df42800
      [   24.189262] [<8040a434>] (ubi_early_get_peb) from [<803fb560>] (create_vtbl+0x7c/0x238)
      [   24.197271]  r10:9dee0000 r9:9dc8b200 r8:9dc8b200 r7:9dc8b200 r6:00000080 r5:00000000
      [   24.205187]  r4:9df42400
      [   24.207746] [<803fb4e4>] (create_vtbl) from [<803fc08c>] (ubi_read_volume_table+0x750/0xa64)
      [   24.216187]  r10:8112514c r9:9dc8b200 r8:a0acf000 r7:9dc8b200 r6:00000080 r5:00005600
      [   24.224103]  r4:9dee0000
      [   24.226662] [<803fb93c>] (ubi_read_volume_table) from [<8040afa0>] (ubi_attach+0x97c/0x152c)
      [   24.235103]  r10:8112514c r9:9dc8b200 r8:00000001 r7:9dee0000 r6:00000065 r5:9dee0000
      [   24.243018]  r4:00000000
      [   24.245579] [<8040a624>] (ubi_attach) from [<803ff1bc>] (ubi_attach_mtd_dev+0x630/0xbac)
      [   24.253673]  r10:9e7b5400 r9:00001000 r8:9dee1000 r7:9dee0000 r6:00000000 r5:9e7b5400
      [   24.261588]  r4:fffff000
      [   24.264148] [<803feb8c>] (ubi_attach_mtd_dev) from [<804008cc>] (ctrl_cdev_ioctl+0xcc/0x1cc)
      [   24.272589]  r10:00000000 r9:9dcfc000 r8:00000003 r7:80113e30 r6:9e7b5400 r5:7e87ab70
      [   24.280505]  r4:00000000
      [   24.283070] [<80400800>] (ctrl_cdev_ioctl) from [<80113b5c>] (do_vfs_ioctl+0x42c/0x6c4)
      [   24.291077]  r6:9dd92140 r5:7e87ab70 r4:9dc193e0
      [   24.295753] [<80113730>] (do_vfs_ioctl) from [<80113e30>] (SyS_ioctl+0x3c/0x64)
      [   24.303066]  r10:00000000 r9:9dcfc000 r8:7e87ab70 r7:40186f40 r6:9dd92140 r5:00000003
      [   24.310981]  r4:9dd92140
      [   24.313549] [<80113df4>] (SyS_ioctl) from [<8000f9a0>] (ret_fast_syscall+0x0/0x54)
      [   24.321123]  r9:9dcfc000 r8:8000fb84 r7:00000036 r6:7e87aea9 r5:7e87ab70 r4:7e87ab70
      [   24.328957] Code: e1c300b0 e1510002 e1a03001 1afffff9 (e89da800)
      [   24.335066] ---[ end trace ab1cb17887f21bbb ]---
      [   24.340249] Unhandled fault: imprecise external abort (0x1c06) at 0x7ee8bcf0
      [   24.347310] pgd = 9df3c000
      [   24.350023] [7ee8bcf0] *pgd=8dcbf831, *pte=8eb3334f, *ppte=8eb3383f
      Segmentation fault
      
      Fixes: 35d5d20e ("mtd: mxc_nand: cleanup copy_spare function")
      Signed-off-by: default avatarEric Bénard <eric@eukrea.com>
      Reviewed-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Reviewed-by: default avatarBaruch Siach <baruch@tkos.co.il>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      45ba1919
  3. 15 Dec, 2015 24 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.2.8 · 5f5c1db0
      Greg Kroah-Hartman authored
      5f5c1db0
    • Filipe Manana's avatar
      Btrfs: fix regression running delayed references when using qgroups · dacd97b8
      Filipe Manana authored
      commit b06c4bf5 upstream.
      
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarElias Probst <mail@eliasprobst.eu>
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      dacd97b8
    • Hans Verkuil's avatar
      cobalt: fix Kconfig dependency · 25c386d0
      Hans Verkuil authored
      commit fc88dd16 upstream.
      
      The cobalt driver should depend on VIDEO_V4L2_SUBDEV_API.
      
      This fixes this kbuild error:
      
      tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
      head:   99bc7215
      commit: 85756a06 [media] cobalt: add new driver
      config: x86_64-randconfig-s0-09201514 (attached as .config)
      reproduce:
        git checkout 85756a06
        # save the attached .config to linux build tree
        make ARCH=x86_64
      
      All error/warnings (new ones prefixed by >>):
      
         drivers/media/i2c/adv7604.c: In function 'adv76xx_get_format':
      >> drivers/media/i2c/adv7604.c:1853:9: error: implicit declaration of function 'v4l2_subdev_get_try_format' [-Werror=implicit-function-declaration]
            fmt = v4l2_subdev_get_try_format(sd, cfg, format->pad);
                  ^
         drivers/media/i2c/adv7604.c:1853:7: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
            fmt = v4l2_subdev_get_try_format(sd, cfg, format->pad);
                ^
         drivers/media/i2c/adv7604.c: In function 'adv76xx_set_format':
         drivers/media/i2c/adv7604.c:1882:7: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
            fmt = v4l2_subdev_get_try_format(sd, cfg, format->pad);
                ^
         cc1: some warnings being treated as errors
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25c386d0
    • Lu, Han's avatar
      ALSA: hda/hdmi - apply Skylake fix-ups to Broxton display codec · bc506f02
      Lu, Han authored
      commit e2656412 upstream.
      
      Broxton and Skylake have the same behavior on display audio. So this patch
      applys Skylake fix-ups to Broxton.
      Signed-off-by: default avatarLu, Han <han.lu@intel.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc506f02
    • Arnd Bergmann's avatar
      ceph: fix message length computation · 3dbb4105
      Arnd Bergmann authored
      commit 777d738a upstream.
      
      create_request_message() computes the maximum length of a message,
      but uses the wrong type for the time stamp: sizeof(struct timespec)
      may be 8 or 16 depending on the architecture, while sizeof(struct
      ceph_timespec) is always 8, and that is what gets put into the
      message.
      
      Found while auditing the uses of timespec for y2038 problems.
      
      Fixes: b8e69066 ("ceph: include time stamp in every MDS request")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarYan, Zheng <zyan@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3dbb4105
    • Junxiao Bi's avatar
      ocfs2: fix umask ignored issue · b91750f8
      Junxiao Bi authored
      commit 8f1eb487 upstream.
      
      New created file's mode is not masked with umask, and this makes umask not
      work for ocfs2 volume.
      
      Fixes: 702e5bc6 ("ocfs2: use generic posix ACL infrastructure")
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b91750f8
    • Jeff Layton's avatar
      nfs: if we have no valid attrs, then don't declare the attribute cache valid · 12cdf727
      Jeff Layton authored
      commit c812012f upstream.
      
      If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
      (correctly) not update any of the attributes, but it then clears the
      NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
      up to date. Don't clear the flag if the fattr struct has no valid
      attrs to apply.
      Reviewed-by: default avatarSteve French <steve.french@primarydata.com>
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      12cdf727
    • Benjamin Coddington's avatar
      nfs4: start callback_ident at idr 1 · 1fa68c4c
      Benjamin Coddington authored
      commit c68a027c upstream.
      
      If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing
      it when the nfs_client is freed.  A decoding or server bug can then find
      and try to put that first nfs_client which would lead to a crash.
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Fixes: d6870312 ("nfs4client: convert to idr_alloc()")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fa68c4c
    • Daniel Borkmann's avatar
      debugfs: fix refcount imbalance in start_creating · a1936414
      Daniel Borkmann authored
      commit 0ee9608c upstream.
      
      In debugfs' start_creating(), we pin the file system to safely access
      its root. When we failed to create a file, we unpin the file system via
      failed_creating() to release the mount count and eventually the reference
      of the vfsmount.
      
      However, when we run into an error during lookup_one_len() when still
      in start_creating(), we only release the parent's mutex but not so the
      reference on the mount. Looks like it was done in the past, but after
      splitting portions of __create_file() into start_creating() and
      end_creating() via 190afd81 ("debugfs: split the beginning and the
      end of __create_file() off"), this seemed missed. Noticed during code
      review.
      
      Fixes: 190afd81 ("debugfs: split the beginning and the end of __create_file() off")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1936414
    • Andrew Elble's avatar
      nfsd: eliminate sending duplicate and repeated delegations · 814a24d5
      Andrew Elble authored
      commit 34ed9872 upstream.
      
      We've observed the nfsd server in a state where there are
      multiple delegations on the same nfs4_file for the same client.
      The nfs client does attempt to DELEGRETURN these when they are presented to
      it - but apparently under some (unknown) circumstances the client does not
      manage to return all of them. This leads to the eventual
      attempt to CB_RECALL more than one delegation with the same nfs
      filehandle to the same client. The first recall will succeed, but the
      next recall will fail with NFS4ERR_BADHANDLE. This leads to the server
      having delegations on cl_revoked that the client has no way to FREE
      or DELEGRETURN, with resulting inability to recover. The state manager
      on the server will continually assert SEQ4_STATUS_RECALLABLE_STATE_REVOKED,
      and the state manager on the client will be looping unable to satisfy
      the server.
      
      List discussion also reports a race between OPEN and DELEGRETURN that
      will be avoided by only sending the delegation once to the
      client. This is also logically in accordance with RFC5561 9.1.1 and 10.2.
      
      So, let's:
      
      1.) Not hand out duplicate delegations.
      2.) Only send them to the client once.
      
      RFC 5561:
      
      9.1.1:
      "Delegations and layouts, on the other hand, are not associated with a
      specific owner but are associated with the client as a whole
      (identified by a client ID)."
      
      10.2:
      "...the stateid for a delegation is associated with a client ID and may be
      used on behalf of all the open-owners for the given client.  A
      delegation is made to the client as a whole and not to any specific
      process or thread of control within it."
      Reported-by: default avatarEric Meddaugh <etmsys@rit.edu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Olga Kornievskaia <aglo@umich.edu>
      Signed-off-by: default avatarAndrew Elble <aweits@rit.edu>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      814a24d5
    • Jeff Layton's avatar
      nfsd: serialize state seqid morphing operations · 4708c200
      Jeff Layton authored
      commit 35a92fe8 upstream.
      
      Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
      running in parallel. The server would receive the OPEN_DOWNGRADE first
      and check its seqid, but then an OPEN would race in and bump it. The
      OPEN_DOWNGRADE would then complete and bump the seqid again.  The result
      was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
      it should have been rejected since the seqid changed.
      
      The only recourse we have here I think is to serialize operations that
      bump the seqid in a stateid, particularly when we're given a seqid in
      the call. To address this, we add a new rw_semaphore to the
      nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
      after looking up the stateid to ensure that nothing else is going to
      bump it while we're operating on it.
      
      In the case of OPEN, we do a down_read, as the call doesn't contain a
      seqid. Those can run in parallel -- we just need to serialize them when
      there is a concurrent OPEN_DOWNGRADE or CLOSE.
      
      LOCK and LOCKU however always take the write lock as there is no
      opportunity for parallelizing those.
      Reported-and-Tested-by: default avatarAndrew W Elble <aweits@rit.edu>
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4708c200
    • Stefan Richter's avatar
      firewire: ohci: fix JMicron JMB38x IT context discovery · 080fe15d
      Stefan Richter authored
      commit 100ceb66 upstream.
      
      Reported by Clifford and Craig for JMicron OHCI-1394 + SDHCI combo
      controllers:  Often or even most of the time, the controller is
      initialized with the message "added OHCI v1.10 device as card 0, 4 IR +
      0 IT contexts, quirks 0x10".  With 0 isochronous transmit DMA contexts
      (IT contexts), applications like audio output are impossible.
      
      However, OHCI-1394 demands that at least 4 IT contexts are implemented
      by the link layer controller, and indeed JMicron JMB38x do implement
      four of them.  Only their IsoXmitIntMask register is unreliable at early
      access.
      
      With my own JMB381 single function controller I found:
        - I can reproduce the problem with a lower probability than Craig's.
        - If I put a loop around the section which clears and reads
          IsoXmitIntMask, then either the first or the second attempt will
          return the correct initial mask of 0x0000000f.  I never encountered
          a case of needing more than a second attempt.
        - Consequently, if I put a dummy reg_read(...IsoXmitIntMaskSet)
          before the first write, the subsequent read will return the correct
          result.
        - If I merely ignore a wrong read result and force the known real
          result, later isochronous transmit DMA usage works just fine.
      
      So let's just fix this chip bug up by the latter method.  Tested with
      JMB381 on kernel 3.13 and 4.3.
      
      Since OHCI-1394 generally requires 4 IT contexts at a minium, this
      workaround is simply applied whenever the initial read of IsoXmitIntMask
      returns 0, regardless whether it's a JMicron chip or not.  I never heard
      of this issue together with any other chip though.
      
      I am not 100% sure that this fix works on the OHCI-1394 part of JMB380
      and JMB388 combo controllers exactly the same as on the JMB381 single-
      function controller, but so far I haven't had a chance to let an owner
      of a combo chip run a patched kernel.
      
      Strangely enough, IsoRecvIntMask is always reported correctly, even
      though it is probed right before IsoXmitIntMask.
      
      Reported-by: Clifford Dunn
      Reported-by: default avatarCraig Moore <craig.moore@qenos.com>
      Signed-off-by: default avatarStefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      080fe15d
    • Daeho Jeong's avatar
      ext4, jbd2: ensure entering into panic after recording an error in superblock · 3be55388
      Daeho Jeong authored
      commit 4327ba52 upstream.
      
      If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
      journaling will be aborted first and the error number will be recorded
      into JBD2 superblock and, finally, the system will enter into the
      panic state in "errors=panic" option.  But, in the rare case, this
      sequence is little twisted like the below figure and it will happen
      that the system enters into panic state, which means the system reset
      in mobile environment, before completion of recording an error in the
      journal superblock. In this case, e2fsck cannot recognize that the
      filesystem failure occurred in the previous run and the corruption
      wouldn't be fixed.
      
      Task A                        Task B
      ext4_handle_error()
      -> jbd2_journal_abort()
        -> __journal_abort_soft()
          -> __jbd2_journal_abort_hard()
          | -> journal->j_flags |= JBD2_ABORT;
          |
          |                         __ext4_abort()
          |                         -> jbd2_journal_abort()
          |                         | -> __journal_abort_soft()
          |                         |   -> if (journal->j_flags & JBD2_ABORT)
          |                         |           return;
          |                         -> panic()
          |
          -> jbd2_journal_update_sb_errno()
      Tested-by: default avatarHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3be55388
    • Lukas Czerner's avatar
      ext4: fix potential use after free in __ext4_journal_stop · e107e557
      Lukas Czerner authored
      commit 6934da92 upstream.
      
      There is a use-after-free possibility in __ext4_journal_stop() in the
      case that we free the handle in the first jbd2_journal_stop() because
      we're referencing handle->h_err afterwards. This was introduced in
      9705acd6 and it is wrong. Fix it by
      storing the handle->h_err value beforehand and avoid referencing
      potentially freed handle.
      
      Fixes: 9705acd6Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e107e557
    • Theodore Ts'o's avatar
      ext4 crypto: replace some BUG_ON()'s with error checks · 89fa6ced
      Theodore Ts'o authored
      commit 687c3c36 upstream.
      
      Buggy (or hostile) userspace should not be able to cause the kernel to
      crash.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      89fa6ced
    • Theodore Ts'o's avatar
      ext4 crypto: fix memory leak in ext4_bio_write_page() · 34a6ac6a
      Theodore Ts'o authored
      commit 937d7b84 upstream.
      
      There are times when ext4_bio_write_page() is called even though we
      don't actually need to do any I/O.  This happens when ext4_writepage()
      gets called by the jbd2 commit path when an inode needs to force its
      pages written out in order to provide data=ordered guarantees --- and
      a page is backed by an unwritten (e.g., uninitialized) block on disk,
      or if delayed allocation means the page's backing store hasn't been
      allocated yet.  In that case, we need to skip the call to
      ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
      bounce page and an ext4 crypto context getting leaked.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      34a6ac6a
    • Ilya Dryomov's avatar
      rbd: don't put snap_context twice in rbd_queue_workfn() · 587de47c
      Ilya Dryomov authored
      commit 70b16db8 upstream.
      
      Commit 4e752f0a ("rbd: access snapshot context and mapping size
      safely") moved ceph_get_snap_context() out of rbd_img_request_create()
      and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
      error path in rbd_queue_workfn().  However, rbd_img_request_create()
      consumes a ref on snapc, so calling ceph_put_snap_context() after
      a successful rbd_img_request_create() leads to an extra put.  Fix it.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarJosh Durgin <jdurgin@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      587de47c
    • David Sterba's avatar
      btrfs: fix signed overflows in btrfs_sync_file · cf7c570b
      David Sterba authored
      commit 9dcbeed4 upstream.
      
      The calculation of range length in btrfs_sync_file leads to signed
      overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.
      
      https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      
      The fsync call passes 0 and LLONG_MAX, the range length does not fit to
      loff_t and overflows, but the value is converted to u64 so it silently
      works as expected.
      
      The minimal fix is a typecast to u64, switching functions to take
      (start, end) instead of (start, len) would be more intrusive.
      
      Coccinelle script found that there's one more opencoded calculation of
      the length.
      
      <smpl>
      @@
      loff_t start, end;
      @@
      * end - start
      </smpl>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf7c570b
    • Filipe Manana's avatar
      Btrfs: fix race when listing an inode's xattrs · d3b189c3
      Filipe Manana authored
      commit f1cd1f0b upstream.
      
      When listing a inode's xattrs we have a time window where we race against
      a concurrent operation for adding a new hard link for our inode that makes
      us not return any xattr to user space. In order for this to happen, the
      first xattr of our inode needs to be at slot 0 of a leaf and the previous
      leaf must still have room for an inode ref (or extref) item, and this can
      happen because an inode's listxattrs callback does not lock the inode's
      i_mutex (nor does the VFS does it for us), but adding a hard link to an
      inode makes the VFS lock the inode's i_mutex before calling the inode's
      link callback.
      
      If we have the following leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
                 slot N - 2         slot N - 1              slot 0
      
      The race illustrated by the following sequence diagram is possible:
      
             CPU 1                                               CPU 2
      
        btrfs_listxattr()
      
          searches for key (257 XATTR_ITEM 0)
      
          gets path with path->nodes[0] == leaf X
          and path->slots[0] == N
      
          because path->slots[0] is >=
          btrfs_header_nritems(leaf X), it calls
          btrfs_next_leaf()
      
          btrfs_next_leaf()
            releases the path
      
                                                         adds key (257 INODE_REF 666)
                                                         to the end of leaf X (slot N),
                                                         and leaf X now has N + 1 items
      
            searches for the key (257 INODE_REF 256),
            with path->keep_locks == 1, because that
            is the last key it saw in leaf X before
            releasing the path
      
            ends up at leaf X again and it verifies
            that the key (257 INODE_REF 256) is no
            longer the last key in leaf X, so it
            returns with path->nodes[0] == leaf X
            and path->slots[0] == N, pointing to
            the new item with key (257 INODE_REF 666)
      
          btrfs_listxattr's loop iteration sees that
          the type of the key pointed by the path is
          different from the type BTRFS_XATTR_ITEM_KEY
          and so it breaks the loop and stops looking
          for more xattr items
            --> the application doesn't get any xattr
                listed for our inode
      
      So fix this by breaking the loop only if the key's type is greater than
      BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d3b189c3
    • Filipe Manana's avatar
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 3e25e3a0
      Filipe Manana authored
      commit 1d512cb7 upstream.
      
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e25e3a0
    • Filipe Manana's avatar
      Btrfs: fix race leading to incorrect item deletion when dropping extents · 4c53de05
      Filipe Manana authored
      commit aeafbf84 upstream.
      
      While running a stress test I got the following warning triggered:
      
        [191627.672810] ------------[ cut here ]------------
        [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
        (...)
        [191627.701485] Call Trace:
        [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
        [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
        [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
        [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
        [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
        [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
        [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
        [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
        [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
        [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
        [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
        [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
        [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
        [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.744206] ---[ end trace bbfddacb7aaada8d ]---
      
        $ cat -n fs/btrfs/file.c
        691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
        (...)
        758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
        759                  if (key.objectid > ino ||
        760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
        761                          break;
        762
        763                  fi = btrfs_item_ptr(leaf, path->slots[0],
        764                                      struct btrfs_file_extent_item);
        765                  extent_type = btrfs_file_extent_type(leaf, fi);
        766
        767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
        768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
        (...)
        774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        (...)
        778                  } else {
        779                          WARN_ON(1);
        780                          extent_end = search_start;
        781                  }
        (...)
      
      This happened because the item we were processing did not match a file
      extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
      case we cast the item to a struct btrfs_file_extent_item pointer and
      then find a type field value that does not match any of the expected
      values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
      due to a tiny time window where a race can happen as exemplified below.
      For example, consider the following scenario where we're using the
      NO_HOLES feature and we have the following two neighbour leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
      [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                slot N - 2         slot N - 1              slot 0
      
      Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
      than explicit because NO_HOLES is enabled). Now if our inode has an
      ordered extent for the range [4K, 8K[ that is finishing, the following
      can happen:
      
                CPU 1                                       CPU 2
      
        btrfs_finish_ordered_io()
          insert_reserved_file_extent()
            __btrfs_drop_extents()
               Searches for the key
                (257 EXTENT_DATA 4096) through
                btrfs_lookup_file_extent()
      
               Key not found and we get a path where
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
               Because path->slots[0] is >=
               btrfs_header_nritems(leaf X), we call
               btrfs_next_leaf()
      
               btrfs_next_leaf() releases the path
      
                                                        inserts key
                                                        (257 INODE_REF 4096)
                                                        at the end of leaf X,
                                                        leaf X now has N + 1 keys,
                                                        and the new key is at
                                                        slot N
      
               btrfs_next_leaf() searches for
               key (257 INODE_REF 256), with
               path->keep_locks set to 1,
               because it was the last key it
               saw in leaf X
      
                 finds it in leaf X again and
                 notices it's no longer the last
                 key of the leaf, so it returns 0
                 with path->nodes[0] == leaf X and
                 path->slots[0] == N (which is now
                 < btrfs_header_nritems(leaf X)),
                 pointing to the new key
                 (257 INODE_REF 4096)
      
               __btrfs_drop_extents() casts the
               item at path->nodes[0], slot
               path->slots[0], to a struct
               btrfs_file_extent_item - it does
               not skip keys for the target
               inode with a type less than
               BTRFS_EXTENT_DATA_KEY
               (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
      
               sees a bogus value for the type
               field triggering the WARN_ON in
               the trace shown above, and sets
               extent_end = search_start (4096)
      
               does the if-then-else logic to
               fixup 0 length extent items created
               by a past bug from hole punching:
      
                 if (extent_end == key.offset &&
                     extent_end >= search_start)
                     goto delete_extent_item;
      
               that evaluates to true and it ends
               up deleting the key pointed to by
               path->slots[0], (257 INODE_REF 4096),
               from leaf X
      
      The same could happen for example for a xattr that ends up having a key
      with an offset value that matches search_start (very unlikely but not
      impossible).
      
      So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
      skipped, never casted to struct btrfs_file_extent_item and never deleted
      by accident. Also protect against the unexpected case of getting a key
      for a lower inode number by skipping that key and issuing a warning.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c53de05
    • Filipe Manana's avatar
      Btrfs: fix regression when running delayed references · 480d3c86
      Filipe Manana authored
      commit 2c3cf7d5 upstream.
      
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      480d3c86
    • Filipe Manana's avatar
      Btrfs: fix truncation of compressed and inlined extents · f50bb341
      Filipe Manana authored
      commit 0305cd5f upstream.
      
      When truncating a file to a smaller size which consists of an inline
      extent that is compressed, we did not discard (or made unusable) the
      data between the new file size and the old file size, wasting metadata
      space and allowing for the truncated data to be leaked and the data
      corruption/loss mentioned below.
      We were also not correctly decrementing the number of bytes used by the
      inode, we were setting it to zero, giving a wrong report for callers of
      the stat(2) syscall. The fsck tool also reported an error about a mismatch
      between the nbytes of the file versus the real space used by the file.
      
      Now because we weren't discarding the truncated region of the file, it
      was possible for a caller of the clone ioctl to actually read the data
      that was truncated, allowing for a security breach without requiring root
      access to the system, using only standard filesystem operations. The
      scenario is the following:
      
         1) User A creates a file which consists of an inline and compressed
            extent with a size of 2000 bytes - the file is not accessible to
            any other users (no read, write or execution permission for anyone
            else);
      
         2) The user truncates the file to a size of 1000 bytes;
      
         3) User A makes the file world readable;
      
         4) User B creates a file consisting of an inline extent of 2000 bytes;
      
         5) User B issues a clone operation from user A's file into its own
            file (using a length argument of 0, clone the whole range);
      
         6) User B now gets to see the 1000 bytes that user A truncated from
            its file before it made its file world readbale. User B also lost
            the bytes in the range [1000, 2000[ bytes from its own file, but
            that might be ok if his/her intention was reading stale data from
            user A that was never supposed to be public.
      
      Note that this contrasts with the case where we truncate a file from 2000
      bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
      this case reading any byte from the range [1000, 2000[ will return a value
      of 0x00, instead of the original data.
      
      This problem exists since the clone ioctl was added and happens both with
      and without my recent data loss and file corruption fixes for the clone
      ioctl (patch "Btrfs: fix file corruption and data loss after cloning
      inline extents").
      
      So fix this by truncating the compressed inline extents as we do for the
      non-compressed case, which involves decompressing, if the data isn't already
      in the page cache, compressing the truncated version of the extent, writing
      the compressed content into the inline extent and then truncate it.
      
      The following test case for fstests reproduces the problem. In order for
      the test to pass both this fix and my previous fix for the clone ioctl
      that forbids cloning a smaller inline extent into a larger one,
      which is titled "Btrfs: fix file corruption and data loss after cloning
      inline extents", are needed. Without that other fix the test fails in a
      different way that does not leak the truncated data, instead part of
      destination file gets replaced with zeroes (because the destination file
      has a larger inline extent than the source).
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create our test files. File foo is going to be the source of a clone operation
        # and consists of a single inline extent with an uncompressed size of 512 bytes,
        # while file bar consists of a single inline extent with an uncompressed size of
        # 256 bytes. For our test's purpose, it's important that file bar has an inline
        # extent with a size smaller than foo's inline extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128"   \
                -c "pwrite -S 0x2a 128 384" \
                $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io
      
        # Now durably persist all metadata and data. We do this to make sure that we get
        # on disk an inline extent with a size of 512 bytes for file foo.
        sync
      
        # Now truncate our file foo to a smaller size. Because it consists of a
        # compressed and inline extent, btrfs did not shrink the inline extent to the
        # new size (if the extent was not compressed, btrfs would shrink it to 128
        # bytes), it only updates the inode's i_size to 128 bytes.
        $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo
      
        # Now clone foo's inline extent into bar.
        # This clone operation should fail with errno EOPNOTSUPP because the source
        # file consists only of an inline extent and the file's size is smaller than
        # the inline extent of the destination (128 bytes < 256 bytes). However the
        # clone ioctl was not prepared to deal with a file that has a size smaller
        # than the size of its inline extent (something that happens only for compressed
        # inline extents), resulting in copying the full inline extent from the source
        # file into the destination file.
        #
        # Note that btrfs' clone operation for inline extents consists of removing the
        # inline extent from the destination inode and copy the inline extent from the
        # source inode into the destination inode, meaning that if the destination
        # inode's inline extent is larger (N bytes) than the source inode's inline
        # extent (M bytes), some bytes (N - M bytes) will be lost from the destination
        # file. Btrfs could copy the source inline extent's data into the destination's
        # inline extent so that we would not lose any data, but that's currently not
        # done due to the complexity that would be needed to deal with such cases
        # (specially when one or both extents are compressed), returning EOPNOTSUPP, as
        # it's normally not a very common case to clone very small files (only case
        # where we get inline extents) and copying inline extents does not save any
        # space (unlike for normal, non-inlined extents).
        $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Now because the above clone operation used to succeed, and due to foo's inline
        # extent not being shinked by the truncate operation, our file bar got the whole
        # inline extent copied from foo, making us lose the last 128 bytes from bar
        # which got replaced by the bytes in range [128, 256[ from foo before foo was
        # truncated - in other words, data loss from bar and being able to read old and
        # stale data from foo that should not be possible to read anymore through normal
        # filesystem operations. Contrast with the case where we truncate a file from a
        # size N to a smaller size M, truncate it back to size N and then read the range
        # [M, N[, we should always get the value 0x00 for all the bytes in that range.
      
        # We expected the clone operation to fail with errno EOPNOTSUPP and therefore
        # not modify our file's bar data/metadata. So its content should be 256 bytes
        # long with all bytes having the value 0xbb.
        #
        # Without the btrfs bug fix, the clone operation succeeded and resulted in
        # leaking truncated data from foo, the bytes that belonged to its range
        # [128, 256[, and losing data from bar in that same range. So reading the
        # file gave us the following content:
        #
        # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
        # *
        # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
        # *
        # 0000400
        echo "File bar's content after the clone operation:"
        od -t x1 $SCRATCH_MNT/bar
      
        # Also because the foo's inline extent was not shrunk by the truncate
        # operation, btrfs' fsck, which is run by the fstests framework everytime a
        # test completes, failed reporting the following error:
        #
        #  root 5 inode 257 errors 400, nbytes wrong
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f50bb341
    • Filipe Manana's avatar
      Btrfs: fix file corruption and data loss after cloning inline extents · 8d073ae4
      Filipe Manana authored
      commit 8039d87d upstream.
      
      Currently the clone ioctl allows to clone an inline extent from one file
      to another that already has other (non-inlined) extents. This is a problem
      because btrfs is not designed to deal with files having inline and regular
      extents, if a file has an inline extent then it must be the only extent
      in the file and must start at file offset 0. Having a file with an inline
      extent followed by regular extents results in EIO errors when doing reads
      or writes against the first 4K of the file.
      
      Also, the clone ioctl allows one to lose data if the source file consists
      of a single inline extent, with a size of N bytes, and the destination
      file consists of a single inline extent with a size of M bytes, where we
      have M > N. In this case the clone operation removes the inline extent
      from the destination file and then copies the inline extent from the
      source file into the destination file - we lose the M - N bytes from the
      destination file, a read operation will get the value 0x00 for any bytes
      in the the range [N, M] (the destination inode's i_size remained as M,
      that's why we can read past N bytes).
      
      So fix this by not allowing such destructive operations to happen and
      return errno EOPNOTSUPP to user space.
      
      Currently the fstest btrfs/035 tests the data loss case but it totally
      ignores this - i.e. expects the operation to succeed and does not check
      the we got data loss.
      
      The following test case for fstests exercises all these cases that result
      in file corruption and data loss:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
        _require_btrfs_fs_feature "no_holes"
        _require_btrfs_mkfs_feature "no-holes"
      
        rm -f $seqres.full
      
        test_cloning_inline_extents()
        {
            local mkfs_opts=$1
            local mount_opts=$2
      
            _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # File bar, the source for all the following clone operations, consists
            # of a single inline extent (50 bytes).
            $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
                | _filter_xfs_io
      
            # Test cloning into a file with an extent (non-inlined) where the
            # destination offset overlaps that extent. It should not be possible to
            # clone the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "File foo data after clone operation:"
            # All bytes should have the value 0xaa (clone operation failed and did
            # not modify our file).
            od -t x1 $SCRATCH_MNT/foo
            $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Test cloning the inline extent against a file which has a hole in its
            # first 4K followed by a non-inlined extent. It should not be possible
            # as well to clone the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "File foo2 data after clone operation:"
            # All bytes should have the value 0x00 (clone operation failed and did
            # not modify our file).
            od -t x1 $SCRATCH_MNT/foo2
            $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which has a size of zero
            # but has a prealloc extent. It should not be possible as well to clone
            # the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "First 50 bytes of foo3 after clone operation:"
            # Should not be able to read any bytes, file has 0 bytes i_size (the
            # clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo3
            $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which consists of a
            # single inline extent that has a size not greater than the size of
            # bar's inline extent (40 < 50).
            # It should be possible to do the extent cloning from bar to this file.
            $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4
      
            # Doing IO against any range in the first 4K of the file should work.
            echo "File foo4 data after clone operation:"
            # Must match file bar's content.
            od -t x1 $SCRATCH_MNT/foo4
            $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which consists of a
            # single inline extent that has a size greater than the size of bar's
            # inline extent (60 > 50).
            # It should not be possible to clone the inline extent from file bar
            # into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5
      
            # Reading the file should not fail.
            echo "File foo5 data after clone operation:"
            # Must have a size of 60 bytes, with all bytes having a value of 0x03
            # (the clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo5
      
            # Test cloning the inline extent against a file which has no extents but
            # has a size greater than bar's inline extent (16K > 50).
            # It should not be possible to clone the inline extent from file bar
            # into this file.
            $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6
      
            # Reading the file should not fail.
            echo "File foo6 data after clone operation:"
            # Must have a size of 16K, with all bytes having a value of 0x00 (the
            # clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo6
      
            # Test cloning the inline extent against a file which has no extents but
            # has a size not greater than bar's inline extent (30 < 50).
            # It should be possible to clone the inline extent from file bar into
            # this file.
            $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7
      
            # Reading the file should not fail.
            echo "File foo7 data after clone operation:"
            # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
            od -t x1 $SCRATCH_MNT/foo7
      
            # Test cloning the inline extent against a file which has a size not
            # greater than the size of bar's inline extent (20 < 50) but has
            # a prealloc extent that goes beyond the file's size. It should not be
            # possible to clone the inline extent from bar into this file.
            $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                            -c "pwrite -S 0x88 0 20" \
                            $SCRATCH_MNT/foo8 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8
      
            echo "File foo8 data after clone operation:"
            # Must have a size of 20 bytes, with all bytes having a value of 0x88
            # (the clone operation did not modify our file).
            od -t x1 $SCRATCH_MNT/foo8
      
            _scratch_unmount
        }
      
        echo -e "\nTesting without compression and without the no-holes feature...\n"
        test_cloning_inline_extents
      
        echo -e "\nTesting with compression and without the no-holes feature...\n"
        test_cloning_inline_extents "" "-o compress"
      
        echo -e "\nTesting without compression and with the no-holes feature...\n"
        test_cloning_inline_extents "-O no-holes" ""
      
        echo -e "\nTesting with compression and with the no-holes feature...\n"
        test_cloning_inline_extents "-O no-holes" "-o compress"
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d073ae4