1. 27 Aug, 2016 22 commits
    • Merge branch 'akpm' (patches from Andrew) · 5e608a02
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "11 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: silently skip readahead for DAX inodes
        dax: fix device-dax region base
        fs/seq_file: fix out-of-bounds read
        mm: memcontrol: avoid unused function warning
        mm: clarify COMPACTION Kconfig text
        treewide: replace config_enabled() with IS_ENABLED() (2nd round)
        printk: fix parsing of "brl=" option
        soft_dirty: fix soft_dirty during THP split
        sysctl: handle error writing UINT_MAX to u32 fields
        get_maintainer: quiet noisy implicit -f vcs_file_exists checking
        byteswap: don't use __builtin_bswap*() with sparse
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 65fc7d54
      Linus Torvalds authored
      Pull ARM64 fix from Catalin Marinas:
       "ARM64 fix to avoid potential TLB conflict when CONFIG_RANDOMIZE_BASE
        is enabled"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: avoid TLB conflict with CONFIG_RANDOMIZE_BASE
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · a3d34698
      Linus Torvalds authored
      Pull rdma fixes from Doug Ledford:
       "Round one of 4.8 rc fixes.
      
        This should be the bulk of the -rc fixes for 4.8.  I only have a few
        things that are still outstanding (two ipoib bugs for which the
        solution is not yet fully known, and a few queued items that came in
        after my last push and I didn't want to delay this pull request for
        late comers again).
      
        Even though the patch count is kind of high, everything is minor fixes
        so the overall churn is pretty low.
      
        Summary:
      
         - minor fixes to cxgb4
         - minor fixes to mlx4
         - one minor fix each to core, rxe, isert, srpt, mlx5, ocrdma, and usnic
         - six or so i40iw fixes
         - the rest are hfi1 fixes"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (34 commits)
        i40iw: Send last streaming mode message for loopback connections
        IB/srpt: Update sport->port_guid with each port refresh
        RDMA/ocrdma: Fix the max_sge reported from FW
        i40iw: Avoid writing to freed memory
        i40iw: Fix double free of allocated_buffer
        IB/mlx5: Remove superfluous include of io-mapping.h
        i40iw: Do not set self-referencing pointer to NULL after kfree
        i40iw: Add missing NULL check for MPA private data
        iw_cxgb4: Fix cxgb4 arm CQ logic w/IB_CQ_REPORT_MISSED_EVENTS
        i40iw: Add missing check for interface already open
        i40iw: Protect req_resource_num update
        i40iw: Change mem_resources pointer to a u8
        IB/core: Use memdup_user() rather than duplicating its implementation
        IB/qib: Use memdup_user() rather than duplicating its implementation
        iw_cxgb4: use the MPA initiator's IRD if < our ORD
        iw_cxgb4: limit IRD/ORD advertised to ULP by device max.
        IB/hfi1: Fix mm_struct use after free
        IB/rdmvat: Fix double vfree() in rvt_create_qp() error path
        IB/hfi1: Improve J_KEY generation
        IB/hfi1: Return invalid field for non-QSFP CableInfo queries
        ...
    • Merge tag 'sound-4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 03cef710
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Here are a bunch of fixes as you can see in diffstat.
      
        One core change in ASoC is about the unexpected unbinding error, and
        another about debugfs cleanup.
      
        The rest are wide-spread driver-specific fixes: a series of LINE6 USB
        fixes, a HD-audio quirk, and various ASoC fixes including OMAP boot
        fixes and Intel SKL fixes"
      
      * tag 'sound-4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (22 commits)
        ALSA: hda/realtek - fix headset mic detection for MSI MS-B120
        ASoC: omap-mcpdm: Fix irq resource handling
        ASoC: max98371: Add terminate entry for i2c_device_id tables
        ALSA: line6: Fix POD sysfs attributes segfault
        ALSA: line6: Give up on the lock while URBs are released.
        ALSA: line6: Remove double line6_pcm_release() after failed acquire.
        ASoC: omap-abe-twl6040: Correct dmic-codec device registration
        ASoC: core: Clean up DAPM before the card debugfs
        ASoC: omap-mcpdm: Drop pdmclk clock handling
        ASoC: atmel_ssc_dai: Don't unconditionally reset SSC on stream startup
        ASoC: compress: Fix leak of a widget list in soc_compr_open_fe
        ASoC: Intel: Skylake: Fix error return code in skl_probe()
        ASoC: wm2000: Fix return of uninitialised varible
        ASoC: Fix leak of rtd in soc_bind_dai_link
        ASoC: da7213: Default to 64 BCLKs per WCLK to support all formats
        ASoC: nau8825: fix static check error about semaphone control
        ASoC: nau8825: fix bug in playback when suspend
        ASoC: samsung: Fix clock handling in S3C24XX_UDA134X card
        ASoC: simple-card-utils: add missing MODULE_xxx()
        ASoC: Intel: Skylake: Check list empty while getting module info
        ...
    • Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 28687b93
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "We've queued up a few different fixes in here.  These range from
        enospc corners to fsync and quota fixes, and a few targeted at error
        handling for corrupt metadata/fuzzing"
      
      * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix lockdep warning on deadlock against an inode's log mutex
        Btrfs: detect corruption when non-root leaf has zero item
        Btrfs: check btree node's nritems
        btrfs: don't create or leak aliased root while cleaning up orphans
        Btrfs: fix em leak in find_first_block_group
        btrfs: do not background blkdev_put()
        Btrfs: clarify do_chunk_alloc()'s return value
        btrfs: fix fsfreeze hang caused by delayed iputs deal
        btrfs: update btrfs_space_info's bytes_may_use timely
        btrfs: divide btrfs_update_reserved_bytes() into two functions
        btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
        btrfs: qgroup: Fix qgroup incorrectness caused by log replay
        btrfs: relocation: Fix leaking qgroups numbers on data extents
        btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
        btrfs: waiting on qgroup rescan should not always be interruptible
        btrfs: properly track when rescan worker is running
        btrfs: flush_space: treat return value of do_chunk_alloc properly
        Btrfs: add ASSERT for block group's memory leak
        btrfs: backref: Fix soft lockup in __merge_refs function
        Btrfs: fix memory leak of reloc_root
    • Merge tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm · 370f6017
      Linus Torvalds authored
      Pull dlm fix from David Teigland:
       "This fixes a bug introduced by recent debugfs cleanup"
      
      * tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
        dlm: fix malfunction of dlm_tool caused by debugfs changes
    • Merge tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6ec675ed
      Linus Torvalds authored
      Pull device mapper fixes from Mike Snitzer:
      
       - another stable fix for DM flakey (that tweaks the previous fix that
         didn't factor in expected 'drop_writes' behavior for read IO).
      
       - a dm-log bio operation flags fix for the broader block changes that
         were merged during the 4.8 merge window.
      
      * tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm log: fix unitialized bio operation flags
        dm flakey: fix reads to be issued if drop_writes configured
    • Merge tag 'iommu-fixes-v4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 67a8c7d6
      Linus Torvalds authored
      Pull IOMMU fixes from Joerg Roedel:
       "Fixes from Will Deacon:
      
         - fix a couple of thinkos in the CMDQ error handling and
           short-descriptor page table code that have been there since day one
      
         - disable stalling faults, since they may result in hardware deadlock
      
         - fix an accidental BUG() when passing disable_bypass=1 on the
           cmdline"
      
      * tag 'iommu-fixes-v4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass
        iommu/arm-smmu: Disable stalling faults for all endpoints
        iommu/arm-smmu: Fix CMDQ error handling
        iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · fd1ae514
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Here's a set of block fixes for the current 4.8-rc release.  This
        contains:
      
         - a fix for a secure erase regression, from Adrian.
      
         - a fix for an mmc use-after-free bug regression, also from Adrian.
      
         - a potential NULL pointer dereference in bdev freezing, from Andrey.
      
         - a race fix for blk_set_queue_dying() from Bart.
      
         - a set of xen blkfront fixes from Bob Liu.
      
         - three small fixes for bcache, from Eric and Kent.
      
         - a fix for a potential invalid NVMe state transition, from Gabriel.
      
         - blk-mq CPU offline fix, preventing us from issuing and completing a
           request on the wrong queue.  From me.
      
         - revert two previous floppy changes, since they caused a
           user-visible regression.  A better fix is in the works.
      
         - ensure that we don't send down bios that have more than 256
           elements in them.  Fixes a crash with bcache, for example.  From
           Ming.
      
         - a fix for dereferencing an error pointer with cgroup writeback.
           Fixes a regression.  From Vegard"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        mmc: fix use-after-free of struct request
        Revert "floppy: refactor open() flags handling"
        Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
        fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
        blk-mq: improve warning for running a queue on the wrong CPU
        blk-mq: don't overwrite rq->mq_ctx
        block: make sure a big bio is split into at most 256 bvecs
        nvme: Fix nvme_get/set_features() with a NULL result pointer
        bdev: fix NULL pointer dereference
        xen-blkfront: free resources if xlvbd_alloc_gendisk fails
        xen-blkfront: introduce blkif_set_queue_limits()
        xen-blkfront: fix places not updated after introducing 64KB page granularity
        bcache: pr_err: more meaningful error message when nr_stripes is invalid
        bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
        bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
        block: Fix race triggered by blk_set_queue_dying()
        block: Fix secure erase
        nvme: Prevent controller state invalid transition
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · b09c412a
      Linus Torvalds authored
      Pull input subsystem fixes from Dmitry Torokhov:
       "Simply small driver fixups"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: ads7846 - remove redundant regulator_disable call
        Input: synaptics-rmi4 - fix register descriptor subpacket map construction
        Input: tegra-kbc - fix inverted reset logic
        Input: silead - use devm_gpiod_get
        Input: i8042 - set up shared ps2_cmd_mutex for AUX ports
    • Merge tag 'pci-v4.8-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 219c04ce
      Linus Torvalds authored
      Pull PCI fixes from Bjorn Helgaas:
       "Resource management:
         - Update "pci=resource_alignment" documentation (Mathias Koehrer)
      
        MSI:
         - Use positive flags in pci_alloc_irq_vectors() (Christoph Hellwig)
         - Call pci_intx() when using legacy interrupts in pci_alloc_irq_vectors() (Christoph Hellwig)
      
        Intel VMD host bridge driver:
         - Fix infinite loop executing irq's (Keith Busch)"
      
      * tag 'pci-v4.8-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        x86/PCI: VMD: Fix infinite loop executing irq's
        PCI: Call pci_intx() when using legacy interrupts in pci_alloc_irq_vectors()
        PCI: Use positive flags in pci_alloc_irq_vectors()
        PCI: Update "pci=resource_alignment" documentation
    • mm: silently skip readahead for DAX inodes · 11bd969f
      Ross Zwisler authored
      For DAX inodes we need to be careful to never have page cache pages in
      the mapping->page_tree.  This radix tree should be composed only of DAX
      exceptional entries and zero pages.
      
      ltp's readahead02 test was triggering a warning because we were trying
      to insert a DAX exceptional entry but found that a page cache page had
      already been inserted into the tree.  This page was being inserted into
      the radix tree in response to a readahead(2) call.
      
      Readahead doesn't make sense for DAX inodes, but we don't want it to
      report a failure either.  Instead, we just return success and don't do
      any work.
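      
      A minimal user-space sketch of that behaviour, not the actual patch:
      `struct toy_inode`, `toy_readahead()` and the `pages_queued` counter
      are hypothetical stand-ins, used only to show "succeed but do no
      work" for a DAX inode.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; only the DAX flag matters here. */
struct toy_inode {
    bool is_dax;
    size_t pages_queued;   /* counts the readahead work actually done */
};

/* Returns 0 on success. DAX inodes succeed without queuing any pages. */
int toy_readahead(struct toy_inode *inode, size_t nr_pages)
{
    if (inode->is_dax)
        return 0;          /* nothing to read ahead, but not an error */

    inode->pages_queued += nr_pages;
    return 0;
}
```

      The caller cannot tell the two cases apart from the return value,
      which is the point: readahead(2) on a DAX file does not fail.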
      
      Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: fix device-dax region base · d0e58455
      Dan Williams authored
      The data offset for a dax region needs to account for a reservation in
      the resource range.  Otherwise, device-dax is allowing mappings directly
      into the memmap or device-info-block area with crash signatures like the
      following:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: get_zone_device_page+0x11/0x30
       Call Trace:
         follow_devmap_pmd+0x298/0x2c0
         follow_page_mask+0x275/0x530
         __get_user_pages+0xe3/0x750
         __gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
         tdp_page_fault+0x130/0x280 [kvm]
         kvm_mmu_page_fault+0x5f/0xf0 [kvm]
         handle_ept_violation+0x94/0x180 [kvm_intel]
         vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
         kvm_vcpu_ioctl+0x33c/0x620 [kvm]
         do_vfs_ioctl+0xa2/0x5d0
         SyS_ioctl+0x79/0x90
         entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      Fixes: ab68f262 ("/dev/dax, pmem: direct access to persistent memory")
      Link: http://lkml.kernel.org/r/147205536732.1606.8994275381938837346.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@hpe.com>
      Reported-by: Toshi Kani <toshi.kani@hpe.com>
      Tested-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/seq_file: fix out-of-bounds read · 088bf2ff
      Vegard Nossum authored
      seq_read() is a nasty piece of work, not to mention buggy.
      
      It has (I think) an old bug which allows unprivileged userspace to read
      beyond the end of m->buf.
      
      I was getting these:
      
          BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880
          Read of size 2713 by task trinity-c2/1329
          CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
          Call Trace:
            kasan_object_err+0x1c/0x80
            kasan_report_error+0x2cb/0x7e0
            kasan_report+0x4e/0x80
            check_memory_region+0x13e/0x1a0
            kasan_check_read+0x11/0x20
            seq_read+0xcd2/0x1480
            proc_reg_read+0x10b/0x260
            do_loop_readv_writev.part.5+0x140/0x2c0
            do_readv_writev+0x589/0x860
            vfs_readv+0x7b/0xd0
            do_readv+0xd8/0x2c0
            SyS_readv+0xb/0x10
            do_syscall_64+0x1b3/0x4b0
            entry_SYSCALL64_slow_path+0x25/0x25
          Object at ffff880116889100, in cache kmalloc-4096 size: 4096
          Allocated:
          PID = 1329
            save_stack_trace+0x26/0x80
            save_stack+0x46/0xd0
            kasan_kmalloc+0xad/0xe0
            __kmalloc+0x1aa/0x4a0
            seq_buf_alloc+0x35/0x40
            seq_read+0x7d8/0x1480
            proc_reg_read+0x10b/0x260
            do_loop_readv_writev.part.5+0x140/0x2c0
            do_readv_writev+0x589/0x860
            vfs_readv+0x7b/0xd0
            do_readv+0xd8/0x2c0
            SyS_readv+0xb/0x10
            do_syscall_64+0x1b3/0x4b0
            return_from_SYSCALL_64+0x0/0x6a
          Freed:
          PID = 0
          (stack is not available)
          Memory state around the buggy address:
           ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          >ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      		       ^
           ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
          ==================================================================
          Disabling lock debugging due to kernel taint
      
      This seems to be the same thing that Dave Jones was seeing here:
      
        https://lkml.org/lkml/2016/8/12/334
      
      There are multiple issues here:
      
        1) If we enter the function with a non-empty buffer, there is an attempt
           to flush it. But it was not clearing m->from after doing so, which
           means that if we try to do this flush twice in a row without any call
           to traverse() in between, we are going to be reading from the wrong
           place -- the splat above, fixed by this patch.
      
        2) If there's a short write to userspace because of page faults, the
           buffer may already contain multiple lines (i.e. pos has advanced by
           more than 1), but we don't save the progress that was made so the
           next call will output what we've already returned previously. Since
           that is a much less serious issue (and I have a headache after
           staring at seq_read() for the past 8 hours), I'll leave that for now.
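      
      Issue 1 can be modelled in user space as follows. This is a
      simplified sketch, not seq_read() itself: `struct toy_seq` and
      `toy_flush()` are hypothetical, and only the `from`/`count`
      bookkeeping mirrors the m->from and m->count fields named above.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the buffer state; from/count mirror m->from and m->count. */
struct toy_seq {
    char buf[16];
    size_t from;    /* offset of the first unflushed byte */
    size_t count;   /* number of unflushed bytes */
};

/* Copy up to n pending bytes to out and return how many were copied. */
size_t toy_flush(struct toy_seq *m, char *out, size_t n)
{
    if (n > m->count)
        n = m->count;
    memcpy(out, m->buf + m->from, n);
    m->count -= n;
    m->from += n;
    if (m->count == 0)
        m->from = 0;   /* without this reset, the next flush starts too far in */
    return n;
}
```

      If the reset is left out, a buffer refilled from offset 0 is then
      read starting at the stale offset, which in the worst case runs
      past the end of the allocation.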
      
      Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: avoid unused function warning · 358c07fc
      Arnd Bergmann authored
      A bugfix in v4.8-rc2 introduced a harmless warning when
      CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:
      
        mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
         static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
      
      This moves the function inside of the #ifdef block that hides the
      calling function, to avoid the warning.
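      
      The pattern, sketched with hypothetical names (TOY_CONFIG_SWAP
      stands in for the config option, and the toy_* functions are not
      kernel code): a static helper whose only caller is conditional
      has to live under the same condition.

```c
/* Hypothetical switch standing in for the config option; undefined here. */
#ifdef TOY_CONFIG_SWAP
/* Helper and its only caller sit inside the same #ifdef block, so the
 * helper is never compiled as an unused static function. */
static int toy_id_get_online(int id)
{
    return id + 1;
}

int toy_swapout(int id)
{
    return toy_id_get_online(id);
}
#endif

/* Code outside the block builds the same either way. */
int toy_always_built(void)
{
    return 0;
}
```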
      
      Fixes: 1f47b61f ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
      Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clarify COMPACTION Kconfig text · b32eaf71
      Michal Hocko authored
      The current wording of the COMPACTION Kconfig help text doesn't
      emphasise that disabling COMPACTION might cripple the page allocator,
      which relies on compaction quite heavily for high-order requests, and
      that an unexpected OOM can happen without compaction.  Make sure we
      are vocal about that.
      
      Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • treewide: replace config_enabled() with IS_ENABLED() (2nd round) · a5ff1b34
      Masahiro Yamada authored
      Commit 97f2645f ("tree-wide: replace config_enabled() with
      IS_ENABLED()") mostly killed config_enabled(), but some new users have
      appeared for v4.8-rc1.  They are all used for a boolean option, so can
      be replaced with IS_ENABLED() safely.
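      
      For readers unfamiliar with this kind of macro, here is a
      simplified user-space imitation of the preprocessor trick. All
      TOY_*/toy_* names and the two CONFIG_TOY_* options are made up
      for illustration; this is not the kernel's kconfig.h.

```c
/* A macro defined to 1 selects this placeholder, which expands to "0,". */
#define TOY_ARG_PLACEHOLDER_1 0,
#define toy_take_second_arg(ignored, val, ...) val

/* Evaluates to 1 if x is defined to 1, and to 0 if x is not defined. */
#define toy_is_defined(x) toy_is_defined_(x)
#define toy_is_defined_(val) toy_is_defined__(TOY_ARG_PLACEHOLDER_##val)
#define toy_is_defined__(arg1_or_junk) \
    toy_take_second_arg(arg1_or_junk 1, 0, 0)

/* True if the option is built in (=y) or built as a module (=m). */
#define TOY_IS_ENABLED(option) \
    (toy_is_defined(option) || toy_is_defined(option##_MODULE))

#define CONFIG_TOY_FEATURE 1   /* hypothetical boolean option, enabled */
/* CONFIG_TOY_OTHER is deliberately left undefined. */

int toy_feature_on(void)
{
    return TOY_IS_ENABLED(CONFIG_TOY_FEATURE);
}

int toy_other_on(void)
{
    return TOY_IS_ENABLED(CONFIG_TOY_OTHER);
}
```

      Because the result is an ordinary constant expression, it can be
      used in a plain C `if` as well as in `#if`.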
      
      Link: http://lkml.kernel.org/r/1471970749-24867-1-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: fix parsing of "brl=" option · ae6c33ba
      Nicolas Iooss authored
      Commit bbeddf52 ("printk: move braille console support into separate
      braille.[ch] files") moved the parsing of braille-related options into
      _braille_console_setup(), changing the type of variable str from char*
      to char**.  In this commit, memcmp(str, "brl,", 4) was correctly updated
      to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).
      
      Update the code to make "brl=" option work again and replace memcmp()
      with strncmp() to make the compiler able to detect such an issue.
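      
      A small user-space sketch of why the mistake went unnoticed (the
      toy_match_* helpers are hypothetical, not the kernel function):
      memcmp() takes `const void *`, so a `char **` is accepted
      silently, whereas strncmp() takes `const char *` and would draw
      an incompatible-pointer diagnostic.

```c
#include <string.h>

/* Buggy form: compares the bytes of the pointer variable itself, not the
 * string it points to. Compiles cleanly because memcmp() takes void *. */
int toy_match_buggy(char **str)
{
    return memcmp(str, "brl=", 4) == 0;
}

/* Fixed form: dereference first. Passing str instead of *str here would
 * be flagged by the compiler, since strncmp() expects const char *. */
int toy_match_fixed(char **str)
{
    return strncmp(*str, "brl=", 4) == 0;
}
```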
      
      Fixes: bbeddf52 ("printk: move braille console support into separate braille.[ch] files")
      Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • soft_dirty: fix soft_dirty during THP split · 804dd150
      Andrea Arcangeli authored
      While adding proper userfaultfd_wp support with bits in pagetable and
      swap entry to avoid false positive WP userfaults through swap/fork/
      KSM/etc, I've been adding a framework that mostly mirrors soft dirty.
      
      So I noticed in one place I had to add uffd_wp support to the pagetables
      that wasn't covered by soft_dirty and I think it should have.
      
      Example: in the THP migration code migrate_misplaced_transhuge_page()
      pmd_mkdirty is called unconditionally after mk_huge_pmd.
      
      	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
      	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
      
      That sets soft dirty too (it's a false positive for soft dirty, the soft
      dirty bit could be more finegrained and transfer the bit like uffd_wp
      will do..  pmd/pte_uffd_wp() enforces the invariant that when it's set
      pmd/pte_write is not set).
      
      However in the THP split there's no unconditional pmd_mkdirty after
      mk_huge_pmd and pte_swp_mksoft_dirty isn't called after the migration
      entry is created.  The code sets the dirty bit in the struct page
      instead of setting it in the pagetable (which is fully equivalent as far
      as the real dirty bit is concerned, as the whole point of pagetable bits
      is to be eventually flushed out to the page, but that is not
      equivalent for the soft-dirty bit that gets lost in translation).
      
      This was found by code review only and totally untested as I'm working
      to actually replace soft dirty and I don't have time to test potential
      soft dirty bugfixes as well :).
      
      Transfer the soft_dirty from pmd to pte during THP splits.
      
      This fix avoids losing the soft_dirty bit and avoids userland memory
      corruption in the checkpoint.
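      
      The idea of carrying the bit across the split, as a toy sketch
      with plain flag words (TOY_SOFT_DIRTY and toy_split_pte() are
      hypothetical and do not model real page table entries):

```c
/* Hypothetical flag bit standing in for the soft-dirty bit. */
#define TOY_SOFT_DIRTY 0x1u

/* Build the flags of one small entry produced by splitting a huge entry:
 * if the huge entry was soft-dirty, each resulting entry must be too. */
unsigned int toy_split_pte(unsigned int pmd_flags, unsigned int pte_flags)
{
    if (pmd_flags & TOY_SOFT_DIRTY)
        pte_flags |= TOY_SOFT_DIRTY;
    return pte_flags;
}
```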
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: handle error writing UINT_MAX to u32 fields · e7d316a0
      Subash Abhinov Kasiviswanathan authored
      We have scripts which write to certain fields on 3.18 kernels but this
      seems to be failing on 4.4 kernels.  An entry which we write to here is
      xfrm_aevent_rseqth which is u32.
      
        echo 4294967295  > /proc/sys/net/core/xfrm_aevent_rseqth
      
      Commit 230633d1 ("kernel/sysctl.c: detect overflows when converting
      to int") prevented writing to sysctl entries when integer overflow
      occurs.  However, this does not apply to unsigned integers.
      
      Heinrich suggested that we introduce a new option to handle 64 bit
      limits and set min as 0 and max as UINT_MAX.  This might not work as it
      leads to issues similar to __do_proc_doulongvec_minmax.  Alternatively,
      we would need to change the datatype of the entry to 64 bit.
      
        static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
        {
            i = (unsigned long *) data;   /* this cast makes us read beyond the size of data (u32) */
            vleft = table->maxlen / sizeof(unsigned long);   /* vleft is 0: maxlen is sizeof(u32), which is less than sizeof(unsigned long) on x86_64 */
      
      Introduce a new proc handler proc_douintvec.  Individual proc entries
      will need to be updated to use the new handler.
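      
      The two problems described above can be reproduced in user space.
      This is a sketch under stated assumptions, not proc_douintvec:
      `toy_vleft()` and `toy_parse_u32()` are hypothetical helpers.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Element count computed the way the excerpt above does it:
 * buffer length divided by the element size the handler assumes. */
size_t toy_vleft(size_t maxlen, size_t elem_size)
{
    return maxlen / elem_size;
}

/* Parse a decimal string into a u32. Values up to UINT32_MAX are accepted;
 * anything larger, or any trailing junk, is rejected with -EINVAL. */
int toy_parse_u32(const char *s, uint32_t *out)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno || end == s || *end != '\0' || v > UINT32_MAX)
        return -EINVAL;

    *out = (uint32_t)v;
    return 0;
}
```

      With a 4-byte field and an 8-byte element size, toy_vleft()
      yields zero elements, which is why a handler sized for
      unsigned long cannot serve a u32 entry.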
      
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: 230633d1 ("kernel/sysctl.c:detect overflows when converting to int")
      Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • get_maintainer: quiet noisy implicit -f vcs_file_exists checking · 8582fb59
      Joe Perches authored
      Checking command line filenames that are outside the git tree can emit a
      noisy and confusing message.
      
      Quiet that message by redirecting stderr.
      Verify that the command was executed successfully.
      
      Fixes: 4cad35a7 ("get_maintainer.pl: reduce need for command-line option -f")
      Link: http://lkml.kernel.org/r/1970a1d2fecb258e384e2e4fdaacdc9ccf3e30a4.1470955439.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Reported-by: Wolfram Sang <wsa@the-dreams.de>
      Tested-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • byteswap: don't use __builtin_bswap*() with sparse · 101b29a2
      Johannes Berg authored
      Although sparse declares __builtin_bswap*(), it can't actually do
      constant folding inside them (yet).  As such, things like
      
        switch (protocol) {
        case htons(ETH_P_IP):
                break;
        }
      
      which we do all over the place cause sparse to warn that it expects a
      constant instead of a function call.
      
      Disable __HAVE_BUILTIN_BSWAP*__ if __CHECKER__ is defined to avoid this.
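      
      To illustrate why a case label needs something the checker can
      fold, here is a sketch using a pure-expression swap macro. The
      TOY_* macros and toy_is_ipv4() are hypothetical; only the value
      0x0800 for the IPv4 EtherType is a real constant.

```c
#include <stdint.h>

/* Pure-expression 16-bit byte swap: an integer constant expression
 * whenever x is one, so it is valid in a case label. */
#define TOY_CONSTANT_SWAB16(x) \
    ((uint16_t)((((uint16_t)(x) & 0x00ffu) << 8) | \
                (((uint16_t)(x) & 0xff00u) >> 8)))

#define TOY_ETH_P_IP 0x0800   /* IPv4 EtherType */

/* proto holds the EtherType with its two bytes swapped. */
int toy_is_ipv4(uint16_t proto)
{
    switch (proto) {
    case TOY_CONSTANT_SWAB16(TOY_ETH_P_IP):
        return 1;
    default:
        return 0;
    }
}
```

      A call to a builtin function in the same position is only
      acceptable to a tool that knows how to evaluate that builtin at
      analysis time.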
      
      Fixes: 7322dd75 ("byteswap: try to avoid __builtin_constant_p gcc bug")
      Link: http://lkml.kernel.org/r/1470914102-26389-1-git-send-email-johannes@sipsolutions.net
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 26 Aug, 2016 1 commit
    • dlm: fix malfunction of dlm_tool caused by debugfs changes · 079d37df
      Eric Ren authored
      With the current kernel, `dlm_tool lockdebug` fails as below:
      
      "dlm_tool lockdebug ED0BD86DCE724393918A1AE8FDBF1EE3
      can't open /sys/kernel/debug/dlm/ED0BD86DCE724393918A1AE8FDBF1EE3:
      Operation not permitted"
      
      This is because table_open() depends on file->f_op to tell which
      seq_file ops should be passed down. However, the original file ops in
      file->f_op were replaced by "debugfs_full_proxy_file_operations" in
      commit 49d200de ("debugfs: prevent access to removed files'
      private data").
      
      Currently, I can think of two solutions: first, replace
      debugfs_create_file() with debugfs_create_file_unsafe();
      second, create different table_open#() functions accordingly. The
      first one is neat, but I don't thoroughly understand its risk. Maybe
      someone has a better one.
      Signed-off-by: Eric Ren <zren@suse.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
      079d37df
  3. 25 Aug, 2016 17 commits
    • Adrian Hunter's avatar
      mmc: fix use-after-free of struct request · 869c5548
      Adrian Hunter authored
      We call mmc_req_is_special() after having processed a request, but
      the request could have been freed by then. Check it ahead of time
      and use the cached value.
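
      A minimal user-space sketch of that pattern, not the mmc driver code:
      struct request here is a hypothetical stand-in with a single flag, and
      process_request() and issue_request() are made-up names. The point is
      only that the flag is read before the request is handed to code that
      may free it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block-layer request. */
struct request {
	bool special;
};

/* Processing consumes the request: it is freed before this returns. */
static void process_request(struct request *req)
{
	free(req);
}

/* Reads the flag before processing, so req is never touched afterwards. */
bool issue_request(struct request *req)
{
	bool was_special;

	assert(req != NULL);
	was_special = req->special;	/* cache while req is still valid */

	process_request(req);
	return was_special;		/* req must not be dereferenced here */
}
```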
      Reported-by: Hans de Goede <hdegoede@redhat.com>
      Tested-by: Hans de Goede <hdegoede@redhat.com>
      Fixes: c2df40df ("drivers: use req op accessor")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      869c5548
    • Takashi Iwai's avatar
      Merge tag 'asoc-fix-v4.8-rc4' of... · a820cd3d
      Takashi Iwai authored
      Merge tag 'asoc-fix-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
      
      ASoC: Fixes for v4.8
      
      A clutch of fixes for v4.8.  These are mainly driver specific, the most
      notable ones being those for OMAP which fix a series of issues that
      broke boot on some platforms there when deferred probe kicked in.
      There's also one core fix for an issue when unbinding a card which for
      some reason had managed to not manifest until recently.
      a820cd3d
    • Doug Ledford's avatar
      049b1e7c
    • Tatyana Nikolova's avatar
      i40iw: Send last streaming mode message for loopback connections · 07c72d7d
      Tatyana Nikolova authored
      Send a zero-length last streaming mode message for loopback
      connections to synchronize the accepting QP and the connecting QP.
      This prevents data transfer from starting on the accepting QP before
      the connecting QP is in RTS. Also remove the function
      i40iw_loopback_nop(), as it is no longer used.
      
      Fixes: f27b4746 ("i40iw: add connection management code")
      Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      07c72d7d
    • Jens Axboe's avatar
      Revert "floppy: refactor open() flags handling" · f2791e7e
      Jens Axboe authored
      This reverts commit 09954bad.
      f2791e7e
    • Jens Axboe's avatar
      Revert "floppy: fix open(O_ACCMODE) for ioctl-only open" · 468c298a
      Jens Axboe authored
      This reverts commit ff06db1e.
      468c298a
    • Andrey Ryabinin's avatar
      fs/block_dev: fix potential NULL ptr deref in freeze_bdev() · 5bb53c0f
      Andrey Ryabinin authored
      When freeze_bdev() is called twice on the same block device without a
      mounted filesystem, get_super() returns NULL, which leads to a NULL
      pointer dereference later in drop_super().
      
      Check the get_super() result to fix that.
      
      Note that this is a purely theoretical issue. We have only three
      freeze_bdev() callers. Two of them are in filesystem code and are used
      on a device with a mounted fs. The third one, in lock_fs(), is protected
      by upper-layer code against freezing a block device a second time
      without thawing it first.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5bb53c0f
    • Filipe Manana's avatar
      Btrfs: fix lockdep warning on deadlock against an inode's log mutex · 28a23593
      Filipe Manana authored
      Commit 44f714da ("Btrfs: improve performance on fsync against new
      inode after rename/unlink"), which landed in 4.8-rc2, introduced a
      possibility for a deadlock due to double locking of an inode's log mutex
      by the same task, which lockdep reports with:
      
      [23045.433975] =============================================
      [23045.434748] [ INFO: possible recursive locking detected ]
      [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
      [23045.436044] ---------------------------------------------
      [23045.436044] xfs_io/3688 is trying to acquire lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     but task is already holding lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     other info that might help us debug this:
      [23045.436044]  Possible unsafe locking scenario:
      
      [23045.436044]        CPU0
      [23045.436044]        ----
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]
                      *** DEADLOCK ***
      
      [23045.436044]  May be due to missing lock nesting notation
      
      [23045.436044] 3 locks held by xfs_io/3688:
      [23045.436044]  #0:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs]
      [23045.436044]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0
      [23045.436044]  #2:  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     stack backtrace:
      [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
      [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [23045.436044]  0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
      [23045.436044]  ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
      [23045.436044]  0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
      [23045.436044] Call Trace:
      [23045.436044]  [<ffffffff8127074d>] dump_stack+0x67/0x90
      [23045.436044]  [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e
      [23045.436044]  [<ffffffff8109155f>] ? mark_lock+0x24/0x201
      [23045.436044]  [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74
      [23045.436044]  [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
      [23045.436044]  [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465
      [23045.436044]  [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs]
      [23045.436044]  [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
      [23045.436044]  [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
      [23045.436044]  [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs]
      [23045.436044]  [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e
      [23045.436044]  [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e
      [23045.436044]  [<ffffffff811acc79>] do_fsync+0x31/0x4a
      [23045.436044]  [<ffffffff811ace99>] SyS_fsync+0x10/0x14
      [23045.436044]  [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [23045.436044]  [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa
      
      An example reproducer for this is:
      
         $ mkfs.btrfs -f /dev/sdb
         $ mount /dev/sdb /mnt
         $ mkdir /mnt/dir
         $ touch /mnt/dir/foo
         $ sync
         $ mv /mnt/dir/foo /mnt/dir/bar
         $ touch /mnt/dir/foo
         $ xfs_io -c "fsync" /mnt/dir/bar
      
      This is because while logging the inode of file bar we end up logging its
      parent directory (since its inode has an unlink_trans field matching the
      current transaction id due to the rename operation), which in turn logs
      the inodes for all its new dentries, so that the new inode for the new
      file named foo gets logged which in turn triggered another logging attempt
      for the inode we are fsync'ing, since that inode had an old name that
      corresponds to the name of the new inode.
      
      So fix this by ensuring that when logging the inode for a new dentry that
      has a name matching an old name of some other inode, we don't log again
      the original inode that we are fsync'ing.
      
      Fixes: 44f714da ("Btrfs: improve performance on fsync against new inode after rename/unlink")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      28a23593
    • Liu Bo's avatar
      Btrfs: detect corruption when non-root leaf has zero item · 1ba98d08
      Liu Bo authored
      Right now we treat a leaf that has zero items as valid
      because we could have an empty tree, that is, a root that is
      also a leaf without any items. However, when such a leaf is
      not a root, we can end up hitting the
      BUG_ON(1) in btrfs_extend_item() called by
      setup_inline_extent_backref().
      
      Treat this situation as corruption if the leaf is
      not its own root.
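
      The rule can be sketched as a small validity check. This is an
      illustration only: struct leaf_info, its fields and check_leaf_items()
      are hypothetical simplifications, and returning -EIO for corruption is
      an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a leaf header. */
struct leaf_info {
	unsigned int nritems;	/* number of items stored in the leaf */
	bool is_tree_root;	/* true when the leaf is also its tree's root */
};

/* Returns 0 for a plausible leaf, -EIO when it looks corrupted. */
int check_leaf_items(const struct leaf_info *leaf)
{
	assert(leaf != NULL);

	/* An empty leaf is valid only as the root of an empty tree. */
	if (leaf->nritems == 0 && !leaf->is_tree_root)
		return -EIO;
	return 0;
}
```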
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1ba98d08
    • Liu Bo's avatar
      Btrfs: check btree node's nritems · 053ab70f
      Liu Bo authored
      When a btree node (level = 1) has nritems equal to zero,
      we can end up with a panic due to insert_ptr()'s
      
      BUG_ON(slot > nritems);
      
      where slot is 1 and nritems is 0, as copy_for_split() calls
      insert_ptr(.., path->slots[1] + 1, ...);
      
      An invalid value causes the whole mess, so this adds a check
      of the btree node's nritems so that we stop reading the block
      when something is wrong.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      053ab70f
    • Jeff Mahoney's avatar
      btrfs: don't create or leak aliased root while cleaning up orphans · 35bbb97f
      Jeff Mahoney authored
      commit 909c3a22 (Btrfs: fix loading of orphan roots leading to BUG_ON)
      avoids the BUG_ON but can add an aliased root to the dead_roots list or
      leak the root.
      
      Since we've already been loading roots into the radix tree, we should
      use it before looking the root up on disk.
      
      Cc: <stable@vger.kernel.org> # 4.5
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      35bbb97f
    • Josef Bacik's avatar
      Btrfs: fix em leak in find_first_block_group · 187ee58c
      Josef Bacik authored
      We need to call free_extent_map() on the em we look up.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      187ee58c
    • Anand Jain's avatar
      btrfs: do not background blkdev_put() · 14238819
      Anand Jain authored
      At the end of unmount/dev-delete, if the exclusive open of the device
      is not actually closed, there may be a race with another userland
      program that is trying to open the device in exclusive mode, and
      that open may fail, e.g.:
            umount /btrfs; fsck /dev/x
            btrfs dev del /dev/x /btrfs; fsck /dev/x
      so doing blkdev_put() in the background is not an option here.
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      14238819
    • Liu Bo's avatar
      Btrfs: clarify do_chunk_alloc()'s return value · 28b737f6
      Liu Bo authored
      Function start_transaction() can return ERR_PTR(1) when flush is
      BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is
      
      start_transaction (return ERR_PTR(1))
        -> btrfs_block_rsv_add (return 1)
           -> reserve_metadata_bytes (return 1)
              -> flush_space (return 1)
                 -> do_chunk_alloc  (return 1)
      
      With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
      flush_state of ALLOC_CHUNK and it successfully allocates a new
      chunk, then instead of trying to reserve space again,
      reserve_metadata_bytes returns 1 immediately.
      
      Callers of start_transaction() usually only do the IS_ERR() check,
      which ERR_PTR(1) passes as a non-error, so they
      panic when dereferencing the pointer that is really ERR_PTR(1).
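
      Why ERR_PTR(1) slips through can be shown with a user-space sketch
      modelled on the kernel's error-pointer helpers. The lowercase names
      err_ptr() and is_err() are stand-ins written for this illustration;
      the assumption is that, as in the kernel, only the values -1 through
      -MAX_ERRNO count as encoded errors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest errno value that can be encoded in a pointer. */
#define MAX_ERRNO 4095

static_assert(sizeof(uintptr_t) >= sizeof(void *),
	      "a pointer must fit in uintptr_t");

/* Encodes an error code in a pointer value. */
void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True only for the top MAX_ERRNO addresses, i.e. codes -1..-MAX_ERRNO. */
bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

      A positive value such as 1 maps to a tiny address near zero, so
      is_err() reports "no error" and the caller goes on to dereference it.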
      
      The following patch fixes the above problem.
      "btrfs: flush_space: treat return value of do_chunk_alloc properly"
      https://patchwork.kernel.org/patch/7778651/
      
      This adds comments to clarify do_chunk_alloc()'s return value.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      28b737f6
    • Wang Xiaoguang's avatar
      btrfs: fix fsfreeze hang caused by delayed iputs deal · 9e7cc91a
      Wang Xiaoguang authored
      When running fstests generic/068, we sometimes got the deadlock below:
        xfs_io          D ffff8800331dbb20     0  6697   6693 0x00000080
        ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
        ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
        ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
        Call Trace:
        [<ffffffff816a9045>] schedule+0x35/0x80
        [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
        [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
        [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
        [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
        [<ffffffff810d32b5>] percpu_down_read+0x35/0x50
        [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
        [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
        [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
        [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
        [<ffffffff81230a1a>] evict+0xba/0x1a0
        [<ffffffff812316b6>] iput+0x196/0x200
        [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
        [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
        [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
        [<ffffffff81218040>] freeze_super+0xf0/0x190
        [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
        [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
        [<ffffffff81229409>] SyS_ioctl+0x79/0x90
        [<ffffffff81003c12>] do_syscall_64+0x62/0x110
        [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25
      
      From this warning, freeze_super() already holds SB_FREEZE_FS, but
      btrfs_freeze() will call btrfs_commit_transaction() again. If
      btrfs_commit_transaction() finds that it has delayed iputs to handle,
      it calls start_transaction(), which tries to get the SB_FREEZE_FS lock
      again, and a deadlock occurs.
      
      The root cause is that in btrfs, sync_filesystem(sb) does not make
      sure all metadata is updated. There may still be code adding
      delayed iputs; see the sample race window below:
      
               CPU1                                  |         CPU2
      |-> freeze_super()                             |
          |-> sync_filesystem(sb);                   |
          |                                          |-> cleaner_kthread()
          |                                          |   |-> btrfs_delete_unused_bgs()
          |                                          |       |-> btrfs_remove_chunk()
          |                                          |           |-> btrfs_remove_block_group()
          |                                          |               |-> btrfs_add_delayed_iput()
          |                                          |
          |-> sb->s_writers.frozen = SB_FREEZE_FS;   |
          |-> sb_wait_write(sb, SB_FREEZE_FS);       |
          |   acquire SB_FREEZE_FS lock.             |
          |                                          |
          |-> btrfs_freeze()                         |
              |-> btrfs_commit_transaction()         |
                  |-> btrfs_run_delayed_iputs()      |
                  |   will handle delayed iputs,     |
                  |   that means start_transaction() |
                  |   will be called, which will try |
                  |   to get SB_FREEZE_FS lock.      |
      
      To fix this issue, introduce an "int fs_frozen" to record internally whether
      the fs has been frozen. If the fs has been frozen, we cannot handle delayed iputs.
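
      A simplified sketch of how such a flag can gate the work. Only the
      fs_frozen name comes from the text above; struct fs_state, its
      delayed_iputs counter and run_delayed_iputs_if_thawed() are
      hypothetical and stand in for the real commit path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified filesystem state. */
struct fs_state {
	int fs_frozen;		/* non-zero once the filesystem is frozen */
	int delayed_iputs;	/* number of queued delayed iputs */
};

/* Runs the queued iputs unless frozen; returns how many were handled. */
int run_delayed_iputs_if_thawed(struct fs_state *fs)
{
	int handled;

	assert(fs != NULL);
	if (fs->fs_frozen)
		return 0;	/* would need a transaction: leave them queued */

	handled = fs->delayed_iputs;
	fs->delayed_iputs = 0;
	return handled;
}
```

      While frozen, the queued iputs simply stay pending, so nothing on the
      freeze path tries to start a new transaction.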
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment to btrfs_freeze ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9e7cc91a
    • Wang Xiaoguang's avatar
      btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang authored
      This patch fixes some false ENOSPC errors. The test script below
      reproduces one such false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The above script fails with ENOSPC, even though the fs still has enough
      free space to satisfy this request. Please see the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      In CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
      is delayed until extent_clear_unlock_delalloc().
      Assume that in this case btrfs_reserve_extent() reserved 128MB of data and
      CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk, call
      btrfs_start_delalloc_roots(), or commit the current transaction in order to
      reserve some free space, which is obviously a lot of work. But this is not
      necessary: if bytes_may_use were decreased in time (by 128MB here), we
      would still have free space.
      
      To fix this issue, this patch updates bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). In the compress path, the
      real extent length may not equal the file content length, and
      bytes_may_use is increased by the file content length, so introduce a
      ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(). The compress path can then update bytes_may_use
      correctly. We can now also discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
      and RESERVE_FREE.
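
      The accounting change can be illustrated with a user-space sketch.
      The field names follow the formula quoted earlier in this message;
      struct space_info, space_info_free(), add_reserved_old() and
      add_reserved_new() are hypothetical helpers, and the numbers in the
      note after the code are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified counters, named after the fields used in the message. */
struct space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_may_use;
};

/* Free space left after subtracting every counter from total_bytes. */
uint64_t space_info_free(const struct space_info *s)
{
	uint64_t committed = s->bytes_used + s->bytes_reserved +
			     s->bytes_pinned + s->bytes_readonly +
			     s->bytes_may_use;

	return committed >= s->total_bytes ? 0 : s->total_bytes - committed;
}

/* Old behaviour: the reservation is counted while bytes_may_use stays. */
void add_reserved_old(struct space_info *s, uint64_t num_bytes)
{
	s->bytes_reserved += num_bytes;
}

/* New behaviour: move the space from bytes_may_use to bytes_reserved. */
void add_reserved_new(struct space_info *s, uint64_t ram_bytes,
		      uint64_t num_bytes)
{
	assert(ram_bytes <= s->bytes_may_use);
	s->bytes_reserved += num_bytes;
	s->bytes_may_use -= ram_bytes;
}
```

      With an assumed total of 100 MiB and 64 MiB already in bytes_may_use,
      the old helper counts the same 64 MiB twice and leaves no free space,
      while the new helper leaves 36 MiB.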
      
      EXTENT_DO_ACCOUNTING is usually used for the error path. In
      run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked as
      PREALLOC, we also need to update bytes_may_use, but cannot pass
      EXTENT_DO_ACCOUNTING because it also clears the metadata reservation. So
      here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell btrfs_clear_bit_hook()
      to update btrfs_space_info's bytes_may_use.
      
      Meanwhile, __btrfs_prealloc_file_range() now calls
      btrfs_free_reserved_data_space() internally on both the successful and failed
      paths, so btrfs_prealloc_file_range()'s callers no longer need to call
      btrfs_free_reserved_data_space().
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      18513091
    • Wang Xiaoguang's avatar
      btrfs: divide btrfs_update_reserved_bytes() into two functions · 4824f1f4
      Wang Xiaoguang authored
      This patch divides btrfs_update_reserved_bytes() into
      btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(). The
      next patch will extend btrfs_add_reserved_bytes() to fix some
      false ENOSPC errors; please see that patch for details.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      4824f1f4