1. 28 May, 2016 5 commits
  2. 27 May, 2016 35 commits
    • Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs · 0121a322
      Linus Torvalds authored
      Pull overlayfs update from Miklos Szeredi:
       "The meat of this is a change to use the mounter's credentials for
        operations that require elevated privileges (such as whiteout
        creation).  This fixes behavior under user namespaces as well as being
        a nice cleanup"
      
      * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
        ovl: Do d_type check only if work dir creation was successful
        ovl: update documentation
        ovl: override creds with the ones from the superblock mounter
    • Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 559b6d90
      Linus Torvalds authored
      Pull btrfs cleanups and fixes from Chris Mason:
       "We have another round of fixes and a few cleanups.
      
        I have a fix for short returns from btrfs_copy_from_user, which
        finally nails down a very hard to find regression we added in v4.6.
      
        Dave is pushing around gfp parameters, mostly to cleanup internal apis
        and make it a little more consistent.
      
        The rest are smaller fixes, and one speelling fixup patch"
      
      * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
        Btrfs: fix handling of faults from btrfs_copy_from_user
        btrfs: fix string and comment grammatical issues and typos
        btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
        Btrfs: fix unexpected return value of fiemap
        Btrfs: free sys_array eb as soon as possible
        btrfs: sink gfp parameter to convert_extent_bit
        btrfs: make state preallocation more speculative in __set_extent_bit
        btrfs: untangle gotos a bit in convert_extent_bit
        btrfs: untangle gotos a bit in __clear_extent_bit
        btrfs: untangle gotos a bit in __set_extent_bit
        btrfs: sink gfp parameter to set_record_extent_bits
        btrfs: sink gfp parameter to set_extent_new
        btrfs: sink gfp parameter to set_extent_defrag
        btrfs: sink gfp parameter to set_extent_delalloc
        btrfs: sink gfp parameter to clear_extent_dirty
        btrfs: sink gfp parameter to clear_record_extent_bits
        btrfs: sink gfp parameter to clear_extent_bits
        btrfs: sink gfp parameter to set_extent_bits
        btrfs: make find_workspace warn if there are no workspaces
        btrfs: make find_workspace always succeed
        ...
    • make IS_ERR_VALUE() complain about non-pointer-sized arguments · aa00edc1
      Linus Torvalds authored
      Now that the allmodconfig x86-64 build is clean wrt IS_ERR_VALUE() uses
      on integers, add a cast to a pointer and back to the argument, so that
      any new mis-uses of IS_ERR_VALUE() will cause warnings like
      
         warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      
      so that we don't re-introduce any bogus uses.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
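
      A minimal userspace sketch of the idea described in this commit,
      assuming a Linux-style ABI in which 'unsigned long' and pointers have
      the same size.  The macro body is an illustration, not a verbatim copy
      of the kernel header, and addr_is_error() is a hypothetical helper.

```c
#include <stdbool.h>

#define MAX_ERRNO 4095

/*
 * Illustrative definition: the round trip through (void *) leaves a
 * pointer-sized argument unchanged, but lets the compiler emit
 * -Wint-to-pointer-cast when x is an integer of a different size.
 */
#define IS_ERR_VALUE(x) \
	((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

/* Hypothetical helper: the argument is pointer-sized, so no warning. */
static bool addr_is_error(unsigned long addr)
{
	return IS_ERR_VALUE(addr);
}
```

      Passing an 'int' or 'unsigned int' to a macro of this shape on a
      64-bit build is the kind of use that would be expected to trigger the
      warning quoted in the commit message.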
    • mm: remove more IS_ERR_VALUE abuses · 5d22fc25
      Linus Torvalds authored
      The do_brk() and vm_brk() return value was "unsigned long" and returned
      the starting address on success, and an error value on failure.  The
      reasons are entirely historical, and go back to it basically behaving
      like the mmap() interface does.
      
      However, nobody actually wanted that interface, and it causes totally
      pointless IS_ERR_VALUE() confusion.
      
      What every single caller actually wants is just the simpler integer
      return of zero for success and negative error number on failure.
      
      So just convert to that much clearer and more common calling convention,
      and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove lots of IS_ERR_VALUE abuses · 287980e4
      Arnd Bergmann authored
      Most users of IS_ERR_VALUE() in the kernel are wrong, as they
      pass an 'int' into a function that takes an 'unsigned long'
      argument. This happens to work because the type is sign-extended
      on 64-bit architectures before it gets converted into an
      unsigned type.
      
      However, anything that passes an 'unsigned short' or 'unsigned int'
      argument into IS_ERR_VALUE() is guaranteed to be broken, as are
      8-bit integers and types that are wider than 'unsigned long'.
      
      Andrzej Hajda has already fixed a lot of the worst abusers that
      were causing actual bugs, but it would be nice to prevent any
      users that are not passing 'unsigned long' arguments.
      
      This patch changes all users of IS_ERR_VALUE() that I could find
      on 32-bit ARM randconfig builds and x86 allmodconfig. For the
      moment, this doesn't change the definition of IS_ERR_VALUE()
      because there are probably still architecture specific users
      elsewhere.
      
      Almost all the warnings I got are for files that are better off
      using 'if (err)' or 'if (err < 0)'.
      The only legitimate user I could find that we get a warning for
      is the (32-bit only) freescale fman driver, so I did not remove
      the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
      For 9pfs, I just worked around one user whose calling conventions
      are so obscure that I did not dare change the behavior.
      
      I was using this definition for testing:
      
       #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
             unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
      
      which ends up making all 16-bit or wider types work correctly with
      the most plausible interpretation of what IS_ERR_VALUE() was supposed
      to return according to its users, but also causes a compile-time
      warning for any users that do not pass an 'unsigned long' argument.
      
      I suggested this approach earlier this year, but back then we ended
      up deciding to just fix the users that are obviously broken. After
      the initial warning that caused me to get involved in the discussion
      (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
      asked me to send the whole thing again.
      
      [ Updated the 9p parts as per Al Viro  - Linus ]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.org/lkml/2016/1/7/363
      Link: https://lkml.org/lkml/2016/5/27/486
      Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
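
      The sign-extension point made in this commit can be illustrated with a
      small userspace sketch.  It assumes the older form of the check (a
      plain comparison on an 'unsigned long' argument); all function names
      here are hypothetical and exist only for the illustration.

```c
#include <stdbool.h>

#define MAX_ERRNO 4095

/* Assumed old-style check: the caller's value is converted to unsigned long. */
static bool is_err_value_ul(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/* An 'int' is sign-extended, so -ENOMEM style values still look like errors. */
static bool check_int(int err)
{
	return is_err_value_ul((unsigned long)err);
}

/*
 * An 'unsigned int' is zero-extended.  Where unsigned long is wider than
 * unsigned int, 0xfffffff4 no longer falls in the top MAX_ERRNO values.
 */
static bool check_uint(unsigned int err)
{
	return is_err_value_ul((unsigned long)err);
}

/* An 'unsigned short' can never reach the error range. */
static bool check_ushort(unsigned short err)
{
	return is_err_value_ul((unsigned long)err);
}
```

      On a 64-bit build, check_int(-12) is true while
      check_uint((unsigned int)-12) is false, which is the breakage the
      commit message describes.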
    • mm: fix section mismatch warning · 7ded384a
      Linus Torvalds authored
      The register_page_bootmem_info_node() function needs to be marked __init
      in order to avoid a new warning introduced by commit f65e91df ("mm:
      use early_pfn_to_nid in register_page_bootmem_info_node").
      
      Otherwise you'll get a warning about how a non-init function calls
      early_pfn_to_nid (which is __meminit)
      
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'akpm' (patches from Andrew) · af7d9372
      Linus Torvalds authored
      Merge misc updates and fixes from Andrew Morton:
      
       - late-breaking ocfs2 updates
      
       - random bunch of fixes
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: disable DEFERRED_STRUCT_PAGE_INIT on !NO_BOOTMEM
        mm/memcontrol.c: move comments for get_mctgt_type() to proper position
        mm/memcontrol.c: fix the margin computation in mem_cgroup_margin()
        mm/cma: silence warnings due to max() usage
        mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()
        oom_reaper: close race with exiting task
        mm: use early_pfn_to_nid in register_page_bootmem_info_node
        mm: use early_pfn_to_nid in page_ext_init
        MAINTAINERS: Kdump maintainers update
        MAINTAINERS: add kexec_core.c and kexec_file.c
        mm: oom: do not reap task if there are live threads in threadgroup
        direct-io: fix direct write stale data exposure from concurrent buffered read
        ocfs2: bump up o2cb network protocol version
        ocfs2: o2hb: fix hb hung time
        ocfs2: o2hb: don't negotiate if last hb fail
        ocfs2: o2hb: add some user/debug log
        ocfs2: o2hb: add NEGOTIATE_APPROVE message
        ocfs2: o2hb: add NEGO_TIMEOUT message
        ocfs2: o2hb: add negotiate timer
    • mm: disable DEFERRED_STRUCT_PAGE_INIT on !NO_BOOTMEM · 11e68567
      Gavin Shan authored
      When we have !NO_BOOTMEM, the deferred page struct initialization
      doesn't work well because the pages reserved in bootmem are released to
      the page allocator unconditionally.  It causes memory corruption and
      system crash eventually.
      
      As Mel suggested, the bootmem is retiring slowly.  We fix the issue by
      simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.
      
      Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: move comments for get_mctgt_type() to proper position · 7cf7806c
      Li RongQing authored
      Move the comments for get_mctgt_type() to be before get_mctgt_type()
      implementation.
      
      Link: http://lkml.kernel.org/r/1463644638-7446-1-git-send-email-roy.qing.li@gmail.com
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: fix the margin computation in mem_cgroup_margin() · cbedbac3
      Li RongQing authored
      mem_cgroup_margin() might return (memory.limit - memory_count) when the
      memsw.limit is in excess.  This doesn't happen usually because we do not
      allow excess on hard limits and (memory.limit <= memsw.limit), but
      __GFP_NOFAIL charges can force the charge and cause the excess when no
      memory is really swappable (swap is full or no anonymous memory is
      left).
      
      [mhocko@suse.com: rewrote changelog]
      Link: http://lkml.kernel.org/r/20160525155122.GK20132@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1464068266-27736-1-git-send-email-roy.qing.li@gmail.com
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
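
      The margin computation described in this commit can be sketched in
      plain C as follows.  This is an illustration with hypothetical names
      (margin_of(), mem_margin()) and plain counters standing in for the
      kernel's page counters; it assumes the intended result is the smaller
      of the two margins, with an exceeded limit counting as zero.

```c
/* Room left under one limit; zero when the counter is at or over it. */
static unsigned long margin_of(unsigned long count, unsigned long limit)
{
	return count < limit ? limit - count : 0;
}

/*
 * Hypothetical model of the fixed behaviour: when the memsw counter
 * exceeds its limit, the overall margin is 0 rather than
 * (memory limit - memory count).
 */
static unsigned long mem_margin(unsigned long mem_count,
				unsigned long mem_limit,
				unsigned long memsw_count,
				unsigned long memsw_limit)
{
	unsigned long margin = margin_of(mem_count, mem_limit);
	unsigned long memsw_margin = margin_of(memsw_count, memsw_limit);

	return margin < memsw_margin ? margin : memsw_margin;
}
```

      With a memory counter of 90 out of 100 and a memsw counter of 120 out
      of 110, this sketch yields a margin of 0 instead of 10.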
    • mm/cma: silence warnings due to max() usage · badbda53
      Stephen Rothwell authored
      pageblock_order can be (at least) an unsigned int or an unsigned long
      depending on the kernel config and architecture, so use max_t(unsigned
      long, ...) when comparing it.
      
      fixes these warnings:
      
      In file included from include/asm-generic/bug.h:13:0,
                       from arch/powerpc/include/asm/bug.h:127,
                       from include/linux/bug.h:4,
                       from include/linux/mmdebug.h:4,
                       from include/linux/mm.h:8,
                       from include/linux/memblock.h:18,
                       from mm/cma.c:28:
      mm/cma.c: In function 'cma_init_reserved_mem':
      include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
        (void) (&_max1 == &_max2);
                       ^
      mm/cma.c:186:27: note: in expansion of macro 'max'
        alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
                                 ^
      mm/cma.c: In function 'cma_declare_contiguous':
      include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
        (void) (&_max1 == &_max2);
                       ^
      include/linux/kernel.h:747:9: note: in definition of macro 'max'
        typeof(y) _max2 = (y);
               ^
      mm/cma.c:270:29: note: in expansion of macro 'max'
         (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                                   ^
      include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
        (void) (&_max1 == &_max2);
                       ^
      include/linux/kernel.h:747:21: note: in definition of macro 'max'
        typeof(y) _max2 = (y);
                           ^
      mm/cma.c:270:29: note: in expansion of macro 'max'
         (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                                   ^
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
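
      A rough userspace sketch of the max_t() approach this commit
      describes.  The macro below is a simplified stand-in that evaluates its
      arguments more than once (the kernel's own definition avoids that), and
      cma_alignment() is a hypothetical function used only to show the
      mixed-type comparison.

```c
/*
 * Simplified stand-in for max_t(): both operands are converted to the
 * named type before comparing, so their original types need not match.
 */
#define max_t(type, x, y) \
	((type)(x) > (type)(y) ? (type)(x) : (type)(y))

/*
 * Hypothetical helper: max_order is an int while pageblock_order is an
 * unsigned int here, mirroring the mixed types the commit mentions.
 */
static unsigned long cma_alignment(unsigned long page_size, int max_order,
				   unsigned int pageblock_order)
{
	return page_size << max_t(unsigned long, max_order - 1,
				  pageblock_order);
}
```

      Because both operands are cast to 'unsigned long' first, the
      "comparison of distinct pointer types" check that max() performs on
      differently typed arguments no longer applies.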
    • mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap() · 0798d3c0
      Kirill A. Shutemov authored
      If page_move_anon_rmap() is refiling a pmd-splitted THP mapped in a tail
      page from a pte, the "address" must be THP aligned in order for the
      page->index bugcheck to pass in the CONFIG_DEBUG_VM=y builds.
      
      Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com
      Fixes: 6d0a07ed ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>        [4.5]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom_reaper: close race with exiting task · e2fe1456
      Michal Hocko authored
      Tetsuo has reported:
        Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
        Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
        sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
        sh cpuset=/ mems_allowed=0
        CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          dump_stack+0x85/0xc8
          dump_header+0x5b/0x394
        oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      In other words:
      
        __oom_reap_task		exit_mm
          atomic_inc_not_zero
      				  tsk->mm = NULL
      				  mmput
      				    atomic_dec_and_test # > 0
      				  exit_oom_victim # New victim will be
      						  # selected
      				<OOM killer invoked>
      				  # no TIF_MEMDIE task so we can select a new one
          unmap_page_range # to release the memory
      
      The race exists even without the oom_reaper because anybody who pins the
      address space and gets preempted might race with exit_mm but oom_reaper
      made this race more probable.
      
      We can address the oom_reaper part by using oom_lock for __oom_reap_task
      because this would guarantee that a new oom victim will not be selected
      if the oom reaper might race with the exit path.  This doesn't solve the
      original issue, though, because somebody else still might be pinning
      mm_users and so __mmput won't be called to release the memory but that
      is not really reliably solvable because the task will get away from the
      oom sight as soon as it is unhashed from the task_list and so we cannot
      guarantee a new victim won't be selected.
      
      [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use early_pfn_to_nid in register_page_bootmem_info_node · f65e91df
      Yang Shi authored
      register_page_bootmem_info_node() is invoked in mem_init(), so it will
      be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
      enabled.  But, pfn_to_nid() depends on memmap which won't be fully setup
      until page_alloc_init_late() is done, so replace pfn_to_nid() by
      early_pfn_to_nid().
      
      Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use early_pfn_to_nid in page_ext_init · fe53ca54
      Yang Shi authored
      page_ext_init() checks suitable pages with pfn_to_nid(), but
      pfn_to_nid() depends on memmap which will not be setup fully until
      page_alloc_init_late() is done.  Use early_pfn_to_nid() instead of
      pfn_to_nid() so that page extension could be still used early even
      though CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled and catch early page
      allocation call sites.
      
      Suggested by Joonsoo Kim [1], this fix basically undoes the change
      introduced by commit b8f1a75d ("mm: call page_ext_init() after all
      struct pages are initialized") and fixes the same problem with a better
      approach.
      
      [1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: Kdump maintainers update · f871f191
      Vivek Goyal authored
      I am proposing following updates to kdump maintainership.  I have got
      busy in other things and not getting time to spend on kdump.
      
      Remove Haren Myneni as he has not participated in kdump development for
      a long time now.
      
      Add the names of Dave and Baoquan as kdump maintainers as they have been
      contributing to kdump for a long time now and they are in a much better
      position to spend time on this than me.
      
      Mark myself as a reviewer.
      
      Link: http://lkml.kernel.org/r/20160525131616.GB27291@redhat.com
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Simon Horman <horms@verge.net.au>
      Cc: Haren Myneni <hbabu@us.ibm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: add kexec_core.c and kexec_file.c · 10540a69
      Minfei Huang authored
      In the below commits kexec.c was split to kexec.c, kexec_file.c and
      kexec_core.c.
      
      commit a43cac0d ("kexec: split kexec_file syscall code to kexec_file.c")
      commit 2965faa5 ("kexec: split kexec_load syscall from kexec core code")
      
      Both kexec_file.c and kexec_core.c still belong to the kexec component.
      In order to get correct mail lists by using the script get_maintainer.pl,
      add these files to MAINTAINERS.
      
      Link: http://lkml.kernel.org/r/1464189735-59113-1-git-send-email-mnghuan@gmail.com
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: oom: do not reap task if there are live threads in threadgroup · edd9f723
      Vladimir Davydov authored
      If the current process is exiting, we don't invoke oom killer, instead
      we give it access to memory reserves and try to reap its mm in case
      nobody is going to use it.  There's a mistake in the code performing
      this check - we just ignore any process of the same thread group no
      matter if it is exiting or not - see try_oom_reaper.  Fix it.
      
      Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
      Fixes: 3ef22dff ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • direct-io: fix direct write stale data exposure from concurrent buffered read · 9ecd10b7
      Eryu Guan authored
      Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
      not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
      before calling the get_block() callback).  If it's a sparse file, direct
      writes fall back to buffered writes to avoid stale data exposure from
      concurrent buffered reads.  But two cases that can result in stale data
      exposure are not correctly detected.
      
      1. The detection for "writing inside i_size" is not sufficient,
         writes can be treated as "extending writes" wrongly.  For example,
         direct write 1FSB (file system block) to a 1FSB sparse file on
         ext2/3/4, starting from offset 0, in this case it's writing inside
         i_size, but 'create' is non-zero, because 'block_in_file' and
         'i_size_read(inode) >> blkbits' are both zero.
      
      2. Direct writes starting from or beyond i_size (not inside i_size)
         also could trigger block allocation and expose stale data.  For
         example, consider a sparse file with i_size of 2k, and a write to
         offset 2k or 3k into the file, with a filesystem block size of 4k.
         (Thanks to Jeff Moyer for pointing this case out in his review.)
      
      The first problem can be demonstrated by running ltp-aiodio test ADSP045
      many times.  When testing on extN filesystems, I see test failures
      occasionally, buffered read could read non-zero (stale) data.
      
      ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
      
      dio_sparse    0  TINFO  :  Dirtying free blocks
      dio_sparse    0  TINFO  :  Starting I/O tests
      non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
      non-zero read at offset 0
      dio_sparse    0  TINFO  :  Killing childrens(s)
      dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally
      
      The second problem can also be reproduced easily by a hacked dio_sparse
      program, which accepts an option to specify the write offset.
      
      What we should really do is to disable block allocation for writes that
      could result in filling holes inside i_size.
      
      Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
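
      The two cases in this commit can be checked with a small sketch that
      compares the block-granular test implied by the commit message with a
      stricter one.  The stricter condition is an assumption about what
      "disable block allocation for writes that could result in filling
      holes inside i_size" could look like, not a copy of the actual patch;
      both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check as described in the commit message: allocation is refused only
 * when the starting block is below (i_size >> blkbits).
 */
static bool may_allocate_old(uint64_t start_blk, uint64_t i_size,
			     unsigned int blkbits)
{
	return !(start_blk < (i_size >> blkbits));
}

/*
 * Assumed stricter check: refuse allocation whenever the starting block
 * is at or before the block that contains the last byte of the file.
 */
static bool may_allocate_new(uint64_t start_blk, uint64_t i_size,
			     unsigned int blkbits)
{
	if (i_size != 0 && start_blk <= ((i_size - 1) >> blkbits))
		return false;
	return true;
}
```

      For a 2k sparse file on a filesystem with 4k blocks (blkbits = 12), a
      write starting in block 0 may allocate under the first check but not
      under the second, while a write starting in block 1 may still
      allocate.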
    • ocfs2: bump up o2cb network protocol version · 38b52efd
      Junxiao Bi authored
      Two new messages are added to support negotiating hb timeout.  Stop
      nodes that talk an old protocol version from mounting, as they will
      cause the negotiation to fail.
      
      Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: fix hb hung time · 6633ca57
      Junxiao Bi authored
      hr_last_timeout_start should be set to the last time at which hb was
      still OK.  When the hb write times out, the hung time will be (jiffies -
      hr_last_timeout_start).
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: don't negotiate if last hb fail · 88dbe98d
      Junxiao Bi authored
      Sometimes an I/O error is returned when storage is down for a while.
      For example, for an iSCSI device, storage is taken offline when the
      session times out, and this makes all I/O return -EIO.  In this case,
      nodes shouldn't negotiate the timeout but should fence themselves.  So
      let nodes fence themselves when o2hb_do_disk_heartbeat returns an error;
      this is the same behavior as o2hb without the negotiate timer.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add some user/debug log · 1bd12902
      Junxiao Bi authored
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add NEGOTIATE_APPROVE message · e76f8237
      Junxiao Bi authored
      This message is used to re-queue the write timeout timer and the
      negotiate timer when all nodes suffer a write hang to storage; this
      keeps nodes from fencing themselves if storage is down.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add NEGO_TIMEOUT message · 34069b88
      Junxiao Bi authored
      This message is sent to the master node when a non-master node's
      negotiate timer expires.  The master node records these nodes in a
      bitmap, which is used to decide whether to re-queue the write timeout
      timer.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add negotiate timer · e0cbb798
      Junxiao Bi authored
      This series of patches fixes the issue that, when storage is down, all
      nodes fence themselves due to write timeout.

      With this patch set, all nodes will keep going until storage is back
      online, unless one of the following issues happens, in which case all
      nodes fence themselves as before.
      
      1. io error got
      2. network between nodes down
      3. nodes panic
      
      This patch (of 6):
      
      When storage goes down, all nodes fence themselves due to write
      timeout.  The negotiate timer is designed to avoid this: with it, a
      node waits until storage is up again.
      
      The negotiate timer works in the following way:
      
      1. The timer expires before the write timeout timer; its timeout is
         currently half of the write timeout.  It is re-queued along with the
         write timeout timer.  If it expires, it sends a NEGO_TIMEOUT message
         to the master node (the node with the lowest node number).  This
         message does nothing but mark a bit in a bitmap on the master node
         recording which nodes are negotiating a timeout.
      
      2. If storage is down, nodes send this message to the master node.  When
         the master node finds that its bitmap includes all online nodes, it
         sends a NEGO_APPROVL message to all nodes one by one; this message
         re-queues the write timeout timer and the negotiate timer.  Any node
         that doesn't receive this message, or hits an issue while handling
         it, will be fenced.  If storage comes up at any time, o2hb_thread
         will run and re-queue all the timers, and nothing will be affected
         by these two steps.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0cbb798
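The master-side bookkeeping described in the negotiate timer commit above can be illustrated with a small toy model. This is only a sketch of the two steps in the commit message, not the kernel's o2hb code: the class name `NegotiationMaster`, the helpers `master_node` and `negotiate_timeout_ms`, and the use of a Python set in place of a bitmap are all assumptions made for illustration.

```python
def master_node(online_nodes):
    """The master is the online node with the lowest node number."""
    return min(online_nodes)


def negotiate_timeout_ms(write_timeout_ms):
    """The negotiate timer fires at half of the write timeout."""
    return write_timeout_ms // 2


class NegotiationMaster:
    """Toy model of the master's bitmap of negotiating nodes (hypothetical)."""

    def __init__(self, online_nodes):
        self.online = set(online_nodes)
        self.negotiating = set()  # stands in for the bitmap on the master

    def on_nego_timeout(self, node):
        """Record a NEGO_TIMEOUT message from `node`.

        Returns the sorted list of nodes to send the approval message to
        once every online node is negotiating, otherwise an empty list.
        """
        if node not in self.online:
            return []
        self.negotiating.add(node)
        if self.negotiating >= self.online:
            approved = sorted(self.online)
            # Approval re-queues the timers, so the round starts over.
            self.negotiating.clear()
            return approved
        return []
```

In this model the approval is only issued when the negotiating set covers every online node, which mirrors the condition in step 2; a node that never reports would keep the others from being approved, and per the commit message the nodes would then fence as before.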
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 564884fb
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A set of fixes that wasn't included in the first merge window pull
        request.  This pull request contains:
      
         - A set of NVMe fixes from Keith, and one from Nic for the integrity
           side of it.
      
         - Fix from Ming, clearing ->mq_ops if we don't successfully setup a
           queue for multiqueue.
      
         - A set of stability fixes for bcache from Jiri, and also marking
           bcache as orphaned as it's no longer actively maintained (in
           mainline, at least)"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq: clear q->mq_ops if init fail
        MAINTAINERS: mark bcache as orphan
        bcache: bch_gc_thread() is not freezable
        bcache: bch_allocator_thread() is not freezable
        bcache: bch_writeback_thread() is not freezable
        nvme/host: Add missing blk_integrity tag_size + flags assignments
        NVMe: Add device ID's with stripe quirk
        NVMe: Short-cut removal on surprise hot-unplug
        NVMe: Allow user initiated rescan
        NVMe: Reduce driver log spamming
        NVMe: Unbind driver on failure
        NVMe: Delete only created queues
        NVMe: Allocate queues only for online cpus
      564884fb
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20160527' of git://git.infradead.org/linux-mtd · 4cae85be
      Linus Torvalds authored
      Pull MTD fixes from Brian Norris:
       "We've already noticed a few flaws in the MTD work for v4.7-rc1:
      
         - The Atmel folks got ahead of themselves on trying to support their
           latest hardware and were working off incorrect documentation.  Fix
           up the NAND driver to get this correct.
      
         - Fix up device tree example documentation to use the latest
           recommendations for describing NAND ECC algorithms"
      
      * tag 'for-linus-20160527' of git://git.infradead.org/linux-mtd:
        Documentation: dt: mtd: drop "soft_bch" from example
        Revert "mtd: atmel_nand: Support variable RB_EDGE interrupts"
      4cae85be
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-v4.7-rc1' of git://people.freedesktop.org/~airlied/linux · c61b49c7
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
      
       - one IMX built-in regression fix
      
       - a set of amdgpu fixes, mostly powerplay and polaris GPU stuff
      
       - a set of i915 fixes all over, many cc'ed to stable.
      
         The i915 batch contains support for DP++ dongle detection, which is
         used to fix some regressions in the HDMI color depth area
      
      * tag 'drm-fixes-v4.7-rc1' of git://people.freedesktop.org/~airlied/linux: (44 commits)
        drm/amd: add Kconfig dependency for ACP on DRM_AMDGPU
        drm/amdgpu: Fix hdmi deep color support.
        drm/amdgpu: fix bug in fence driver fini
        drm/i915: Stop automatically retiring requests after a GPU hang
        drm/i915: Unify intel_ring_begin()
        drm/i915: Ignore stale wm register values on resume on ilk-bdw (v2)
        drm/i915/psr: Try to program link training times correctly
        drm/imx: Match imx-ipuv3-crtc components using device node in platform data
        drm/i915/bxt: Adjusting the error in horizontal timings retrieval
        drm/i915: Don't leave old junk in ilk active watermarks on readout
        drm/i915: s/DPPL/DPLL/ for SKL DPLLs
        drm/i915: Fix gen8 semaphores id for legacy mode
        drm/i915: Set crtc_state->lane_count for HDMI
        drm/i915/BXT: Retrieving the horizontal timing for DSI
        drm/i915: Protect gen7 irq_seqno_barrier with uncore lock
        drm/i915: Re-enable GGTT earlier during resume on pre-gen6 platforms
        drm/i915: Determine DP++ type 1 DVI adaptor presence based on VBT
        drm/i915: Enable/disable TMDS output buffers in DP++ adaptor as needed
        drm/i915: Respect DP++ adaptor TMDS clock limit
        drm: Add helper for DP++ adaptors
        ...
      c61b49c7
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.7-1' of... · 1e8143db
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.7-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver updates from Darren Hart:
       "Mostly minor updates and cleanups.  One new power management
        controller driver for Intel Core SoCs.
      
        platform/x86:
         - Add PMC Driver for Intel Core SoC
      
        dell-rbtn:
         - Ignore ACPI notifications if device is suspended
      
        thinkpad_acpi:
         - save kbdlight state on suspend and restore it on resume
      
        intel_menlow:
         - reduce code duplication
      
        asus-wmi:
         - provide access to ALS control
      
        ideapad-laptop:
         - add a new WMI string for ESC key
      
        surfacepro3_button:
         - Add a warning when switching to tablet mode
      
        sony-laptop:
         - Avoid oops on module unload for older laptops
      
        intel_telemetry:
         - Constify telemetry_core_ops structures
      
        fujitsu-laptop:
         - Use IS_ENABLED() instead of checking for built-in or module
      
        asus-laptop:
         - correct error handling in sysfs_acpi_set
         - remove redundant initializers
         - correct error handling in asus_read_brightness()
      
        fujitsu-laptop:
         - Support radio LED"
      
      * tag 'platform-drivers-x86-v4.7-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
        platform/x86: Add PMC Driver for Intel Core SoC
        dell-rbtn: Ignore ACPI notifications if device is suspended
        thinkpad_acpi: save kbdlight state on suspend and restore it on resume
        intel_menlow: reduce code duplication
        asus-wmi: provide access to ALS control
        ideapad-laptop: add a new WMI string for ESC key
        surfacepro3_button: Add a warning when switching to tablet mode
        sony-laptop: Avoid oops on module unload for older laptops
        intel_telemetry: Constify telemetry_core_ops structures
        fujitsu-laptop: Use IS_ENABLED() instead of checking for built-in or module
        asus-laptop: correct error handling in sysfs_acpi_set
        asus-laptop: remove redundant initializers
        asus-laptop: correct error handling in asus_read_brightness()
        fujitsu-laptop: Support radio LED
      1e8143db
    • Linus Torvalds's avatar
      Merge git://git.infradead.org/intel-iommu · 25662785
      Linus Torvalds authored
      Pull intel IOMMU updates from David Woodhouse:
       "This patchset improves the scalability of the Intel IOMMU code by
        resolving two spinlock bottlenecks and eliminating the linearity of
        the IOVA allocator, yielding up to ~5x performance improvement and
        approaching 'iommu=off' performance"
      
      * git://git.infradead.org/intel-iommu:
        iommu/vt-d: Use per-cpu IOVA caching
        iommu/iova: introduce per-cpu caching to iova allocation
        iommu/vt-d: change intel-iommu to use IOVA frame numbers
        iommu/vt-d: avoid dev iotlb logic for domains with no dev iotlbs
        iommu/vt-d: only unmap mapped entries
        iommu/vt-d: correct flush_unmaps pfn usage
        iommu/vt-d: per-cpu deferred invalidation queues
        iommu/vt-d: refactoring of deferred flush entries
      25662785
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · e28e909c
      Linus Torvalds authored
      Pull second batch of KVM updates from Radim Krčmář:
       "General:
      
         - move kvm_stat tool from QEMU repo into tools/kvm/kvm_stat (kvm_stat
           had nothing to do with QEMU in the first place -- the tool only
           interprets debugfs)
      
         - expose per-vm statistics in debugfs and support them in kvm_stat
           (KVM always collected per-vm statistics, but they were summarised
           into global statistics)
      
        x86:
      
         - fix dynamic APICv (VMX was improperly configured and a guest could
           access host's APIC MSRs, CVE-2016-4440)
      
         - minor fixes
      
        ARM changes from Christoffer Dall:
      
         - new vgic reimplementation of our horribly broken legacy vgic
           implementation.  The two implementations will live side-by-side
           (with the new being the configured default) for one kernel release
           and then we'll remove the legacy one.
      
         - fix for a non-critical issue with virtual abort injection to guests"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
        tools: kvm_stat: Add comments
        tools: kvm_stat: Introduce pid monitoring
        KVM: Create debugfs dir and stat files for each VM
        MAINTAINERS: Add kvm tools
        tools: kvm_stat: Powerpc related fixes
        tools: Add kvm_stat man page
        tools: Add kvm_stat vm monitor script
        kvm:vmx: more complete state update on APICv on/off
        KVM: SVM: Add more SVM_EXIT_REASONS
        KVM: Unify traced vector format
        svm: bitwise vs logical op typo
        KVM: arm/arm64: vgic-new: Synchronize changes to active state
        KVM: arm/arm64: vgic-new: enable build
        KVM: arm/arm64: vgic-new: implement mapped IRQ handling
        KVM: arm/arm64: vgic-new: Wire up irqfd injection
        KVM: arm/arm64: vgic-new: Add vgic_v2/v3_enable
        KVM: arm/arm64: vgic-new: vgic_init: implement map_resources
        KVM: arm/arm64: vgic-new: vgic_init: implement vgic_init
        KVM: arm/arm64: vgic-new: vgic_init: implement vgic_create
        KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init
        ...
      e28e909c
    • Al Viro's avatar
      switch xattr_handler->set() to passing dentry and inode separately · 59301226
      Al Viro authored
      preparation for similar switch in ->setxattr() (see the next commit for
      rationale).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      59301226
    • Rajneesh Bhardwaj's avatar
      platform/x86: Add PMC Driver for Intel Core SoC · b740d2e9
      Rajneesh Bhardwaj authored
      This patch adds the Power Management Controller driver as a PCI driver
      for Intel Core SoC architecture.
      
      This driver can utilize debugging capabilities and supported features
      as exposed by the Power Management Controller.
      
      Please refer to the below specification for more details on PMC features.
      http://www.intel.in/content/www/in/en/chipsets/100-series-chipset-datasheet-vol-2.html
      
      The current version of this driver exposes SLP_S0_RESIDENCY counter.
      This counter can be used for detecting fragile SLP_S0 signal related
      failures and take corrective actions when PCH SLP_S0 signal is not
      asserted after kernel freeze as part of suspend to idle flow
      (echo freeze > /sys/power/state).
      
      Intel Platform Controller Hub (PCH) asserts SLP_S0 signal when it
      detects favorable conditions to enter its low power mode. As a
      pre-requisite the SoC should be in deepest possible Package C-State
      and devices should be in low power mode. For example, on Skylake SoC
      the deepest Package C-State is Package C10 or PC10. Suspend to idle
      flow generally leads to PC10 state but PC10 state may not be sufficient
      for realizing the platform wide power potential which SLP_S0 signal
      assertion can provide.
      
      SLP_S0 signal is often connected to the Embedded Controller (EC) and the
      Power Management IC (PMIC) for other platform power management related
      optimizations.
      
      In general, SLP_S0 assertion == PC10 + PCH low power mode + ModPhy Lanes
      power gated + PLL Idle.
      
      As part of this driver, a mechanism to read the SLP_S0_RESIDENCY is exposed
      as an API and also debugfs features are added to indicate SLP_S0 signal
      assertion residency in microseconds.
      
      echo freeze > /sys/power/state
      wake the system
      cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
      Signed-off-by: Vishwanath Somayaji <vishwanath.somayaji@intel.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      b740d2e9
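The debugfs check described in the PMC commit above could be scripted roughly as follows. This is a minimal sketch: the debugfs path is the one quoted in the commit message, but the helper names `read_residency_usec` and `slp_s0_asserted` are hypothetical, and the sketch assumes the file holds a single decimal number. Reading the real file typically needs root and a mounted debugfs.

```python
from pathlib import Path

# Path quoted in the commit message.
SLP_S0_RESIDENCY_PATH = "/sys/kernel/debug/pmc_core/slp_s0_residency_usec"


def read_residency_usec(path=SLP_S0_RESIDENCY_PATH):
    """Return the SLP_S0 residency counter in microseconds.

    Assumes the file contains one decimal integer (an assumption of this
    sketch, not something the commit message states).
    """
    return int(Path(path).read_text().strip())


def slp_s0_asserted(before_usec, after_usec):
    """True if the counter advanced across a suspend-to-idle cycle."""
    return after_usec > before_usec
```

A typical use would be to read the counter, run `echo freeze > /sys/power/state`, wake the system, read the counter again, and pass both values to `slp_s0_asserted`. An unchanged counter suggests the SLP_S0 signal was not asserted during the freeze.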
    • Gabriele Mazzotta's avatar
      dell-rbtn: Ignore ACPI notifications if device is suspended · ff865123
      Gabriele Mazzotta authored
      Some BIOSes unconditionally send an ACPI notification to RBTN when the
      system is resuming from suspend. This makes dell-rbtn send an input
      event to userspace as if a function key was pressed. Prevent this by
      ignoring all the notifications received while the device is suspended.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=106031
      Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
      Tested-by: Alex Hung <alex.hung@canonical.com>
      Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      ff865123
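The approach in the dell-rbtn commit above, dropping notifications that arrive while the device is suspended, can be modelled in a few lines. This is an illustrative sketch only: the class `RbtnModel` and its methods are hypothetical and do not correspond to the driver's actual functions or to any ACPI API.

```python
class RbtnModel:
    """Toy model of a device that ignores notifications while suspended."""

    def __init__(self):
        self.suspended = False
        self.input_events = []  # events that would reach userspace

    def suspend(self):
        self.suspended = True

    def resume(self):
        self.suspended = False

    def notify(self, event):
        """Handle a firmware notification; return True if it was forwarded."""
        if self.suspended:
            # Spurious notification sent by the BIOS during resume: drop it.
            return False
        self.input_events.append(event)
        return True
```

The point of the design is that a notification delivered before the device has finished resuming never turns into an input event, so userspace does not see a phantom key press.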