1. 13 Jun, 2017 31 commits
  2. 08 Jun, 2017 9 commits
    • mlock: fix mlock count can not decrease in race condition · 00fc586e
      Yisheng Xie authored
      [ Upstream commit 70feee0e ]
      
      Kefeng reported that when running the follow test, the mlock count in
      meminfo will increase permanently:
      
       [1] testcase
       linux:~ # cat test_mlockal
       grep Mlocked /proc/meminfo
        for j in `seq 0 10`
        do
       	for i in `seq 4 15`
       	do
       		./p_mlockall >> log &
       	done
       	sleep 0.2
       done
       # wait some time to let mlock counter decrease and 5s may not enough
       sleep 5
       grep Mlocked /proc/meminfo
      
       linux:~ # cat p_mlockall.c
       #include <sys/mman.h>
       #include <stdlib.h>
       #include <stdio.h>
      
       #define SPACE_LEN	4096
      
       int main(int argc, char ** argv)
       {
      	 	int ret;
      	 	void *adr = malloc(SPACE_LEN);
      	 	if (!adr)
      	 		return -1;
      
      	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
      	 	printf("mlockall ret = %d\n", ret);
      
      	 	ret = munlockall();
      	 	printf("munlockall ret = %d\n", ret);
      
      	 	free(adr);
      	 	return 0;
      	 }
      
      In __munlock_pagevec() we should decrement NR_MLOCK for each page where
      we clear the PageMlocked flag.  Commit 1ebb7cc6 ("mm: munlock: batch
      NR_MLOCK zone state updates") has introduced a bug where we don't
      decrement NR_MLOCK for pages where we clear the flag, but fail to
      isolate them from the lru list (e.g.  when the pages are on some other
      cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
      accounting gets permanently disrupted by this.
      
      Fix it by counting the number of pages whose PageMlocked flag is cleared.
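
      For illustration, a minimal sketch of that counting pattern (not the
      exact upstream diff; the variable names are assumptions):

       int i, nr = pagevec_count(pvec);
       int delta_munlocked = 0;

       for (i = 0; i < nr; i++) {
       	struct page *page = pvec->pages[i];

       	/* Account every page whose Mlocked flag we actually clear... */
       	if (TestClearPageMlocked(page))
       		delta_munlocked++;
       	/* ...even if isolating the page from the LRU fails later on. */
       }
       __mod_zone_page_state(zone, NR_MLOCK, -delta_munlocked);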
      
      Fixes: 1ebb7cc6 ("mm: munlock: batch NR_MLOCK zone state updates")
      Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      00fc586e
    • mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · 00136071
      Naoya Horiguchi authored
      [ Upstream commit ead07f6a ]
      
      memory_failure() can run in 2 different modes (specified by
      MF_COUNT_INCREASED) from a page refcount perspective.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller has
      taken a refcount on the target page.  If it is cleared, memory_failure()
      takes the refcount on its own.
      
      In the current code, however, refcounting is done differently in each
      caller.  For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  This inconsistent
      refcounting causes refcount failures, especially for thp tail pages.
      Typical user-visible effects are memory leaks or a
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
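
      A rough sketch of the idea (simplified, not the exact upstream
      implementation): when given a thp tail page, pin the compound head so
      every caller gets the same refcount behaviour.

       int get_hwpoison_page(struct page *page)
       {
       	struct page *head = compound_head(page);

       	if (PageHuge(head))
       		return get_page_unless_zero(head);

       	/*
       	 * A thp tail page cannot be pinned directly; take the refcount
       	 * on its head page instead.
       	 */
       	if (PageTransHuge(head))
       		return get_page_unless_zero(head);

       	return get_page_unless_zero(page);
       }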
      
      memory_failure() might fail to split a thp, in which case it returns
      without completing page isolation.  This is not good because PageHWPoison
      is still set on the thp and there's no easy way to unpoison such thps.  So
      this patch tries to roll back any action on the thp in the "non anonymous
      thp" and "thp split failed" cases, expecting that an MCE (SRAR) generated
      by a later access will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      00136071
    • mm/memory-failure: split thp earlier in memory error handling · da7cbd0c
      Naoya Horiguchi authored
      [ Upstream commit 415c64c1 ]
      
      memory_failure() doesn't handle thp itself at this time and needs to split
      it before doing isolation.  Currently thp is split in the middle of
      hwpoison_user_mappings(), but there're corner cases where memory_failure()
      wrongly tries to handle thp without splitting.
      
      1) "non anonymous" thp, which is not a normal operating mode of thp,
         but a memory error could hit a thp before anon_vma is initialized.  In
         such case, split_huge_page() fails and me_huge_page() (intended for
         hugetlb) is called for thp, which triggers BUG_ON in page_hstate().
      
      2) !PageLRU case, where hwpoison_user_mappings() returns with
         SWAP_SUCCESS and the result is the same as case 1.
      
      memory_failure() can't avoid splitting, so let's split the thp earlier,
      which also reduces the code that has to handle both normal pages and
      thps.
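
      As a rough illustration of the approach (a simplified sketch, not the
      upstream diff), the split now happens up front in memory_failure() and
      the two corner cases above just roll back and bail out:

       if (PageTransHuge(hpage)) {
       	/* Split before any isolation work; roll back if we can't. */
       	if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
       		pr_err("MCE: %#lx: cannot split thp, rolling back\n", pfn);
       		put_page(hpage);	/* drop the refcount taken earlier */
       		return -EBUSY;
       	}
       }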
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      da7cbd0c
    • slub/memcg: cure the brainless abuse of sysfs attributes · aeb3435b
      Thomas Gleixner authored
      [ Upstream commit 478fe303 ]
      
      memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
      to propagate settings from the root kmem_cache to a newly created
      kmem_cache.  It does that with:
      
           attr->show(root, buf);
           attr->store(new, buf, strlen(buf));
      
      Aside from being lazy and absurd hackery, this is broken because it does
      not check the return value of the show() function.
      
      Some of the show() functions return 0 w/o touching the buffer.  That
      means in such a case the store function is called with the stale content
      of the previous show().  That causes nonsense like invoking
      kmem_cache_shrink() on a newly created kmem_cache.  In the worst case it
      would cause handing in an uninitialized buffer.
      
      This should be rewritten properly by adding a propagate() callback to
      those slub_attributes which must be propagated, avoiding that insane
      conversion to and from ASCII, but that's too large for a hot fix.
      
      Check at least the return value of the show() function, so calling
      store() with stale content is prevented.
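
      A minimal sketch of that check in memcg_propagate_slab_attrs()
      (illustrative; the local variable names are assumptions):

       ssize_t len;

       len = attr->show(root_cache, buf);
       /* Only propagate when show() actually produced content. */
       if (len > 0)
       	attr->store(s, buf, len);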
      
      Steven said:
       "It can cause a deadlock with get_online_cpus() that has been uncovered
        by recent cpu hotplug and lockdep changes that Thomas and Peter have
        been doing.
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(cpu_hotplug.lock);
                                         lock(slab_mutex);
                                         lock(cpu_hotplug.lock);
            lock(slab_mutex);
      
           *** DEADLOCK ***"
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      aeb3435b
    • blkcg: use blkg_free() in blkcg_init_queue() failure path · afc6ec14
      Tejun Heo authored
      [ Upstream commit 994b7832 ]
      
      When blkcg_init_queue() fails midway after creating a new blkg, it
      performs kfree() directly; however, this doesn't free the policy data
      areas.  Make it use blkg_free() instead.  In turn, blkg_free() is
      updated to handle the root request_list special case.
      
      While this fixes a possible memory leak, it's on an unlikely failure
      path of an already cold path and the size leaked per occurrence is
      minuscule too.  I don't think it needs to be tagged for -stable.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      afc6ec14
    • blkcg: always create the blkcg_gq for the root blkcg · f9fac98f
      Tejun Heo authored
      [ Upstream commit ec13b1d6 ]
      
      Currently, blkcg does a minor optimization where the root blkcg is
      created when the first blkcg policy is activated on a queue and
      destroyed on the deactivation of the last.  On systems where blkcg is
      configured but not used, this saves one blkcg_gq struct per queue.  On
      systems where blkcg is actually used, there's no difference.  The only
      case where this can lead to any meaningful, albeit still minute, saving
      in memory consumption is when all blkcg policies are deactivated after
      being widely used in the system, which is a highly unlikely scenario.
      
      The conditional existence of root blkcg_gq has already created several
      bugs in blkcg and became an issue once again for the new per-cgroup
      wb_congested mechanism for cgroup writeback support leading to a NULL
      dereference when no blkcg policy is active.  This is really not worth
      bothering with.  This patch makes blkcg always allocate and link the
      root blkcg_gq and release it only on queue destruction.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      f9fac98f
    • iscsi-target: Use shash and ahash · 712b6a6d
      Herbert Xu authored
      [ Upstream commit 69110e3c ]
      
      This patch replaces uses of the long obsolete hash interface with
      either shash (for non-SG users) or ahash.
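
      For reference, a minimal sketch of the shash calling convention the
      patch moves to (a hypothetical standalone helper using crc32c, the
      algorithm iSCSI digests are based on):

       #include <linux/err.h>
       #include <crypto/hash.h>

       static int example_crc32c(const u8 *buf, unsigned int len, u8 *digest)
       {
       	struct crypto_shash *tfm;
       	int ret;

       	tfm = crypto_alloc_shash("crc32c", 0, 0);
       	if (IS_ERR(tfm))
       		return PTR_ERR(tfm);

       	{
       		SHASH_DESC_ON_STACK(desc, tfm);

       		desc->tfm = tfm;
       		desc->flags = 0;	/* field exists on kernels of this vintage */
       		/* init + update + final in one call */
       		ret = crypto_shash_digest(desc, buf, len, digest);
       	}

       	crypto_free_shash(tfm);
       	return ret;
       }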
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      712b6a6d
    • target/iscsi: Use proper SGL accessors for digest computation · 1bd31de3
      Alexei Potashnik authored
      [ Upstream commit aa75679c ]
      
      The current implementation assumes that all the buffers of an IO are linked
      with a single SG list, which is OK because target-core only allocates
      a contiguous scatterlist region.  However, this assumption is wrong for
      se_cmd descriptors that want to use chaining across multiple SGL regions.
      
      Fix this up by using proper SGL accessors for digest payload computation.
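
      As a rough sketch of what "proper SGL accessors" means here (a
      hypothetical helper, not the upstream code), the digest is fed one
      scatterlist entry at a time instead of assuming one contiguous region:

       #include <linux/scatterlist.h>
       #include <crypto/hash.h>

       /* Feed every entry of a (possibly chained) SG list into the hash. */
       static int example_hash_sgl(struct shash_desc *desc,
       			    struct scatterlist *sgl, int nents)
       {
       	struct scatterlist *sg;
       	int i, ret = 0;

       	for_each_sg(sgl, sg, nents, i) {
       		/* Real code would kmap() highmem pages before hashing. */
       		ret = crypto_shash_update(desc, sg_virt(sg), sg->length);
       		if (ret)
       			break;
       	}
       	return ret;
       }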
      Signed-off-by: Alexei Potashnik <alexei@purestorage.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      1bd31de3
    • iscsi-target: Fix initial login PDU asynchronous socket close OOPs · 89ff28d0
      Nicholas Bellinger authored
      [ Upstream commit 25cdda95 ]
      
      This patch fixes an OOPs originally introduced by:
      
         commit bb048357
         Author: Nicholas Bellinger <nab@linux-iscsi.org>
         Date:   Thu Sep 5 14:54:04 2013 -0700
      
         iscsi-target: Add sk->sk_state_change to cleanup after TCP failure
      
      which would trigger a NULL pointer dereference when a TCP connection
      was closed asynchronously via iscsi_target_sk_state_change(), but only
      when the initial PDU processing in iscsi_target_do_login() from iscsi_np
      process context was blocked waiting for backend I/O to complete.
      
      To address this issue, this patch makes the following changes.
      
      First, it introduces some common helper functions used for checking
      socket closing state, checking login_flags, and atomically checking
      socket closing state + setting login_flags.
      
      Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
      connection has dropped via iscsi_target_sk_state_change(), but the
      initial PDU processing within iscsi_target_do_login() in iscsi_np
      context is still running.  For this case, it sets LOGIN_FLAGS_CLOSED,
      but doesn't invoke schedule_delayed_work().
      
      The original NULL pointer dereference case reported by MNC is now handled
      by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
      transitioning to FFP to determine when the socket has already closed,
      or iscsi_target_start_negotiation() if the login needs to exchange
      more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
      closed.  For both of these cases, the cleanup up of remaining connection
      resources will occur in iscsi_target_start_negotiation() from iscsi_np
      process context once the failure is detected.
      
      Finally, to handle the case where iscsi_target_sk_state_change() is
      called after the initial PDU processing is complete, it now invokes
      conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
      existing iscsi_target_sk_check_close() checks detect connection failure.
      For this case, the cleanup of remaining connection resources will occur
      in iscsi_target_do_login_rx() from delayed workqueue process context
      once the failure is detected.
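
      As a rough sketch of the check-close pattern described above (a
      simplified illustration, not the upstream helper; the function name is
      hypothetical):

       static bool example_sk_check_close(struct iscsi_conn *conn)
       {
       	bool closed = false;

       	if (conn->sock) {
       		struct sock *sk = conn->sock->sk;

       		/* Serialize against the sk_state_change() callback. */
       		read_lock_bh(&sk->sk_callback_lock);
       		closed = (sk->sk_state == TCP_CLOSE) ||
       			 test_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags);
       		read_unlock_bh(&sk->sk_callback_lock);
       	}
       	return closed;
       }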
      Reported-by: Mike Christie <mchristi@redhat.com>
      Reviewed-by: Mike Christie <mchristi@redhat.com>
      Tested-by: Mike Christie <mchristi@redhat.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Reported-by: Hannes Reinecke <hare@suse.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Varun Prakash <varun@chelsio.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      89ff28d0