1. 12 Feb, 2019 30 commits
  2. 06 Feb, 2019 10 commits
    • Linux 4.14.98 · 0d7866d5
      Greg Kroah-Hartman authored
    • fanotify: fix handling of events on child sub-directory · 515160e3
      Amir Goldstein authored
      commit b469e7e4 upstream.
      
      When an event is reported on a sub-directory and the parent inode has
      a mark mask with FS_EVENT_ON_CHILD|FS_ISDIR, the event will be sent to
      fsnotify() even if the event type is not in the parent mark mask
      (e.g. FS_OPEN).
      
      Furthermore, if that event happened on a mount or a filesystem with
      a mount/sb mark that does have that event type in its mask, the "on
      child" event will be reported on the mount/sb mark.  That is not
      desired, because the user will get a duplicate event for the same action.
      
      Note that the event reported on the victim inode is never merged with
      the event reported on the parent inode, because of the check in
      should_merge(): old_fsn->inode == new_fsn->inode.
      
      Fix this by looking for a match of an actual event type (i.e. not just
      FS_ISDIR) in the parent's inode mark mask and by not reporting an "on
      child" event to the group if the event type is only found on mount/sb marks.
      
      [backport hint: The bug seems to have always been in fanotify, but this
                      patch will only apply cleanly to v4.19.y]
      
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      [amir: backport to v4.9]
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
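The first half of the fanotify fix above (matching on an actual event type rather than on FS_ISDIR alone) can be illustrated with a small userspace sketch. This is not the kernel's code: the `DEMO_` flag values and both helper functions are assumptions made for illustration, and the second half of the fix (suppressing "on child" events that only match mount/sb marks) is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values for this sketch only (assumed, not authoritative). */
#define DEMO_FS_MODIFY          0x00000002u
#define DEMO_FS_OPEN            0x00000020u
#define DEMO_FS_EVENT_ON_CHILD  0x08000000u
#define DEMO_FS_ISDIR           0x40000000u

/* Bits that qualify an event rather than name an event type. */
#define DEMO_FLAG_BITS (DEMO_FS_EVENT_ON_CHILD | DEMO_FS_ISDIR)

/* Loose check: any shared bit counts, so FS_ISDIR alone yields a match. */
static bool demo_parent_matches_loose(uint32_t parent_mask, uint32_t event)
{
    return (parent_mask & DEMO_FS_EVENT_ON_CHILD) &&
           (parent_mask & event) != 0;
}

/* Strict check: the parent mask must contain an actual event type bit. */
static bool demo_parent_matches_strict(uint32_t parent_mask, uint32_t event)
{
    uint32_t types = event & ~DEMO_FLAG_BITS;

    return (parent_mask & DEMO_FS_EVENT_ON_CHILD) &&
           (parent_mask & types) != 0;
}
```

With a parent mask of `DEMO_FS_MODIFY | DEMO_FS_EVENT_ON_CHILD | DEMO_FS_ISDIR`, an open event on a sub-directory passes the loose check but not the strict one, while a modify event passes both.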
    • drivers: core: Remove glue dirs from sysfs earlier · 2f4da60e
      Benjamin Herrenschmidt authored
      commit 726e4109 upstream.
      
      For devices with a class, we create a "glue" directory between
      the parent device and the new device with the class name.
      
      However, this directory is never explicitly removed when empty;
      this is left to the implicit sysfs removal done by kobject_release()
      when the object loses its last reference via kobject_put().
      
      This is problematic because as long as it's not been removed from
      sysfs, it is still present in the class kset and in sysfs directory
      structure.
      
      The presence in the class kset exposes a use-after-free bug fixed
      by the previous patch, but the presence in sysfs means that until
      the kobject is released, which can take a while (especially with
      kobject debugging), any attempt at re-creating it, such as binding a
      new device for that class/parent pair, will result in a sysfs
      duplicate file name error.
      
      This fixes it by instead doing an explicit kobject_del() when
      the glue dir is empty, by keeping track of the number of
      child devices of the gluedir.
      
      This is made easy by the fact that all glue dir operations are
      done with a global mutex, and there's already a function
      (cleanup_glue_dir) called in all the right places taking that
      mutex that can be enhanced for this. It appears that this was
      in fact the intent of the function, but the implementation was
      wrong.

      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Guenter Roeck <groeck@google.com>
      Cc: Zubin Mithra <zsm@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
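The idea of counting child devices and removing the glue directory explicitly once it is empty can be sketched in userspace as follows. All names here (`demo_glue_dir`, `demo_glue_add_child`, `demo_cleanup_glue_dir`) are hypothetical, the `in_sysfs` flag merely stands in for the effect of kobject_del(), and the caller is assumed to hold the global glue-dir mutex that the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_glue_dir {
    int  nr_children;  /* child devices currently under this glue dir */
    bool in_sysfs;     /* true while the directory name is still visible */
};

/* Adding a child (re)creates the directory if needed. */
static void demo_glue_add_child(struct demo_glue_dir *g)
{
    g->nr_children++;
    g->in_sysfs = true;
}

/* Assumed to run with the global glue-dir mutex held. */
static void demo_cleanup_glue_dir(struct demo_glue_dir *g)
{
    assert(g->nr_children > 0);
    if (--g->nr_children == 0)
        g->in_sysfs = false;  /* stands in for an explicit kobject_del() */
}
```

Because the name disappears as soon as the last child goes away, a later `demo_glue_add_child()` for the same class/parent pair does not collide with a stale entry that is still waiting for its final reference to drop.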
    • cifs: Always resolve hostname before reconnecting · 834adff8
      Paulo Alcantara authored
      commit 28eb24ff upstream.
      
      In case a hostname resolves to a different IP address (e.g. long
      running mounts), make sure to resolve it every time prior to calling
      generic_ip_connect() in reconnect.

      Signed-off-by: Paulo Alcantara <palcantara@suse.de>
      Suggested-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
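A minimal userspace sketch of "resolve on every reconnect" is shown below. It is not the cifs code: the `demo_server` structure, the resolver callback type, and the fake resolver are all hypothetical, and keeping the previous address when a lookup fails is an assumption of this sketch rather than something the commit message states.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct demo_server {
    const char *hostname;
    char addr[64];  /* address used for the last connection */
};

/* Hypothetical resolver: fills addr and returns true on success. */
typedef bool (*demo_resolve_fn)(const char *hostname, char *addr, size_t len);

/* Fake resolver for the demo: the answer changes after the first lookup. */
static int demo_lookups;
static bool demo_fake_resolve(const char *hostname, char *addr, size_t len)
{
    (void)hostname;
    snprintf(addr, len, "%s", demo_lookups++ == 0 ? "192.0.2.1" : "192.0.2.2");
    return true;
}

/* Returns true if the server address changed since the last connection. */
static bool demo_reconnect(struct demo_server *srv, demo_resolve_fn resolve)
{
    char fresh[sizeof(srv->addr)] = {0};
    bool changed = false;

    /* Resolve every time instead of trusting the cached address. */
    if (resolve(srv->hostname, fresh, sizeof(fresh))) {
        changed = strcmp(srv->addr, fresh) != 0;
        memcpy(srv->addr, fresh, sizeof(fresh));
    }
    /* Assumption: on lookup failure, keep the previous address. */
    return changed;  /* a real client would now connect to srv->addr */
}
```

Passing the resolver as a callback keeps the sketch deterministic; a real client would perform an actual name lookup at that point.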
    • md/raid5: fix 'out of memory' during raid cache recovery · fafc8e09
      Alexei Naberezhnov authored
      commit 483cbbed upstream.
      
      This fixes the case when md array assembly fails because raid cache recovery
      is unable to allocate a stripe, despite attempts to replay stripes and increase
      the cache size. This happens because stripes released by r5c_recovery_replay_stripes
      and raid5_set_cache_size don't become available for allocation immediately.
      Released stripes are first placed on the conf->released_stripes list and require
      the md thread to merge them onto conf->inactive_list before they can be allocated.

      The patch allows the final allocation attempt during cache recovery to wait for
      new stripes to become available for allocation.
      
      Cc: linux-raid@vger.kernel.org
      Cc: Shaohua Li <shli@kernel.org>
      Cc: linux-stable <stable@vger.kernel.org> # 4.10+
      Fixes: b4c625c6 ("md/r5cache: r5cache recovery: part 1")
      Signed-off-by: Alexei Naberezhnov <anaberezhnov@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
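The two-list behaviour described in the raid5 commit above (released stripes are not allocatable until they are merged onto the inactive list) can be modelled with a toy pool. The structure and function names are hypothetical, and the synchronous merge inside `demo_alloc()` only stands in for waiting on the md thread.

```c
#include <stdbool.h>

struct demo_stripe_pool {
    int released;  /* freed stripes not yet merged */
    int inactive;  /* stripes ready for allocation */
};

static void demo_release(struct demo_stripe_pool *p)
{
    p->released++;
}

/* Models the md thread moving released stripes onto the inactive list. */
static void demo_merge_released(struct demo_stripe_pool *p)
{
    p->inactive += p->released;
    p->released = 0;
}

/* Allocation only ever takes stripes from the inactive list. */
static bool demo_alloc(struct demo_stripe_pool *p, bool wait_for_merge)
{
    if (p->inactive == 0 && wait_for_merge)
        demo_merge_released(p);  /* stands in for waiting on the merge */
    if (p->inactive == 0)
        return false;
    p->inactive--;
    return true;
}
```

A non-waiting allocation right after `demo_release()` fails even though a stripe was just freed, which mirrors the spurious 'out of memory'; the waiting variant succeeds.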
    • mm: migrate: don't rely on __PageMovable() of newpage after unlocking it · 4ebbe06b
      David Hildenbrand authored
      commit e0a352fa upstream.
      
      We had a race in the old balloon compaction code before b1123ea6
      ("mm: balloon: use general non-lru movable page feature") refactored it
      that became visible after backporting 195a8c43 ("virtio-balloon:
      deflate via a page list") without the refactoring.
      
      The bug existed from commit d6d86c0a ("mm/balloon_compaction:
      redesign ballooned pages management") till b1123ea6 ("mm: balloon:
      use general non-lru movable page feature").  d6d86c0a
      ("mm/balloon_compaction: redesign ballooned pages management") was
      backported to 3.12, so the broken kernels are stable kernels [3.12 -
      4.7].
      
      There was a subtle race between dropping the page lock of the newpage in
      __unmap_and_move() and checking for __is_movable_balloon_page(newpage).
      
      Just after dropping this page lock, virtio-balloon could go ahead and
      deflate the newpage, effectively dequeueing it and clearing PageBalloon,
      in turn making __is_movable_balloon_page(newpage) fail.
      
      This resulted in dropping the reference of the newpage via
      putback_lru_page(newpage) instead of put_page(newpage), leading to
      page->lru getting modified and a !LRU page ending up in the LRU lists.
      With 195a8c43 ("virtio-balloon: deflate via a page list")
      backported, one would suddenly get corrupted lists in
      release_pages_balloon():
      
      - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
      - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100
      
      Nowadays this race is no longer possible, but it is hidden behind very
      ugly handling of __ClearPageMovable() and __PageMovable().
      
      __ClearPageMovable() will not make __PageMovable() fail, only
      PageMovable().  So the new check (__PageMovable(newpage)) will still
      hold even after newpage was dequeued by virtio-balloon.
      
      If anybody ever changed that special handling, the bug would be
      introduced again.  So instead, make it explicit and use the information
      of the original isolated page before migration.
      
      This patch can be backported fairly easily to stable kernels (in contrast
      to the refactoring).
      
      Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
      Fixes: d6d86c0a ("mm/balloon_compaction: redesign ballooned pages management")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Vratislav Bendel <vbendel@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vratislav Bendel <vbendel@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[3.12 - 4.7]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: hwpoison: use do_send_sig_info() instead of force_sig() · 0783205e
      Naoya Horiguchi authored
      commit 6376360e upstream.
      
      Currently memory_failure() races with process exit, which
      results in a kernel crash from a null pointer dereference.

      The root cause is that memory_failure() uses force_sig() to forcibly
      kill asynchronous (meaning not in the current context) processes.  As
      discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
      fixes, this is not the right thing to do.  OOM solves this issue by using
      do_send_sig_info() as done in commit d2d39309 ("signal:
      oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
      patch does the same for hwpoison.  do_send_sig_info()
      properly takes siglock with lock_task_sighand(), so it is free from
      the reported race.
      
      I confirmed that the reported bug reproduces when some delay is inserted
      in kill_procs(), and that it never reproduces with this patch.
      
      Note that memory_failure() can send another type of signal using
      force_sig_mceerr(), and the reported race shouldn't happen there because
      force_sig_mceerr() is called only for synchronous processes (i.e.
      BUS_MCEERR_AR happens only when some process accesses the corrupted
      memory).
      
      Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, oom: fix use-after-free in oom_kill_process · 43f7e8be
      Shakeel Butt authored
      commit cefc7ef3 upstream.
      
      A syzbot instance running on the upstream kernel found a use-after-free bug in
      oom_kill_process.  On further inspection it seems that the process
      selected to be oom-killed has exited even before reaching
      read_lock(&tasklist_lock) in oom_kill_process().  More specifically,
      tsk->usage is 1, which is due to get_task_struct() in oom_evaluate_task();
      the put_task_struct() within for_each_thread() then frees the tsk, and
      for_each_thread() tries to access the freed tsk.  The easiest fix is to do
      get/put across the for_each_thread() on the selected task.
      
      Now the next question is: should we continue with the oom-kill when the
      previously selected task has exited? However, before adding more
      complexity and heuristics, let's answer why we even look at the children
      of the oom-kill selected task. select_bad_process() has already selected
      the worst process in the system/memcg.  Due to the race, the selected
      process might not be the worst at kill time, but does that matter?
      Userspace can use the oom_score_adj interface to prefer children to
      be killed before the parent.  I looked at the history, but it seems
      this behavior predates git history.
      
      Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
      Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
      Fixes: 6b0c81b3 ("mm, oom: reduce dependency on tasklist_lock")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
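The "get/put across the loop" fix for the use-after-free above can be modelled with a toy reference count. The names are hypothetical, the `freed` flag stands in for the memory actually being released, and the loop body dropping a reference on its first iteration is a simplification of the put_task_struct() that the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_victim {
    int  usage;  /* reference count */
    bool freed;  /* set when the last reference is dropped */
};

static void demo_get(struct demo_victim *t)
{
    t->usage++;
}

static void demo_put(struct demo_victim *t)
{
    assert(t->usage > 0);
    if (--t->usage == 0)
        t->freed = true;  /* a kernel would free the task here */
}

/*
 * Walks n_threads iterations over the victim. The first iteration drops a
 * reference, modelling the put inside the loop. Returns false if a later
 * iteration would have touched an already freed victim.
 */
static bool demo_walk_threads(struct demo_victim *victim, int n_threads, bool pin)
{
    bool safe = true;

    if (pin)
        demo_get(victim);  /* hold a reference across the whole loop */
    for (int i = 0; i < n_threads; i++) {
        if (victim->freed) {
            safe = false;
            break;
        }
        if (i == 0)
            demo_put(victim);
    }
    if (pin)
        demo_put(victim);
    return safe;
}
```

Starting from `usage == 1`, the unpinned walk reports an unsafe access on the second iteration, while the pinned walk stays safe and releases the victim only after the loop has finished.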
    • oom, oom_reaper: do not enqueue same task twice · 73178548
      Tetsuo Handa authored
      commit 9bcdeb51 upstream.
      
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg even though the
      number of tasks in that memcg is not 0.  It turned out that there is a
      bug in wake_oom_reaper() which allows enqueuing the same task twice, which
      makes it impossible to decrease the number of tasks in that memcg due to a
      refcount leak.
      
      This bug has existed since the OOM reaper became invokable from the
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1@P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect, this patch also avoids enqueuing multiple
      threads sharing memory via the task_will_free_mem(current) path.
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
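The double-enqueue sequence in the diagram above can be reproduced with a small singly linked list model. The old-style check mirrors the condition quoted in the diagram (the task is the list head, or its own list pointer is non-NULL). The flag-based variant is only one plausible way to close the gap: the `queued` field and both function names are hypothetical, and the actual patch may track this state differently.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_reap_task {
    struct demo_reap_task *next;  /* models tsk->oom_reaper_list */
    bool queued;                  /* hypothetical "already queued" flag */
};

static struct demo_reap_task *demo_reaper_list;  /* models the list head */

/* Old-style check: misses a task sitting at the tail of the list. */
static bool demo_enqueue_old(struct demo_reap_task *t)
{
    if (t == demo_reaper_list || t->next)
        return false;
    t->next = demo_reaper_list;
    demo_reaper_list = t;
    return true;
}

/* Flag-based check: each task can be queued at most once. */
static bool demo_enqueue_flag(struct demo_reap_task *t)
{
    if (t->queued)
        return false;
    t->queued = true;
    t->next = demo_reaper_list;
    demo_reaper_list = t;
    return true;
}
```

Enqueuing T1, then T2, then T1 again succeeds three times with the old check, because T1 is neither the head nor followed by another entry; the third call also links the list into a cycle. With the flag, the third call is rejected.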
    • kernel/exit.c: release ptraced tasks before zap_pid_ns_processes · 3d98fc4d
      Andrei Vagin authored
      commit 8fb335e0 upstream.
      
      Currently, exit_ptrace() adds all ptraced tasks to a dead list, then
      zap_pid_ns_processes() waits on all tasks in the current pidns, and only
      then are tasks from the dead list released.

      zap_pid_ns_processes() can get stuck waiting for tasks on the dead
      list.  In this case, we will have one unkillable process with one or
      more dead children.
      
      Thanks to Oleg for the advice to release tasks in find_child_reaper().
      
      Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
      Fixes: 7c8bd232 ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>