1. 17 Mar, 2016 40 commits
    • radix-tree,shmem: introduce radix_tree_iter_next() · 7165092f
      Matthew Wilcox authored
      shmem likes to occasionally drop the lock, schedule, then reacquire the
      lock and continue with the iteration from the last place it left off.
      This is currently done with a pretty ugly goto.  Introduce
      radix_tree_iter_next() and use it throughout shmem.c.
      
      [koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use radix_tree_iter_retry() · 2cf938aa
      Matthew Wilcox authored
      Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
      restart from our current position.  This will make a difference when
      there are more ways to happen across an indirect pointer.  And it
      eliminates some confusing gotos.
      
      [vbabka@suse.cz: remove now-obsolete-and-misleading comment]
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • btrfs: use radix_tree_iter_retry() · c28f2420
      Matthew Wilcox authored
      Even though this is a 'can't happen' situation, use the new
      radix_tree_iter_retry() pattern to eliminate a goto.
      
      [akpm@linux-foundation.org: fix btrfs build]
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: add radix_tree_dump · 7cf19af4
      Matthew Wilcox authored
      This is debug code which is compiled out with #if 0.
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: add support for multi-order entries · e6145236
      Matthew Wilcox authored
      With huge pages, it is convenient to have the radix tree be able to
      return an entry that covers multiple indices.  Previous attempts to deal
      with the problem have involved inserting N duplicate entries, which is a
      waste of memory and leads to problems trying to handle aliased tags, or
      probing the tree multiple times to find alternative entries which might
      cover the requested index.
      
      This approach inserts one canonical entry into the tree for a given
      range of indices, and may also insert other entries in order to ensure
      that lookups find the canonical entry.
      
      This solution only tolerates inserting powers of two that are greater
      than the fanout of the tree.  If we wish to expand the radix tree's
      abilities to support large-ish pages that are smaller than the fanout at the
      penultimate level of the tree, then we would need to add one more step
      in lookup to ensure that any sibling nodes in the final level of the
      tree are dereferenced and we return the canonical entry that they
      reference.
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: loop based on shift count, not height · 0070e28d
      Matthew Wilcox authored
      When we introduce entries that can cover multiple indices, we will need
      to stop in __radix_tree_create based on the shift, not the height.
      Split out for ease of bisect.
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: tag all internal tree nodes as indirect pointers · 339e6353
      Matthew Wilcox authored
      Set the 'indirect_ptr' bit on all the pointers to internal nodes, not
      just on the root node.  This enables the following patches to support
      multi-order entries in the radix tree.  This patch is split out for ease
      of bisection.
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix tree test harness · 1366c37e
      Matthew Wilcox authored
      This code is mostly from Andrew Morton and Nick Piggin; tarball downloaded
      from http://ozlabs.org/~akpm/rtth.tar.gz with sha1sum
      0ce679db9ec047296b5d1ff7a1dfaa03a7bef1bd
      
      Some small modifications were necessary to the test harness to fix the
      build with the current Linux source code.
      
      I also made minor modifications to automatically test the radix-tree.c
      and radix-tree.h files that are in the current source tree, as opposed
      to a copied and slightly modified version.  I am sure more could be done
      to tidy up the harness, as well as adding more tests.
      
      [koct9i@gmail.com: fix compilation]
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix-tree: add an explicit include of bitops.h · f67c07f0
      Matthew Wilcox authored
      The radix-tree header uses the __ffs() function, which is defined in
      bitops.h.  The current kernel headers implicitly include bitops.h, but
      the userspace test harness does not.
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/bug.c: make panic_on_warn available for all architectures · d7b85cab
      Heiko Carstens authored
      Christian Borntraeger reported that panic_on_warn doesn't have any
      effect on s390.
      
      The panic_on_warn feature was introduced with 9e3961a0 ("kernel: add
      panic_on_warn").  However, it only handled the case where
      WANT_WARN_ON_SLOWPATH is defined.  This in turn is only the case for
      architectures which do not define their own __WARN_TAINT.
      
      Other architectures which do have __WARN_TAINT defined call report_bug()
      for warnings within lib/bug.c which does not call panic() in case
      panic_on_warn is set.
      
      Let's simply enable the panic_on_warn feature by adding the same code
      that was added to warn_slowpath_common() in panic.c.
      
      This enables panic_on_warn also for arm64, parisc, powerpc, s390 and sh.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/list_bl.h: use bool instead of int for boolean functions · 26a247fd
      Chen Gang authored
      hlist_bl_unhashed() and hlist_bl_empty() are all boolean functions, so
      return bool instead of int.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: update s-Par driver maintainer list · f68404bd
      David Kershner authored
      Benjamin Romer is no longer a maintainer for the Unisys s-Par driver,
      presently in drivers/staging/unisys/.
      Signed-off-by: David Kershner <david.kershner@unisys.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: add clear_idx symbol to vmcoreinfo · f468908b
      Ivan Delalande authored
      This allows us to extract from the vmcore only the messages emitted
      since the last time the ring buffer was cleared.  We just have to make
      sure its value is always up-to-date, when old messages are discarded to
      free space in log_make_free_space() for example.
      Signed-off-by: Zeyu Zhao <zzy8200@gmail.com>
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: check CON_ENABLED in have_callable_console() · adaf6590
      Sergey Senozhatsky authored
      have_callable_console() must also test the CON_ENABLED bit, not just
      CON_ANYTIME.  A CON_ANYTIME console may have been disabled, in which
      case printk can wrongly assume that it's safe to call
      call_console_drivers().
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: set may_schedule for some of console_trylock() callers · 6b97a20d
      Sergey Senozhatsky authored
      console_unlock() may cond_resched() if its caller has set
      `console_may_schedule' to 1, since 8d91f8b1 ("printk: do
      cond_resched() between lines while outputting to consoles").
      
      The rules are:
      -- console_lock() always sets `console_may_schedule' to 1
      -- console_trylock() always sets `console_may_schedule' to 0
      
      However, console_trylock() callers (among them is printk()) do not
      always call printk() from atomic contexts, and some of them can
      cond_resched() in console_unlock(), so console_trylock() can set
      `console_may_schedule' to 1 for such processes.
      
      For !CONFIG_PREEMPT_COUNT kernels, however, console_trylock() always
      sets `console_may_schedule' to 0.
      
      It's possible to drop explicit preempt_disable()/preempt_enable() in
      vprintk_emit(), because console_unlock() and console_trylock() are now
      smart enough:
       a) console_unlock() does not cond_resched() when it's unsafe
          (console_trylock() takes care of that)
       b) console_unlock() does can_use_console() check.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: move can_use_console() out of console_trylock_for_printk() · a8199371
      Sergey Senozhatsky authored
      console_unlock() may cond_resched() if its caller has set
      `console_may_schedule' to 1 (this functionality has been present since
      8d91f8b1 ("printk: do cond_resched() between lines while outputting
      to consoles")).
      
      The rules are:
      -- console_lock() always sets `console_may_schedule' to 1
      -- console_trylock() always sets `console_may_schedule' to 0
      
      printk() calls console_unlock() with preemption disabled, which
      basically can lead to RCU stalls, watchdog soft lockups, etc.  if
      something is simultaneously calling printk() frequently enough (IOW,
      console_sem owner always has new data to send to console drivers and
      can't leave console_unlock() for a long time).
      
      printk()->console_trylock() callers do not necessarily execute in atomic
      contexts, and some of them can cond_resched() in console_unlock().
      console_trylock() can set `console_may_schedule' to 1 (allow
      cond_resched() later in console_unlock()) when it's safe.
      
      This patch (of 3):
      
      vprintk_emit() disables preemption around console_trylock_for_printk()
      and console_unlock() calls for a strong reason -- can_use_console()
      check.  The thing is that vprintk_emit() can be called on a CPU that is
      not fully brought up yet (!cpu_online()), which potentially can cause
      problems if console driver wants to access per-cpu data.  A console
      driver can explicitly state that it's safe to call it from !online cpu
      by setting CON_ANYTIME bit in console ->flags.  That's why for
      !cpu_online() can_use_console() iterates over all the consoles to find
      out if there is a CON_ANYTIME console, otherwise console_unlock() must
      be avoided.
      
      can_use_console() ensures that console_unlock() call is safe in
      vprintk_emit() only; console_lock() and console_trylock() are not
      covered by this check.  Even though call_console_drivers(), invoked from
      console_cont_flush() and console_unlock(), tests `!cpu_online() &&
      CON_ANYTIME' for_each_console(), it may be too late, which can result in
      message loss.
      
      Assume that we have 2 cpus -- CPU0 is online, CPU1 is !online, and no
      CON_ANYTIME consoles available.
      
      CPU0 online                        CPU1 !online
                                       console_trylock()
                                       ...
                                       console_unlock()
                                         console_cont_flush
                                           spin_lock logbuf_lock
                                           if (!cont.len) {
                                              spin_unlock logbuf_lock
                                              return
                                           }
                                         for (;;) {
      vprintk_emit
        spin_lock logbuf_lock
        log_store
        spin_unlock logbuf_lock
                                           spin_lock logbuf_lock
        !console_trylock_for_printk        msg_print_text
       return                              console_idx = log_next()
                                           console_seq++
                                           console_prev = msg->flags
                                           spin_unlock logbuf_lock
      
                                           call_console_drivers()
                                             for_each_console(con) {
                                               if (!cpu_online() &&
                                                   !(con->flags & CON_ANYTIME))
                                                       continue;
                                               }
                                         /*
                                          * no message printed, we lost it
                                          */
      vprintk_emit
        spin_lock logbuf_lock
        log_store
        spin_unlock logbuf_lock
        !console_trylock_for_printk
       return
                                         /*
                                          * go to the beginning of the loop,
                                          * find out there are new messages,
                                          * lose it
                                          */
                                         }
      
      console_trylock()/console_lock() call on CPU1 may come from cpu
      notifiers registered on that CPU.  Since notifiers are not getting
      unregistered when CPU is going DOWN, all of the notifiers receive
      notifications during CPU UP.  For example, on my x86_64, I see around 50
      notifications sent from an offline CPU to itself:
      
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING hotplug_hrtick
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_main_cpu_notify
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_queue_reinit_notify
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING console_cpu_notify
      
      while doing
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
      
      So grabbing the console_sem lock while CPU is !online is possible,
      in theory.
      
      This patch moves can_use_console() check out of
      console_trylock_for_printk().  Instead it calls it in console_unlock(),
      so now console_lock()/console_unlock() are also 'protected' by
      can_use_console().  This also means that console_trylock_for_printk() is
      not really needed anymore and can be removed.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/uapi/linux/elf-em.h: remove v850 · faeb50b9
      Rob Landley authored
      The v850 port was removed by commits f606ddf4 and 07a887d3 in
      2008.  These #defines are not used in the current kernel.
      Signed-off-by: Rob Landley <rob@landley.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fix Christoph's email addresses · 93e205a7
      Christoph Lameter authored
      There are various email addresses for me throughout the kernel.  Use the
      one that will always be valid.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bug: set warn variable before calling WARN() · dfbf2897
      Steven Rostedt authored
      This has hit me a couple of times already.  I would be debugging code
      and the system would simply hang and then reboot.  Finally, I found that
      the problem was caused by WARN_ON_ONCE() and friends.
      
      The macro WARN_ON_ONCE(condition) is defined as:
      
      	static bool __section(.data.unlikely) __warned;
      	int __ret_warn_once = !!(condition);
      
      	if (unlikely(__ret_warn_once))
      		if (WARN_ON(!__warned))
      			__warned = true;
      
      	unlikely(__ret_warn_once);
      
      Which looks great and all.  But what I have hit, is an issue when
      WARN_ON() itself hits the same WARN_ON_ONCE() code.  Because, the
      variable __warned is not yet set.  Then it too calls WARN_ON() and that
      triggers the warning again.  It keeps doing this until the stack is
      overflowed and the system crashes.
      
      Setting __warned before calling WARN_ON() makes the original
      WARN_ON_ONCE() really warn only once, and not an infinite number of
      times if the WARN_ON() also triggers the warning.
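
      The ordering issue can be illustrated with a small userspace sketch.
      It is not the kernel macro: warn_site(), report() and MAX_DEPTH are
      hypothetical stand-ins, and the depth cap exists only so that the old
      ordering terminates instead of overflowing the stack.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 8	/* hypothetical cap so the old ordering terminates */

static bool warned;	/* plays the role of __warned */
static int reports;	/* number of times the warning body ran */

static void warn_site(bool set_flag_first, int depth);

/* Stand-in for WARN_ON(): its reporting path re-enters the same site. */
static void report(bool set_flag_first, int depth)
{
	assert(depth <= MAX_DEPTH);
	reports++;
	if (depth < MAX_DEPTH)
		warn_site(set_flag_first, depth + 1);
}

/* Stand-in for one WARN_ON_ONCE() call site. */
static void warn_site(bool set_flag_first, int depth)
{
	if (warned)
		return;
	if (set_flag_first) {
		warned = true;			/* new ordering: mark first */
		report(set_flag_first, depth);
	} else {
		report(set_flag_first, depth);	/* old ordering: warn first */
		warned = true;
	}
}

/* Trigger the site once and count how many reports were produced. */
int count_reports(bool set_flag_first)
{
	warned = false;
	reports = 0;
	warn_site(set_flag_first, 0);
	return reports;
}
```

      With the flag set first, count_reports(true) yields a single report;
      with the old ordering, the site keeps re-entering until the cap stops
      it.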
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/mn10300/kernel/fpu-nofpu.c: needs asm/elf.h · c60f1692
      Andrew Morton authored
      arch/mn10300/kernel/fpu-nofpu.c:27:36: error: unknown type name 'elf_fpregset_t'
          int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpreg)
      Reported-by: kbuild test robot <lkp@intel.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mn10300, c6x: CONFIG_GENERIC_BUG must depend on CONFIG_BUG · 8b9e6d58
      Andrew Morton authored
      CONFIG_BUG=n && CONFIG_GENERIC_BUG=y make no sense and things break:
      
         In file included from include/linux/page-flags.h:9:0,
                          from kernel/bounds.c:9:
         include/linux/bug.h:91:47: warning: 'struct bug_entry' declared inside parameter list
          static inline int is_warning_bug(const struct bug_entry *bug)
                                                        ^
         include/linux/bug.h:91:47: warning: its scope is only this definition or declaration, which is probably not what you want
         include/linux/bug.h: In function 'is_warning_bug':
      >> include/linux/bug.h:93:12: error: dereferencing pointer to incomplete type
           return bug->flags & BUGFLAG_WARNING;
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc-vmcore: wrong data type casting fix · 0b50a2d8
      Dave Young authored
      On an i686 PAE-enabled machine, the contiguous physical area can be
      large, which can truncate values in the following calculation in
      read_vmcore() and mmap_vmcore():
      
      	tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
      
      That is, the types being used on i686 are:
      m->offset: unsigned long long int
      m->size:   unsigned long long int
      *fpos:     loff_t (long long int)
      buflen:    size_t (unsigned int)
      
      So casting (m->offset + m->size - *fpos) to size_t truncates the value
      to its low 32 bits, i.e. modulo 4GB.

      Suppose (m->offset + m->size - *fpos) is truncated to 0 while buflen > 0;
      then we get tsz = 0, which is of course not the expected result.
      Similarly, we could also get other truncated values less than buflen.
      Then the real size passed down is no longer correct.

      If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or
      mmap_vmcore use the min_t result with truncated values being compared to
      buflen.  Then, fpos proceeds with the wrong value so that we hit the
      following bugs:
      
      1) read_vmcore will refuse to continue so makedumpfile fails.
      2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
      
      Use unsigned long long in min_t instead so that the values are not
      truncated.
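
      The truncation can be reproduced in userspace with fixed-width types.
      This is a sketch rather than the kernel code: uint32_t models the
      32-bit size_t of i686, and min_as_size32(), min_as_u64() and
      check_truncation() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Models min_t(size_t, ...) where size_t is 32 bits wide. */
uint32_t min_as_size32(uint64_t remaining, uint32_t buflen)
{
	uint32_t truncated = (uint32_t)remaining;	/* drops bits above 4 GiB */

	return truncated < buflen ? truncated : buflen;
}

/* Models min_t(unsigned long long, ...): compare at full width. */
uint64_t min_as_u64(uint64_t remaining, uint32_t buflen)
{
	return remaining < buflen ? remaining : (uint64_t)buflen;
}

/* Usage check: exactly 4 GiB left to read, 4 KiB buffer. */
void check_truncation(void)
{
	assert(min_as_size32(1ULL << 32, 4096) == 0);	/* bogus tsz = 0 */
	assert(min_as_u64(1ULL << 32, 4096) == 4096);	/* expected tsz */
}
```

      Comparing at 64-bit width first and narrowing afterwards is safe here
      because the result can never exceed buflen.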
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Minfei Huang <mhuang@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan" · 7e2bc81d
      Minfei Huang authored
      It is inelegant that the shell prompt does not start on a new line after
      executing "cat /proc/$pid/wchan".  Make the shell prompt start on a new
      line.
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: add conditional compilation check · b5946bea
      Eric Engestrom authored
      `proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is
      enabled.
      Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: add /proc/<pid>/timerslack_ns interface · 5de23d43
      John Stultz authored
      This patch provides a proc/PID/timerslack_ns interface which exposes a
      task's timerslack value in nanoseconds and allows it to be changed.
      
      This allows power/performance management software to set timer slack for
      other threads according to its policy for the thread (such as when the
      thread is designated foreground vs.  background activity).

      If the value written is non-zero, slack is set to that value.  Otherwise,
      slack is set to the default for the thread.

      This interface checks that the calling task has permission to use
      PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
      arbitrary apps do not change the timer slack for other apps.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz authored
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as an unsigned long, which limits the slack range to ~4 seconds
      on 32bit systems.  It converts it to a u64, which provides the same
      basically unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently an unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, it's much much larger (~500
      years).
      
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK
      specifies the interface as an unsigned long, so we preserve that
      limitation on 32bit systems, where SET_TIMERSLACK can only set the slack
      to an unsigned long value, and GET_TIMERSLACK will return ULONG_MAX if
      the slack is actually larger than what can be stored in an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as an unsigned long.
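
The ranges and the GET_TIMERSLACK saturation described above can be checked with a small user-space sketch.  This is not the kernel code: the helper names and the `ULONG_MAX_32` constant are hypothetical, and a 32-bit unsigned long is modelled with `uint32_t` so the sketch behaves the same on any host.

```c
#include <stdint.h>

/* Assumption: models ULONG_MAX on an ABI where unsigned long is 32 bits. */
#define ULONG_MAX_32 UINT32_MAX

/*
 * Hypothetical helper: the value a PR_GET_TIMERSLACK-style call could
 * report on a 32-bit ABI when the slack is stored internally as a u64.
 */
static uint32_t get_timerslack_32(uint64_t slack_ns)
{
    if (slack_ns > ULONG_MAX_32)
        return ULONG_MAX_32;        /* does not fit: saturate */
    return (uint32_t)slack_ns;
}

/* Whole seconds representable by a 32-bit nanosecond count. */
static uint64_t max_slack_seconds_32(void)
{
    return (uint64_t)UINT32_MAX / 1000000000ULL;
}

/* Whole (365-day) years representable by a 64-bit nanosecond count. */
static uint64_t max_slack_years_64(void)
{
    const uint64_t ns_per_year = 365ULL * 24 * 60 * 60 * 1000000000ULL;

    return UINT64_MAX / ns_per_year;
}
```

With a 10 second slack (10000000000 ns), as in the bash example from the cover text, `get_timerslack_32()` saturates, while any value that fits in 32 bits is passed through unchanged.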
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
    • Tetsuo Handa's avatar
      mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled · 0a687aac
      Tetsuo Handa authored
      After the OOM killer is disabled during suspend operation, any
      !__GFP_NOFAIL && __GFP_FS allocations are forced to fail.  Thus, any
      !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.
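
The change in behaviour can be written as a pair of predicates.  This is only a sketch of the rule stated above, not the allocator code: the `SKETCH_GFP_*` bit values and both function names are made up for illustration.

```c
#include <stdbool.h>

/* Illustrative flag bits only; the real __GFP_* values differ. */
#define SKETCH_GFP_FS     0x1u
#define SKETCH_GFP_NOFAIL 0x2u

/* Before: with the OOM killer disabled, only __GFP_FS requests fail. */
static bool forced_to_fail_before(unsigned int gfp, bool oom_disabled)
{
    if (!oom_disabled || (gfp & SKETCH_GFP_NOFAIL))
        return false;
    return (gfp & SKETCH_GFP_FS) != 0;
}

/* After: every request without __GFP_NOFAIL fails, __GFP_FS or not. */
static bool forced_to_fail_after(unsigned int gfp, bool oom_disabled)
{
    return oom_disabled && !(gfp & SKETCH_GFP_NOFAIL);
}
```

The two predicates differ only for requests that carry neither flag, which is the case this patch addresses.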
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a687aac
    • Tetsuo Handa's avatar
      mm,oom: make oom_killer_disable() killable · 6afcf289
      Tetsuo Handa authored
      While oom_killer_disable() is called by freeze_processes() after all
      user threads except the current thread are frozen, it is possible that
      kernel threads invoke the OOM killer and send SIGKILL to the current
      thread because it shares the thawed victim's memory.  Therefore,
      checking for SIGKILL is preferable to checking TIF_MEMDIE.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6afcf289
    • Sergey Senozhatsky's avatar
      mm/zsmalloc: add `freeable' column to pool stat · 1120ed54
      Sergey Senozhatsky authored
      Add a new column to pool stats, which will tell how many pages ideally
      can be freed by class compaction, so it will be easier to analyze
      zsmalloc fragmentation.
      
      At the moment, we have only numbers of FULL and ALMOST_EMPTY classes,
      but they don't tell us how badly the class is fragmented internally.
      
      The new /sys/kernel/debug/zsmalloc/zramX/classes output looks as follows:
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      [..]
          12   224           0            2           146          5          8                4        4
          13   240           0            0             0          0          0                1        0
          14   256           1           13          1840       1672        115                1       10
          15   272           0            0             0          0          0                1        0
      [..]
          49   816           0            3           745        735        149                1        2
          51   848           3            4           361        306         76                4        8
          52   864          12           14           378        268         81                3       21
          54   896           1           12           117         57         26                2       12
          57   944           0            0             0          0          0                3        0
      [..]
       Total                26          131         12709      10994       1071                       134
      
      For example, from this particular output we can easily conclude that
      class-896 is heavily fragmented -- it occupies 26 pages, 12 can be freed
      by compaction.
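
The `freeable' numbers in the table can be reproduced with a short calculation.  The sketch below assumes 4096-byte pages and assumes the estimate is "unused object slots, rounded down to whole zspages, times pages per zspage"; the function and macro names are hypothetical, and this is not presented as the exact zsmalloc code.

```c
/* Assumption: 4 KiB pages, as on the system that produced the table. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: estimate how many pages compaction could free
 * for one size class, given the columns shown in the classes file.
 */
static unsigned long freeable_pages(unsigned long class_size,
                                    unsigned long obj_allocated,
                                    unsigned long obj_used,
                                    unsigned long pages_per_zspage)
{
    /* Objects that fit into one zspage of this class. */
    unsigned long objs_per_zspage =
        pages_per_zspage * SKETCH_PAGE_SIZE / class_size;
    /* Allocated slots that currently hold no object. */
    unsigned long unused_objs = obj_allocated - obj_used;

    /* Only whole zspages can be released. */
    return unused_objs / objs_per_zspage * pages_per_zspage;
}
```

For class 896 this gives (117 - 57) / 9 = 6 whole zspages of 2 pages each, i.e. the 12 pages quoted above.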
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1120ed54
    • YiPing Xu's avatar
      zsmalloc: drop unused member 'mapping_area->huge' · a82cbf07
      YiPing Xu authored
      When unmapping a huge class page in zs_unmap_object, the page will be
      unmapped by kmap_atomic.  The "!area->huge" branch in __zs_unmap_object
      is always true, and no code sets "area->huge" now, so we can drop it.
      Signed-off-by: YiPing Xu <xuyiping@huawei.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a82cbf07
    • Shawn Lin's avatar
      mm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment · a1c0b1a0
      Shawn Lin authored
      We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED()
      when checking for PAGE_SIZE alignment.
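
The relationship between the two macros can be shown with user-space stand-ins.  The `SKETCH_*` names and the 4096-byte page size are assumptions made for this sketch; they mirror the shape of the kernel macros but are not the kernel definitions.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Generic check; the alignment must be a power of two. */
#define SKETCH_IS_ALIGNED(x, a)   (((x) & ((a) - 1)) == 0)

/* Page-specific wrapper: same test with the alignment fixed. */
#define SKETCH_PAGE_ALIGNED(addr) \
    SKETCH_IS_ALIGNED((uintptr_t)(addr), SKETCH_PAGE_SIZE)
```

Both forms give the same answer for any address; the wrapper only removes the need to spell out the page size at each call site.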
      Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1c0b1a0
    • Vladimir Davydov's avatar
      mm: memcontrol: zap oom_info_lock · e0775d10
      Vladimir Davydov authored
      mem_cgroup_print_oom_info is always called under oom_lock, so
      oom_info_lock is redundant.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0775d10
    • Johannes Weiner's avatar
      mm: memcontrol: clarify the uncharge_list() loop · 8b592656
      Johannes Weiner authored
      uncharge_list() does an unusual list walk because the function can take
      regular lists with dedicated list_heads as well as singleton lists where
      a single page is passed via the page->lru list node.
      
      This can sometimes lead to confusion as well as suggestions to replace
      the loop with a list_for_each_entry(), which wouldn't work.
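
Why the usual iterator does not fit can be seen in a user-space model.  Everything below is a sketch with made-up types (`list_node`, `sketch_page`) and helper names; it only assumes that a singleton is passed as a page whose list node points back to itself.

```c
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };
struct sketch_page { int id; struct list_node lru; };

/* Recover the containing page from its embedded list node. */
#define node_to_page(ptr) \
    ((struct sketch_page *)((char *)(ptr) - offsetof(struct sketch_page, lru)))

/*
 * do-while walk: visits the first node before testing for the end, so
 * it works both for a dedicated head and for a self-linked page node.
 * The caller must not pass an empty dedicated list.
 */
static int count_do_while(struct list_node *page_list)
{
    struct list_node *next = page_list->next;
    int visited = 0;

    do {
        struct sketch_page *page = node_to_page(next);

        next = page->lru.next;
        visited++;
    } while (next != page_list);

    return visited;
}

/* list_for_each_entry-style walk: tests for the end before visiting. */
static int count_for_each(struct list_node *head)
{
    int visited = 0;

    for (struct list_node *pos = head->next; pos != head; pos = pos->next)
        visited++;

    return visited;
}
```

For a self-linked singleton, `head->next == head` holds immediately, so the for-each form visits nothing while the do-while form visits the one page.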
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b592656
    • Johannes Weiner's avatar
      mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage · b6e6edcf
      Johannes Weiner authored
      Setting the original memory.limit_in_bytes hardlimit is subject to a
      race condition when the desired value is below the current usage.  The
      code tries a few times to first reclaim and then see if the usage has
      dropped to where we would like it to be, but there is no locking, and
      the workload is free to continue making new charges up to the old limit.
      Thus, attempting to shrink a workload relies on pure luck and hope that
      the workload happens to cooperate.
      
      To fix this in the cgroup2 memory.max knob, do it the other way round:
      set the limit first, then try enforcement.  And if reclaim is not able
      to succeed, trigger OOM kills in the group.  Keep going until the new
      limit is met, we run out of OOM victims and there's only unreclaimable
      memory left, or the task writing to memory.max is killed.  This allows
      users to shrink groups reliably, and the behavior is consistent with
      what happens when new charges are attempted in excess of memory.max.
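
The "limit first, then enforce" ordering can be modelled as a toy loop.  This is a simplified sketch, not the memcg implementation: the `sketch_group` structure, its fields and all function names are invented, and the "writer is killed" exit condition is left out.

```c
#include <stdbool.h>

struct sketch_group {
    unsigned long usage;        /* pages currently charged */
    unsigned long max;          /* hard limit */
    unsigned long reclaimable;  /* pages that reclaim could free */
    unsigned long killable;     /* pages freed by killing the victims */
};

/* Free up to 'want' pages; returns how many were actually freed. */
static unsigned long sketch_reclaim(struct sketch_group *g, unsigned long want)
{
    unsigned long got = want < g->reclaimable ? want : g->reclaimable;

    g->reclaimable -= got;
    g->usage -= got;
    return got;
}

/* Kill the remaining victims, if any; returns false when none are left. */
static bool sketch_oom_kill(struct sketch_group *g)
{
    if (!g->killable)
        return false;
    g->usage -= g->killable;
    g->killable = 0;
    return true;
}

/* Returns true when usage ends up at or below the new limit. */
static bool sketch_write_max(struct sketch_group *g, unsigned long new_max)
{
    g->max = new_max;               /* 1. the limit is set first */

    while (g->usage > g->max) {     /* 2. then it is enforced */
        if (sketch_reclaim(g, g->usage - g->max))
            continue;               /* reclaim made progress */
        if (!sketch_oom_kill(g))
            break;                  /* only unreclaimable memory left */
    }
    return g->usage <= g->max;
}
```

Because the limit is already in place when enforcement starts, new charges are checked against the new value while reclaim and OOM kills bring the usage down.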
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6e6edcf
    • Johannes Weiner's avatar
      mm: memcontrol: reclaim when shrinking memory.high below usage · 588083bb
      Johannes Weiner authored
      When setting memory.high below usage, nothing happens until the next
      charge comes along, and then it will only reclaim its own charge and not
      the now potentially huge excess of the new memory.high.  This can cause
      groups to stay in excess of their memory.high indefinitely.
      
      To fix that, when shrinking memory.high, kick off a reclaim cycle that
      goes after the delta.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      588083bb
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: avoid memset() in walk_pfn() when count == 1 · d9b2ddf8
      Naoya Horiguchi authored
      I found that page-types is very slow and my testing shows many timeout
      errors.  Here's an example with a simple program allocating 1000 THPs.
      
        $ time ./page-types -p $(pgrep -f test_alloc)
        ...
        real    0m17.201s
        user    0m16.889s
        sys     0m0.312s
      
      Most of the time is spent in memset().  Currently memset() clears the
      whole buffer for every walk_pfn() call, which is inefficient when
      walk_pfn() is called from walk_vma(), because in that case walk_pfn() is
      called for each pfn.  So this patch limits the zero initialization to
      the first element when count == 1.
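
The idea can be sketched as follows.  The buffer size `SKETCH_BATCH` and the helper name are assumptions for illustration; the real tool uses its own batch constant and buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical batch size; the buffer passed in must hold this many slots. */
#define SKETCH_BATCH 1024

/*
 * Zero only what the caller is going to use: a single slot for the
 * per-pfn case, the whole buffer otherwise.
 */
static void sketch_clear(uint64_t *buf, unsigned long count)
{
    if (count == 1)
        buf[0] = 0;
    else
        memset(buf, 0, SKETCH_BATCH * sizeof(buf[0]));
}
```

When the function is called once per pfn, this replaces a full-buffer memset() on every call with a single store.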
      
        $ time ./page-types.patched -p $(pgrep -f test_alloc)
        ...
        real    0m0.182s
        user    0m0.046s
        sys     0m0.135s
      
      Fixes: 954e95584579 ("tools/vm/page-types.c: add memory cgroup dumping and filtering")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9b2ddf8
    • Li Zhang's avatar
      powerpc/mm: enable page parallel initialisation · 7f2bd006
      Li Zhang authored
      Parallel initialisation has already been enabled for X86, where boot
      time improved greatly.  On Power8, it improves boot time greatly for
      small memory sizes.  Here are the results from my tests on a Power8
      platform:
      
      For 4GB of memory, boot time is improved by 59%, from 24.5s to 10s.
      
      For 50GB of memory, boot time is improved by 22%, from 56.8s to 43.8s.
      Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f2bd006
    • Li Zhang's avatar
      mm: meminit: initialise more memory for inode/dentry hash tables in early boot · 987b3095
      Li Zhang authored
      Upstream already supports parallel page initialisation for X86, where
      boot time is improved greatly.  Some tests have been done for Power.

      Here are the results of my tests with different memory sizes.
      
      * 4GB memory:
          boot time with patch vs without patch: 10.4s vs 24.5s
          boot time is improved by 57%
      * 200GB memory:
          boot time looks the same with and without patches.
          boot time is about 38s
      * 32TB memory:
          boot time looks the same with and without patches.
          boot time is about 160s.
          The boot time is much shorter than X86 with 24TB memory.
          From community discussion, it takes about 694s for an X86 24TB system.
      
      Parallel initialisation improves performance by deferring memory
      initialisation to kswapd with N kthreads, so in theory it should improve
      boot time.

      In testing on X86, performance is improved greatly with huge memory.  But
      on the Power platform, it is improved greatly only with less than 100GB
      of memory.  For huge memory, it is not improved greatly.  But it still
      saves some time by using several threads, as the following information
      shows (32TB system log):
      
      [   22.648169] node 9 initialised, 16607461 pages in 280ms
      [   22.783772] node 3 initialised, 23937243 pages in 410ms
      [   22.858877] node 6 initialised, 29179347 pages in 490ms
      [   22.863252] node 2 initialised, 29179347 pages in 490ms
      [   22.907545] node 0 initialised, 32049614 pages in 540ms
      [   22.920891] node 15 initialised, 32212280 pages in 550ms
      [   22.923236] node 4 initialised, 32306127 pages in 550ms
      [   22.923384] node 12 initialised, 32314319 pages in 550ms
      [   22.924754] node 8 initialised, 32314319 pages in 550ms
      [   22.940780] node 13 initialised, 33353677 pages in 570ms
      [   22.940796] node 11 initialised, 33353677 pages in 570ms
      [   22.941700] node 5 initialised, 33353677 pages in 570ms
      [   22.941721] node 10 initialised, 33353677 pages in 570ms
      [   22.941876] node 7 initialised, 33353677 pages in 570ms
      [   22.944946] node 14 initialised, 33353677 pages in 570ms
      [   22.946063] node 1 initialised, 33345485 pages in 580ms
      
      This saves at least about 550*16 ms, although that is negligible
      compared with the total boot time of about 160 seconds.  What's more,
      the boot time on a huge memory machine is much shorter on Power than on
      x86, even without the patches.

      So it is still worth enabling this patchset for Power.
      
      This patch (of 2):
      
      This patch is based on Mel Gorman's old patch on the mailing list,
      https://lkml.org/lkml/2015/5/5/280, which was discussed but was replaced
      by a fix that uses a completion to wait for all memory to be initialised
      in page_alloc_init_late().  That fixed the OOM problem on X86 with 24TB
      of memory, which allocates memory in late initialisation.  But on a
      Power platform with 32TB of memory, it causes a call trace in
      vfs_caches_init->inode_init() because the inode hash table needs more
      memory.  So this patch allocates 1GB for 0.25TB/node for large systems,
      as mentioned in https://lkml.org/lkml/2015/5/1/627
      
      This call trace is found on Power with 32TB memory, 1024 CPUs, 16 nodes.
      Currently, it only allocates 2GB*16=32GB for early initialisation.  But
      the dentry cache hash table needs 16GB and the inode cache hash table
      needs 16GB, so the system does not have enough memory for them.  The log
      from dmesg is as follows:
      
        Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
        vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
        swapper/0: page allocation failure: order:0,mode:0x2080020
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
        Call Trace:
          .dump_stack+0xb4/0xb664 (unreliable)
          .warn_alloc_failed+0x114/0x160
          .__vmalloc_area_node+0x1a4/0x2b0
          .__vmalloc_node_range+0xe4/0x110
          .__vmalloc_node+0x40/0x50
          .alloc_large_system_hash+0x134/0x2a4
          .inode_init+0xa4/0xf0
          .vfs_caches_init+0x80/0x144
          .start_kernel+0x40c/0x4e0
          start_here_common+0x20/0x4a4
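
The sizes quoted above can be cross-checked with a little arithmetic.  The helper names are hypothetical, and the 8-byte entry size is inferred from the dmesg line (17179869184 bytes for 2147483648 entries) rather than taken from the source.

```c
#include <stdint.h>

#define SKETCH_GIB (1ULL << 30)

/* Bytes needed for a hash table with the given number of entries. */
static uint64_t hash_table_bytes(uint64_t entries, uint64_t entry_size)
{
    return entries * entry_size;
}

/* Memory initialised early when each node contributes a fixed amount. */
static uint64_t early_init_bytes(unsigned int nodes, uint64_t per_node)
{
    return (uint64_t)nodes * per_node;
}
```

With 16 nodes at 2GB each, the early pool is 32GB, which two 16GB hash tables would consume entirely, leaving nothing for any other early allocation.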
      Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      987b3095
    • Kirill A. Shutemov's avatar
      thp: fix deadlock in split_huge_pmd() · 5f737714
      Kirill A. Shutemov authored
      split_huge_pmd() tries to munlock the page with munlock_vma_page().  That
      requires the page to be locked.

      If the page is already locked by the caller, we would get a deadlock:
      
      	Unable to find swap-space signature
      	INFO: task trinity-c85:1907 blocked for more than 120 seconds.
      	      Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	trinity-c85     D ffff88084d997608     0  1907    309 0x00000000
      	Call Trace:
      	  schedule+0x9f/0x1c0
      	  schedule_timeout+0x48e/0x600
      	  io_schedule_timeout+0x1c3/0x390
      	  bit_wait_io+0x29/0xd0
      	  __wait_on_bit_lock+0x94/0x140
      	  __lock_page+0x1d4/0x280
      	  __split_huge_pmd+0x5a8/0x10f0
      	  split_huge_pmd_address+0x1d9/0x230
      	  try_to_unmap_one+0x540/0xc70
      	  rmap_walk_anon+0x284/0x810
      	  rmap_walk_locked+0x11e/0x190
      	  try_to_unmap+0x1b1/0x4b0
      	  split_huge_page_to_list+0x49d/0x18a0
      	  follow_page_mask+0xa36/0xea0
      	  SyS_move_pages+0xaf3/0x1570
      	  entry_SYSCALL_64_fastpath+0x12/0x6b
      	2 locks held by trinity-c85/1907:
      	 #0:  (&mm->mmap_sem){++++++}, at:  SyS_move_pages+0x933/0x1570
      	 #1:  (&anon_vma->rwsem){++++..}, at:  split_huge_page_to_list+0x402/0x18a0
      
      I don't think the deadlock is triggerable without the split_huge_page()
      simplification patchset.

      But munlock_vma_page() here is wrong: we want to munlock the page
      unconditionally, and there is no need for the rmap lookup that
      munlock_vma_page() does.
      
      Let's use clear_page_mlock() instead.  It can be called under ptl.
      
      Fixes: e90309c9 ("thp: allow mlocked THP again")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f737714
    • Kirill A. Shutemov's avatar
      thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov authored
      The freeze_page() and unfreeze_page() helpers have evolved into rather
      complex beasts.  It would be nice to cut the complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  The current
      helpers iterate over the VMAs the compound page is mapped to, and then
      over the ptes within each VMA.  The new helpers iterate over the small
      pages, then over the VMAs each small page is mapped to, and only then
      find the relevant pte.

      We have a shortcut for PMD-mapped THPs: we directly install migration
      entries on PMD split.

      I don't think the slowdown is critical, considering how much simpler the
      result is and that split_huge_page() is quite rare nowadays.  It only
      happens due to memory pressure or migration.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fec89c10