1. 06 Nov, 2019 10 commits
    • mm/page_alloc.c: ratelimit allocation failure warnings more aggressively · 1be334e5
      Johannes Weiner authored
      While investigating a bug related to higher atomic allocation failures,
      we noticed the failure warnings positively drowning the console and, in
      our case, triggering lockup warnings because the serial console was too
      slow to handle all that output.
      
      But even if we had a faster console, it's unclear what additional
      information the current level of repetition provides.
      
      Allocation failures happen for three reasons: The machine is OOM, the VM
      is failing to handle reasonable requests, or somebody is making
      unreasonable requests (and didn't acknowledge their opportunism with
      __GFP_NOWARN).  Having the memory dump, a callstack, and the ratelimit
      stats on skipped failure warnings should provide enough information to
      let users/admins/developers know whether something is wrong and point
      them in the right direction for debugging, bpftracing etc.
      
      Limit allocation failure warnings to one spew every ten seconds.
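
The effect of "one spew every ten seconds" can be pictured with a small user-space model of an interval/burst rate limiter. This is only a sketch: `struct ratelimit`, `ratelimit_allow()` and the explicit `now` argument are hypothetical names chosen for illustration, not the kernel's ratelimit API. The ten-second interval and burst of one come from the sentence above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of an interval/burst rate limiter. */
struct ratelimit {
    long interval;   /* window length in seconds */
    int burst;       /* messages allowed per window */
    long begin;      /* start of the current window */
    int printed;     /* messages emitted in this window */
    int missed;      /* messages suppressed in this window */
    bool started;    /* false until the first call */
};

/*
 * Return true if a warning may be emitted at time `now` (seconds).
 * When a new window opens, *skipped receives the number of warnings
 * suppressed in the previous window so the caller can report it.
 */
static bool ratelimit_allow(struct ratelimit *rl, long now, int *skipped)
{
    assert(rl->interval > 0 && rl->burst > 0);
    *skipped = 0;

    if (!rl->started || now - rl->begin >= rl->interval) {
        *skipped = rl->missed;
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
        rl->started = true;
    }

    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }

    rl->missed++;
    return false;
}
```

With `interval = 10` and `burst = 1`, a failure at t=0 is reported, failures at t=1 and t=9 are only counted, and the report at t=10 can mention that two warnings were skipped.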
      
      Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y · ec649c9d
      Ville Syrjälä authored
      I got some khugepaged spew on a 32bit x86:
      
        BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
        INFO: lockdep is turned off.
        CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
        Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203    07/08/2009
        Call Trace:
         dump_stack+0x66/0x8e
         ___might_sleep.cold.96+0x95/0xa6
         __might_sleep+0x2e/0x80
         collapse_huge_page.isra.51+0x5ac/0x1360
         khugepaged+0x9a9/0x20f0
         kthread+0xf5/0x110
         ret_from_fork+0x2e/0x38
      
      Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
      vs.  mmu_notifier_invalidate_range_start().  Let's do the naive approach
      and just reorder the two operations.
      
      Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
      Fixes: 810e24e0 ("mm/mmu_notifiers: annotate with might_sleep()")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo · 93b3a674
      Michal Hocko authored
      pagetypeinfo_showfree_print is called with zone->lock held and IRQs
      disabled.  This is not really nice because it blocks both interrupts on
      that CPU and the page allocator.  On large machines this might even
      trigger the hard lockup detector.

      Considering that pagetypeinfo is a debugging tool, we do not really need
      exact numbers here.  The primary reason to look at the output is to see
      how pageblocks are spread among different migratetypes, and a low number
      of pages is much more interesting.  Putting a bound on the number of
      pages counted on the free_list therefore sounds like a reasonable
      tradeoff.
      
      The new output will simply tell
        [...]
        Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
      
      instead of
        Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
      
      The limit has been chosen arbitrarily and is subject to future change
      should there be a need for that.

      While we are at it, also drop the zone lock after each free_list
      iteration, which will help with the IRQ and page allocator
      responsiveness even further, as the time the lock is held with IRQs
      disabled is always bounded by those 100k pages.
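
The two ideas in this message, capping the walk and releasing the lock between lists, can be sketched in user space. Everything below is illustrative: `struct node`, `count_bounded()` and `format_count()` are hypothetical names, a pthread mutex stands in for zone->lock, and the exact boundary condition is an assumption rather than the kernel's code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define COUNT_LIMIT 100000UL   /* bound mentioned in the commit text */

/* Hypothetical singly linked free list, standing in for a zone free_list. */
struct node {
    struct node *next;
};

/*
 * Count entries on `head`, giving up once the count exceeds COUNT_LIMIT.
 * The lock is taken per list, so it is never held for more than
 * COUNT_LIMIT + 1 iterations.  *overflow is set when the walk was cut short.
 */
static unsigned long count_bounded(pthread_mutex_t *lock,
                                   const struct node *head, int *overflow)
{
    unsigned long count = 0;

    *overflow = 0;
    pthread_mutex_lock(lock);
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (++count > COUNT_LIMIT) {
            *overflow = 1;
            count = COUNT_LIMIT;
            break;
        }
    }
    pthread_mutex_unlock(lock);   /* drop the lock before the next list */
    return count;
}

/* Format one column the way the new output does: ">100000" when capped. */
static int format_count(char *buf, size_t len, unsigned long count, int overflow)
{
    return snprintf(buf, len, "%s%lu", overflow ? ">" : "", count);
}
```

A caller would invoke `count_bounded()` once per order and migratetype, so other lock users get a chance to run between lists.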
      
      [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
      Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: hide /proc/pagetypeinfo from normal users · abaed011
      Michal Hocko authored
      /proc/pagetypeinfo is a debugging tool to examine internal page
      allocator state with respect to fragmentation.  It is not very useful
      for anything else, so normal users really do not need to read this
      file.

      Waiman Long has noticed that reading this file can have negative side
      effects because zone->lock is necessary for gathering data and that a)
      interferes with the page allocator and its users and b) can lead to hard
      lockups on large machines which have very long free_lists.

      Mitigate both issues by simply not exporting the file to regular users.
      
      Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Waiman Long <longman@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jann Horn <jannh@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifiers: use the right return code for WARN_ON · df2ec764
      Jason Gunthorpe authored
      The return code from the op callback is actually in _ret, while the
      WARN_ON was checking ret which causes it to misfire.
      
      Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
      Fixes: 8402ce61 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: protect extent tree in ocfs2_prepare_inode_for_write() · e74540b2
      Shuning Zhang authored
      When the extent tree is modified, it should be protected by the inode
      cluster lock and ip_alloc_sem.

      The extent tree is accessed and modified in
      ocfs2_prepare_inode_for_write(), but isn't protected by ip_alloc_sem
      there.

      The following is one such case: ocfs2_fiemap() is accessing the extent
      tree while it is being modified at the same time.
      
        kernel BUG at fs/ocfs2/extent_map.c:475!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
        CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
        Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
        task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
        RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        Call Trace:
          ocfs2_fiemap+0x1e3/0x430 [ocfs2]
          do_vfs_ioctl+0x155/0x510
          SyS_ioctl+0x81/0xa0
          system_call_fastpath+0x18/0xd8
        Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
        RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        ---[ end trace c8aa0c8180e869dc ]---
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: disabled
      
      This issue can be reproduced every week in a production environment.
      
      This issue is related to the usage mode.  If others use ocfs2 in this
      mode, the kernel will panic frequently.
      
      [akpm@linux-foundation.org: coding style fixes]
      [Fix new warning due to unused function by removing said function - Linus ]
      Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
      Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: handle page cache THP correctly in PageTransCompoundMap · 169226f7
      Yang Shi authored
      We have a use case that uses tmpfs as the QEMU memory backend, and we
      would like to take advantage of THP as well.  But our tests show that
      the EPT is not PMD mapped even though the underlying THPs are PMD mapped
      on the host.  The number shown by /sys/kernel/debug/kvm/largepages is
      much smaller than the number of PMD mapped shmem pages, as shown below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks if
      there is a single PTE mapping on the page for anonymous THP when setting
      up the EPT map.  But the _mapcount < 0 check doesn't work for page cache
      THP, since every subpage of a page cache THP gets its _mapcount
      incremented once it is PMD mapped, so PageTransCompoundMap() always
      returns false for page cache THP.  This prevents KVM from setting up
      PMD mapped EPT entries.

      So we need to handle page cache THP correctly.  However, when a page
      cache THP's PMD gets split, the kernel just removes the mapping instead
      of setting up PTE mappings as it does for anonymous THP.  Before KVM
      calls get_user_pages(), the subpages may get PTE mapped even though the
      page is still a THP, since the page cache THP may be mapped by other
      processes in the meantime.

      Check its _mapcount and whether the THP is PTE mapped or not.  Although
      this may report some false negatives (PTE mapped by other processes),
      it does not look trivial to make this accurate.

      With this fix, /sys/kernel/debug/kvm/largepages shows that a reasonable
      number of pages are PMD mapped by EPT, as shown below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks perform the same as with anonymous THPs.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Mel Gorman authored
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max.  The configuration
      was modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before the patch, the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
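
The batch and high values in the listings above follow a fixed relation (378 = 6 × 63, 42 = 6 × 7). The sketch below models how such values could be derived from a zone's managed page count. The sizing formula, the 4 KiB page size, and the names `sketch_batch()` and `sketch_high()` are assumptions for illustration. The commit text does not spell out the formula, so treat this as an approximation of the allocator's heuristic rather than the actual kernel code.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed 4 KiB pages */

/* Largest power of two that is <= n (n must be non-zero). */
static unsigned long rounddown_pow2(unsigned long n)
{
    unsigned long p = 1;

    assert(n != 0);
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Assumed heuristic: ~1/1024 of the zone, capped at 1 MiB, as 2^n - 1. */
static unsigned long sketch_batch(unsigned long managed_pages)
{
    unsigned long batch = managed_pages / 1024;

    if (batch * SKETCH_PAGE_SIZE > 1024 * 1024)
        batch = (1024 * 1024) / SKETCH_PAGE_SIZE;
    batch /= 4;
    if (batch < 1)
        batch = 1;
    return rounddown_pow2(batch + batch / 2) - 1;
}

/* In the listings above, high is six times batch. */
static unsigned long sketch_high(unsigned long batch)
{
    return 6 * batch;
}
```

Under these assumptions, a zone that appears to have only 32768 managed pages gets batch 7 and high 42, while a fully initialised multi-gigabyte zone gets 63 and 378. That mirrors why recalculating after deferred initialisation moves the 512 undersized pagesets to the larger values.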
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
      Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup_benchmark: fix MAP_HUGETLB case · 64801d19
      John Hubbard authored
      The MAP_HUGETLB ("-H" option) of gup_benchmark fails:
      
        $ sudo ./gup_benchmark -H
        mmap: Invalid argument
      
      This is because gup_benchmark.c is passing in a file descriptor to
      mmap(), but the fd came from opening up the /dev/zero file.  This
      confuses the mmap syscall implementation, which thinks that, if the
      caller did not specify MAP_ANONYMOUS, then the file must be a huge page
      file.  So it attempts to verify that the file really is a huge page
      file, as you can see here:
      
      ksys_mmap_pgoff()
      {
          if (!(flags & MAP_ANONYMOUS)) {
              retval = -EINVAL;
              if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
                  goto out_fput; /* THIS IS WHERE WE END UP */
      
          else if (flags & MAP_HUGETLB) {
              ...proceed normally, /dev/zero is ok here...
      
      ...and of course is_file_hugepages() returns "false" for the /dev/zero
      file.
      
      The problem is that the user space program, gup_benchmark.c, really just
      wants anonymous memory here.  The simplest way to get that is to pass
      MAP_ANONYMOUS whenever MAP_HUGETLB is specified, so that's what this
      patch does.
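
The approach described above can be sketched as a small user-space helper. `buffer_mmap_flags()` and `map_buffer()` are hypothetical names for illustration and are not taken from gup_benchmark.c. The sketch only shows the idea of pairing MAP_HUGETLB with MAP_ANONYMOUS so that the file descriptor no longer matters.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for MAP_HUGETLB and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: build mmap() flags for a private test buffer.
 * When huge pages are requested, MAP_ANONYMOUS is added as well.
 */
static int buffer_mmap_flags(bool want_hugetlb)
{
    int flags = MAP_PRIVATE;

    if (want_hugetlb)
        flags |= MAP_HUGETLB | MAP_ANONYMOUS;
    return flags;
}

/*
 * Map `size` bytes.  `fd` is only consulted for file-backed mappings;
 * anonymous mappings conventionally pass -1.
 */
static void *map_buffer(size_t size, int fd, bool want_hugetlb)
{
    int flags = buffer_mmap_flags(want_hugetlb);

    assert(size > 0);
    if (flags & MAP_ANONYMOUS)
        fd = -1;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
}
```

The huge page mapping can still fail with ENOMEM if no huge pages are reserved on the system, so callers should compare the result against MAP_FAILED.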
      
      Link: http://lkml.kernel.org/r/20191021212435.398153-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix NULL-ptr deref in percpu stats flush · 7961eee3
      Shakeel Butt authored
      __mem_cgroup_free() can be called on the failure path in
      mem_cgroup_alloc().  However, memcg_flush_percpu_vmstats() and
      memcg_flush_percpu_vmevents(), which are called from
      __mem_cgroup_free(), access fields of memcg that can be NULL when
      called from the failure path of mem_cgroup_alloc().  Indeed, syzbot has
      reported the following crash:
      
      	kasan: CONFIG_KASAN_INLINE enabled
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
      	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
      	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
      	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
      	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
      	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
      	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
      	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
      	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
      	mem_cgroup_free mm/memcontrol.c:5033 [inline]
      	mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
      	css_create kernel/cgroup/cgroup.c:5156 [inline]
      	cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
      	cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
      	kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
      	vfs_mkdir+0x42e/0x670 fs/namei.c:3807
      	do_mkdirat+0x234/0x2a0 fs/namei.c:3830
      	__do_sys_mkdir fs/namei.c:3846 [inline]
      	__se_sys_mkdir fs/namei.c:3844 [inline]
      	__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
      	do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
      	entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by moving the flush to mem_cgroup_free(), as there is no need
      to flush anything if we see a failure in mem_cgroup_alloc().
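
The shape of this fix, keeping the flush out of the low-level release that the allocation failure path shares, can be sketched in user space. All names here (`struct group`, `group_flush()`, `group_release()`, `group_free()`, `group_alloc()`) are hypothetical stand-ins and do not correspond to the memcg code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object with a separately allocated stats array. */
struct group {
    long *percpu_stats;   /* NULL until allocation fully succeeds */
    long total;
};

static int flush_calls;   /* counts flushes, for illustration only */

/* Fold per-CPU stats into the total; requires a fully built object. */
static void group_flush(struct group *g)
{
    assert(g->percpu_stats != NULL);
    g->total += g->percpu_stats[0];
    flush_calls++;
}

/* Low-level release: safe on a partially constructed object. */
static void group_release(struct group *g)
{
    free(g->percpu_stats);   /* free(NULL) is a no-op */
    free(g);
}

/* Normal teardown: flush first, then release. */
static void group_free(struct group *g)
{
    group_flush(g);
    group_release(g);
}

/* Allocation; `fail_stats` simulates a mid-construction failure. */
static struct group *group_alloc(int fail_stats)
{
    struct group *g = calloc(1, sizeof(*g));

    if (!g)
        return NULL;
    if (!fail_stats)
        g->percpu_stats = calloc(1, sizeof(*g->percpu_stats));
    if (!g->percpu_stats) {
        group_release(g);   /* nothing to flush on the failure path */
        return NULL;
    }
    return g;
}
```

Because only `group_free()` flushes, a failed `group_alloc()` never dereferences the missing stats array.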
      
      Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
      Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
      Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 Nov, 2019 2 commits
    • Linux 5.4-rc6 · a99d8080
      Linus Torvalds authored
    • Merge tag 'usb-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 3a69c9e5
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "The USB sub-maintainers woke up this past week and sent a bunch of
        tiny fixes. Here are a lot of small patches that resolve a bunch
        of reported issues in the USB core, drivers, serial drivers, gadget
        drivers, and of course, xhci :)
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (31 commits)
        usb: dwc3: gadget: fix race when disabling ep with cancelled xfers
        usb: cdns3: gadget: Fix g_audio use case when connected to Super-Speed host
        usb: cdns3: gadget: reset EP_CLAIMED flag while unloading
        USB: serial: whiteheat: fix line-speed endianness
        USB: serial: whiteheat: fix potential slab corruption
        USB: gadget: Reject endpoints with 0 maxpacket value
        UAS: Revert commit 3ae62a42 ("UAS: fix alignment of scatter/gather segments")
        usb-storage: Revert commit 747668db ("usb-storage: Set virt_boundary_mask to avoid SG overflows")
        usbip: Fix free of unallocated memory in vhci tx
        usbip: tools: Fix read_usb_vudc_device() error path handling
        usb: xhci: fix __le32/__le64 accessors in debugfs code
        usb: xhci: fix Immediate Data Transfer endianness
        xhci: Fix use-after-free regression in xhci clear hub TT implementation
        USB: ldusb: fix control-message timeout
        USB: ldusb: use unsigned size format specifiers
        USB: ldusb: fix ring-buffer locking
        USB: Skip endpoints with 0 maxpacket length
        usb: cdns3: gadget: Don't manage pullups
        usb: dwc3: remove the call trace of USBx_GFLADJ
        usb: gadget: configfs: fix concurrent issue between composite APIs
        ...
  3. 02 Nov, 2019 10 commits
    • Merge tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6 · 56cfd250
      Linus Torvalds authored
      Pull cifs fix from Steve French:
       "A small smb3 memleak fix"
      
      * tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
        fix memory leak in large read decrypt offload
    • Merge tag 'hwmon-for-v5.4-rc6' of... · 9d234505
      Linus Torvalds authored
      Merge tag 'hwmon-for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
      
       - Fix read timeout problem in ina3221 driver
      
       - Fix wrong bitmask in nct7904 driver
      
      * tag 'hwmon-for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (ina3221) Fix read timeout issue
        hwmon: (nct7904) Fix the incorrect value of vsen_mask & tcpu_mask & temp_mode in nct7904_data struct.
    • Merge tag 'pwm/for-5.4-rc6' of... · e935842a
      Linus Torvalds authored
      Merge tag 'pwm/for-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
      
      Pull pwm fixes from Thierry Reding:
       "It turned out that relying solely on drivers storing all the PWM state
        in hardware was a little premature and causes a number of subtle (and
        some not so subtle) regressions. Revert the offending patch for now"
      
      * tag 'pwm/for-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
        Revert "pwm: Let pwm_get_state() return the last implemented state"
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · f83e148a
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Nine changes, eight in drivers [ufs, target, lpfc x 2, qla2xxx x 4]
        and one core change in sd that fixes an I/O failure on DIF type 3
        devices"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: stop timer in shutdown path
        scsi: sd: define variable dif as unsigned int instead of bool
        scsi: target: cxgbit: Fix cxgbit_fw4_ack()
        scsi: qla2xxx: Fix partial flash write of MBI
        scsi: qla2xxx: Initialized mailbox to prevent driver load failure
        scsi: lpfc: Honor module parameter lpfc_use_adisc
        scsi: ufs-bsg: Wake the device before sending raw upiu commands
        scsi: lpfc: Check queue pointer before use
        scsi: qla2xxx: fixup incorrect usage of host_byte
    • Merge tag 'powerpc-5.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 8194c28e
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Our recent cleanup of EEH led to an oops on bare metal machines when
        the cxl (CAPI) driver creates virtual devices for an attached FPGA
        accelerator.
      
        The "secure virtual machine" support we added in v5.4 had a bug if the
        kernel was relocated (moved during boot), in those cases the signature
        of the kernel text wouldn't verify and the Ultravisor would refuse to
        run the VM.
      
        A recent change to disable interrupts before calling
        arch_cpu_idle_dead() caused a WARN_ON() in our bare metal CPU offline
        code to always trigger.
      
        The KUAP (SMAP) support we added for 32-bit Book3S had a bug if the
        address range crossed a segment (256MB) boundary which could lead to
        spurious faults.
      
        Thanks to: Christophe Leroy, Frederic Barrat, Michael Anderson,
        Nicholas Piggin, Sam Bobroff, Thiago Jung Bauermann"
      
      * tag 'powerpc-5.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Fix CPU idle to be called with IRQs disabled
        powerpc/prom_init: Undo relocation before entering secure mode
        powerpc/powernv/eeh: Fix oops when probing cxl devices
        powerpc/32s: fix allow/prevent_user_access() when crossing segment boundaries.
    • Merge tag 's390-5.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 969a5197
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix cpu idle time accounting
      
       - Fix stack unwinder case when both pt_regs and sp are specified
      
       - Fix information leak via cmm timeout proc handler
      
      * tag 's390-5.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/idle: fix cpu idle time calculation
        s390/unwind: fix mixing regs and sp
        s390/cmm: fix information leak in cmm_timeout_handler()
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 1204c70d
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix free/alloc races in batmanadv, from Sven Eckelmann.
      
       2) Several leaks and other fixes in kTLS support of mlx5 driver, from
          Tariq Toukan.
      
       3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke
          Høiland-Jørgensen.
      
       4) Add an r8152 device ID, from Kazutoshi Noguchi.
      
       5) Missing include in ipv6's addrconf.c, from Ben Dooks.
      
       6) Use siphash in flow dissector, from Eric Dumazet. Otherwise
          attackers can easily infer the 32-bit secret.
      
       7) Several netdevice nesting depth fixes from Taehee Yoo.
      
       8) Fix several KCSAN reported errors, from Eric Dumazet. For example,
          when doing lockless skb_queue_empty() checks, and accessing
          sk_napi_id/sk_incoming_cpu lockless as well.
      
       9) Fix jumbo packet handling in RXRPC, from David Howells.
      
      10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet.
      
      11) Fix DMA synchronization in gve driver, from Yangchun Fu.
      
      12) Several bpf offload fixes, from Jakub Kicinski.
      
      13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo.
      
      14) Fix ping latency during high traffic rates in hisilicon driver, from
          Jiangfeng Xiao.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
        net: fix installing orphaned programs
        net: cls_bpf: fix NULL deref on offload filter removal
        selftests: bpf: Skip write only files in debugfs
        selftests: net: reuseport_dualstack: fix uninitalized parameter
        r8169: fix wrong PHY ID issue with RTL8168dp
        net: dsa: bcm_sf2: Fix IMP setup for port different than 8
        net: phylink: Fix phylink_dbg() macro
        gve: Fixes DMA synchronization.
        inet: stop leaking jiffies on the wire
        ixgbe: Remove duplicate clear_bit() call
        Documentation: networking: device drivers: Remove stray asterisks
        e1000: fix memory leaks
        i40e: Fix receive buffer starvation for AF_XDP
        igb: Fix constant media auto sense switching when no cable is connected
        net: ethernet: arc: add the missed clk_disable_unprepare
        igb: Enable media autosense for the i350.
        igb/igc: Don't warn on fatal read failures when the device is removed
        tcp: increase tcp_max_syn_backlog max value
        net: increase SOMAXCONN to 4096
        netdevsim: Fix use-after-free during device dismantle
        ...
      1204c70d
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs · 372bf6c1
      Linus Torvalds authored
      Pull NFS client bugfixes from Anna Schumaker:
       "This contains two delegation fixes (with the RCU lock leak fix marked
        for stable), and three patches to fix destroying the sunrpc back
        channel.
      
        Stable bugfixes:
      
         - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
      
        Other fixes:
      
         - The TCP back channel mustn't disappear while requests are
           outstanding
      
         - The RDMA back channel mustn't disappear while requests are
           outstanding
      
         - Destroy the back channel when we destroy the host transport
      
         - Don't allow a cached open with a revoked delegation"
      
      * tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
        NFSv4: Don't allow a cached open with a revoked delegation
        SUNRPC: Destroy the back channel when we destroy the host transport
        SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
        SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
      372bf6c1
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20191101' of git://git.kernel.dk/linux-block · 0821de28
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Two small nvme fixes, one is a fabrics connection fix, the other one
         a cleanup made possible by that fix (Anton, via Keith)
      
       - Fix requeue handling in um-ubd (Anton)
      
       - Fix spin_lock_irq() nesting in blk-iocost (Dan)
      
       - Three small io_uring fixes:
           - Install io_uring fd after done with ctx (me)
           - Clear ->result before every poll issue (me)
           - Fix leak of shadow request on error (Pavel)
      
      * tag 'for-linus-20191101' of git://git.kernel.dk/linux-block:
        iocost: don't nest spin_lock_irq in ioc_weight_write()
        io_uring: ensure we clear io_kiocb->result before each issue
        um-ubd: Entrust re-queue to the upper layers
        nvme-multipath: remove unused groups_only mode in ana log
        nvme-multipath: fix possible io hang after ctrl reconnect
        io_uring: don't touch ctx in setup after ring fd install
        io_uring: Fix leaked shadow_req
      0821de28
    • Linus Torvalds's avatar
      Merge tag 'riscv/for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · e5897c7d
      Linus Torvalds authored
      Pull RISC-V fixes from Paul Walmsley:
       "One fix for PCIe users:
      
         - Fix legacy PCI I/O port access emulation
      
        One set of cleanups:
      
         - Resolve most of the warnings generated by sparse across arch/riscv.
           No functional changes
      
        And one MAINTAINERS update:
      
         - Update Palmer's E-mail address"
      
      * tag 'riscv/for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        MAINTAINERS: Change to my personal email address
        RISC-V: Add PCIe I/O BAR memory mapping
        riscv: for C functions called only from assembly, mark with __visible
        riscv: fp: add missing __user pointer annotations
        riscv: add missing header file includes
        riscv: mark some code and data as file-static
        riscv: init: merge split string literals in preprocessor directive
        riscv: add prototypes for assembly language functions from head.S
      e5897c7d
  4. 01 Nov, 2019 18 commits