1. 08 May, 2020 17 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · af38553c
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "14 fixes and one selftest to verify the ipc fixes herein"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: limit boost_watermark on small zones
        ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST
        mm/vmscan: remove unnecessary argument description of isolate_lru_pages()
        epoll: atomically remove wait entry on wake up
        kselftests: introduce new epoll60 testcase for catching lost wakeups
        percpu: make pcpu_alloc() aware of current gfp context
        mm/slub: fix incorrect interpretation of s->offset
        scripts/gdb: repair rb_first() and rb_last()
        eventpoll: fix missing wakeup for ovflist in ep_poll_callback
        arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
        scripts/decodecode: fix trapping instruction formatting
        kernel/kcov.c: fix typos in kcov_remote_start documentation
        mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
        mm, memcg: fix error return value of mem_cgroup_css_alloc()
        ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
      af38553c
    • Linus Torvalds's avatar
      Merge branch 'for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · 79dede78
      Linus Torvalds authored
      Pull security subsystem fix from James Morris:
       "Fix the default value of fs_context_parse_param hook"
      
      * 'for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
        security: Fix the default value of fs_context_parse_param hook
      79dede78
    • Henry Willard's avatar
      mm: limit boost_watermark on small zones · 14f69140
      Henry Willard authored
      Commit 1c30844d ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of managed pages managed in
      the zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
      
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: default avatarHenry Willard <henry.willard@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      14f69140
    • Kees Cook's avatar
      ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST · 8d58f222
      Kees Cook authored
      The documentation for UBSAN_ALIGNMENT already mentions that it should
      not be used on all*config builds (and for efficient-unaligned-access
      architectures), so just refactor the Kconfig to correctly implement this
      so randconfigs will stop creating insane images that freak out objtool
      under CONFIG_UBSAN_TRAP (due to the false positives producing functions
      that never return, etc).
      
      Link: http://lkml.kernel.org/r/202005011433.C42EA3E2D@keescook
      Fixes: 0887a7eb ("ubsan: add trap instrumentation option")
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reported-by: default avatarRandy Dunlap <rdunlap@infradead.org>
        Link: https://lore.kernel.org/linux-next/202004231224.D6B3B650@keescook/Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d58f222
    • Qiwu Chen's avatar
      mm/vmscan: remove unnecessary argument description of isolate_lru_pages() · 17e34526
      Qiwu Chen authored
      Since commit a9e7c39f ("mm/vmscan.c: remove 7th argument of
      isolate_lru_pages()"), the explanation of 'mode' argument has been
      unnecessary.  Let's remove it.
      Signed-off-by: default avatarQiwu Chen <chenqiwu@xiaomi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200501090346.2894-1-chenqiwu@xiaomi.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      17e34526
    • Roman Penyaev's avatar
      epoll: atomically remove wait entry on wake up · 412895f0
      Roman Penyaev authored
      This patch does two things:
      
       - fixes a lost wakeup introduced by commit 339ddb53 ("fs/epoll:
         remove unnecessary wakeups of nested epoll")
      
       - improves performance for events delivery.
      
      The description of the problem is the following: if N (>1) threads are
      waiting on ep->wq for new events and M (>1) events come, it is quite
      likely that >1 wakeups hit the same wait queue entry, because there is
      quite a big window between __add_wait_queue_exclusive() and the
      following __remove_wait_queue() calls in ep_poll() function.
      
      This can lead to lost wakeups, because thread, which was woken up, can
      handle not all the events in ->rdllist.  (in better words the problem is
      described here: https://lkml.org/lkml/2019/10/7/905)
      
      The idea of the current patch is to use init_wait() instead of
      init_waitqueue_entry().
      
      Internally init_wait() sets autoremove_wake_function as a callback,
      which removes the wait entry atomically (under the wq locks) from the
      list, thus the next coming wakeup hits the next wait entry in the wait
      queue, thus preventing lost wakeups.
      
      Problem is very well reproduced by the epoll60 test case [1].
      
      Wait entry removal on wakeup has also performance benefits, because
      there is no need to take a ep->lock and remove wait entry from the queue
      after the successful wakeup.  Here is the timing output of the epoll60
      test case:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior 339ddb53):
      
          real    0m6.970s
          user    0m49.786s
          sys     0m0.113s
      
       After this patch:
      
         real    0m5.220s
         user    0m36.879s
         sys     0m0.019s
      
      The other testcase is the stress-epoll [2], where one thread consumes
      all the events and other threads produce many events:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior 339ddb53):
      
          threads  events/ms  run-time ms
                8       5427         1474
               16       6163         2596
               32       6824         4689
               64       7060         9064
              128       6991        18309
      
       After this patch:
      
          threads  events/ms  run-time ms
                8       5598         1429
               16       7073         2262
               32       7502         4265
               64       7640         8376
              128       7634        16767
      
       (number of "events/ms" represents event bandwidth, thus higher is
        better; number of "run-time ms" represents overall time spent
        doing the benchmark, thus lower is better)
      
      [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
      [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.cSigned-off-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarJason Baron <jbaron@akamai.com>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Heiher <r@hev.cc>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.deSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      412895f0
    • Roman Penyaev's avatar
      kselftests: introduce new epoll60 testcase for catching lost wakeups · 474328c0
      Roman Penyaev authored
      This test case catches lost wake up introduced by commit 339ddb53
      ("fs/epoll: remove unnecessary wakeups of nested epoll")
      
      The test is simple: we have 10 threads and 10 event fds.  Each thread
      can harvest only 1 event.  1 producer fires all 10 events at once and
      waits that all 10 events will be observed by 10 threads.
      
      In case of lost wakeup epoll_wait() will timeout and 0 will be returned.
      
      Test case catches two sort of problems: forgotten wakeup on event, which
      hits the ->ovflist list, this problem was fixed by:
      
        5a2513239750 ("eventpoll: fix missing wakeup for ovflist in ep_poll_callback")
      
      the other problem is when several sequential events hit the same waiting
      thread, thus other waiters get no wakeups.  Problem is fixed in the
      following patch.
      Signed-off-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Link: http://lkml.kernel.org/r/20200430130326.1368509-1-rpenyaev@suse.deSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      474328c0
    • Filipe Manana's avatar
      percpu: make pcpu_alloc() aware of current gfp context · 28307d93
      Filipe Manana authored
      Since 5.7-rc1, on btrfs we have a percpu counter initialization for
      which we always pass a GFP_KERNEL gfp_t argument (this happens since
      commit 2992df73 ("btrfs: Implement DREW lock")).
      
      That is safe in some contextes but not on others where allowing fs
      reclaim could lead to a deadlock because we are either holding some
      btrfs lock needed for a transaction commit or holding a btrfs
      transaction handle open.  Because of that we surround the call to the
      function that initializes the percpu counter with a NOFS context using
      memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
      
      However it turns out that this is not enough to prevent a possible
      deadlock because percpu_alloc() determines if it is in an atomic context
      by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
      case) and it is not aware that a NOFS context is set.
      
      Because percpu_alloc() thinks it is in a non atomic context it locks the
      pcpu_alloc_mutex.  This can result in a btrfs deadlock when
      pcpu_balance_workfn() is running, has acquired that mutex and is waiting
      for reclaim, while the btrfs task that called percpu_counter_init() (and
      therefore percpu_alloc()) is holding either the btrfs commit_root
      semaphore or a transaction handle (done fs/btrfs/backref.c:
      iterate_extent_inodes()), which prevents reclaim from finishing as an
      attempt to commit the current btrfs transaction will deadlock.
      
      Lockdep reports this issue with the following trace:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.6.0-rc7-btrfs-next-77 #1 Not tainted
        ------------------------------------------------------
        kswapd0/91 is trying to acquire lock:
        ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}:
               fs_reclaim_acquire.part.0+0x25/0x30
               __kmalloc+0x5f/0x3a0
               pcpu_create_chunk+0x19/0x230
               pcpu_balance_workfn+0x56a/0x680
               process_one_work+0x235/0x5f0
               worker_thread+0x50/0x3b0
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        -> #3 (pcpu_alloc_mutex){+.+.}:
               __mutex_lock+0xa9/0xaf0
               pcpu_alloc+0x480/0x7c0
               __percpu_counter_init+0x50/0xd0
               btrfs_drew_lock_init+0x22/0x70 [btrfs]
               btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
               resolve_indirect_refs+0x120/0xa30 [btrfs]
               find_parent_nodes+0x50b/0xf30 [btrfs]
               btrfs_find_all_leafs+0x60/0xb0 [btrfs]
               iterate_extent_inodes+0x139/0x2f0 [btrfs]
               iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
               btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
               btrfs_ioctl+0x165a/0x3130 [btrfs]
               ksys_ioctl+0x87/0xc0
               __x64_sys_ioctl+0x16/0x20
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&fs_info->commit_root_sem){++++}:
               down_write+0x38/0x70
               btrfs_cache_block_group+0x2ec/0x500 [btrfs]
               find_free_extent+0xc6a/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               commit_cowonly_roots+0x55/0x310 [btrfs]
               btrfs_commit_transaction+0x509/0xb20 [btrfs]
               sync_filesystem+0x74/0x90
               generic_shutdown_super+0x22/0x100
               kill_anon_super+0x14/0x30
               btrfs_kill_super+0x12/0x20 [btrfs]
               deactivate_locked_super+0x31/0x70
               cleanup_mnt+0x100/0x160
               task_work_run+0x93/0xc0
               exit_to_usermode_loop+0xf9/0x100
               do_syscall_64+0x20d/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&space_info->groups_sem){++++}:
               down_read+0x3c/0x140
               find_free_extent+0xef6/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               btrfs_search_slot+0x50c/0xd60 [btrfs]
               btrfs_lookup_inode+0x3a/0xc0 [btrfs]
               __btrfs_update_delayed_inode+0x90/0x280 [btrfs]
               __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
               __btrfs_run_delayed_items+0x8e/0x180 [btrfs]
               btrfs_commit_transaction+0x31b/0xb20 [btrfs]
               iterate_supers+0x87/0xf0
               ksys_sync+0x60/0xb0
               __ia32_sys_sync+0xa/0x10
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&delayed_node->mutex){+.+.}:
               __lock_acquire+0xef0/0x1c80
               lock_acquire+0xa2/0x1d0
               __mutex_lock+0xa9/0xaf0
               __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
               btrfs_evict_inode+0x40d/0x560 [btrfs]
               evict+0xd9/0x1c0
               dispose_list+0x48/0x70
               prune_icache_sb+0x54/0x80
               super_cache_scan+0x124/0x1a0
               do_shrink_slab+0x176/0x440
               shrink_slab+0x23a/0x2c0
               shrink_node+0x188/0x6e0
               balance_pgdat+0x31d/0x7f0
               kswapd+0x238/0x550
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(fs_reclaim);
                                       lock(pcpu_alloc_mutex);
                                       lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/91:
         #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
         #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
         #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8f/0xd0
         check_noncircular+0x170/0x190
         __lock_acquire+0xef0/0x1c80
         lock_acquire+0xa2/0x1d0
         __mutex_lock+0xa9/0xaf0
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x40d/0x560 [btrfs]
         evict+0xd9/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x124/0x1a0
         do_shrink_slab+0x176/0x440
         shrink_slab+0x23a/0x2c0
         shrink_node+0x188/0x6e0
         balance_pgdat+0x31d/0x7f0
         kswapd+0x238/0x550
         kthread+0x120/0x140
         ret_from_fork+0x3a/0x50
      
      This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
      to percpu_counter_init() in contextes where it is not reclaim safe,
      however that type of approach is discouraged since
      memalloc_[nofs|noio]_save() were introduced.  Therefore this change
      makes pcpu_alloc() look up into an existing nofs/noio context before
      deciding whether it is in an atomic context or not.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarDennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.orgSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28307d93
    • Waiman Long's avatar
      mm/slub: fix incorrect interpretation of s->offset · cbfc35a4
      Waiman Long authored
      In a couple of places in the slub memory allocator, the code uses
      "s->offset" as a check to see if the free pointer is put right after the
      object.  That check is no longer true with commit 3202fa62 ("slub:
      relocate freelist pointer to middle of object").
      
      As a result, echoing "1" into the validate sysfs file, e.g.  of dentry,
      may cause a bunch of "Freepointer corrupt" error reports like the
      following to appear with the system in panic afterwards.
      
        =============================================================================
        BUG dentry(666:pmcd.service) (Tainted: G    B): Freepointer corrupt
        -----------------------------------------------------------------------------
      
      To fix it, use the check "s->offset == s->inuse" in the new helper
      function freeptr_outside_object() instead.  Also add another helper
      function get_info_end() to return the end of info block (inuse + free
      pointer if not overlapping with object).
      
      Fixes: 3202fa62 ("slub: relocate freelist pointer to middle of object")
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Vitaly Nikolenko <vnik@duasynt.com>
      Cc: Silvio Cesare <silvio.cesare@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Markus Elfring <Markus.Elfring@web.de>
      Cc: Changbin Du <changbin.du@gmail.com>
      Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cbfc35a4
    • Aymeric Agon-Rambosson's avatar
      scripts/gdb: repair rb_first() and rb_last() · 50e36be1
      Aymeric Agon-Rambosson authored
      The current implementations of the rb_first() and rb_last() gdb
      functions have a variable that references itself in its instanciation,
      which causes the function to throw an error if a specific condition on
      the argument is met.  The original author rather intended to reference
      the argument and made a typo.  Referring the argument instead makes the
      function work as intended.
      Signed-off-by: default avatarAymeric Agon-Rambosson <aymeric.agon@yandex.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarStephen Boyd <swboyd@chromium.org>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Kieran Bingham <kbingham@kernel.org>
      Cc: Douglas Anderson <dianders@chromium.org>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: Jackie Liu <liuyun01@kylinos.cn>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Link: http://lkml.kernel.org/r/20200427051029.354840-1-aymeric.agon@yandex.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50e36be1
    • Khazhismel Kumykov's avatar
      eventpoll: fix missing wakeup for ovflist in ep_poll_callback · 0c54a6a4
      Khazhismel Kumykov authored
      In the event that we add to ovflist, before commit 339ddb53
      ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
      woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
      
      With that wakeup removed, if we add to ovflist here, we may never wake
      up.  Rather than adding back the ep_scan_ready_list wakeup - which was
      resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
      
      We noticed that one of our workloads was missing wakeups starting with
      339ddb53 and upon manual inspection, this wakeup seemed missing to me.
      With this patch added, we no longer see missing wakeups.  I haven't yet
      tried to make a small reproducer, but the existing kselftests in
      filesystem/epoll passed for me with this patch.
      
      [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
        Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
      Fixes: 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested epoll")
      Signed-off-by: default avatarKhazhismel Kumykov <khazhy@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c54a6a4
    • Janakarajan Natarajan's avatar
      arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory() · 996ed22c
      Janakarajan Natarajan authored
      When trying to lock read-only pages, sev_pin_memory() fails because
      FOLL_WRITE is used as the flag for get_user_pages_fast().
      
      Commit 73b0140b ("mm/gup: change GUP fast to use flags rather than a
      write 'bool'") updated the get_user_pages_fast() call sites to use
      flags, but incorrectly updated the call in sev_pin_memory().  As the
      original coding of this call was correct, revert the change made by that
      commit.
      
      Fixes: 73b0140b ("mm/gup: change GUP fast to use flags rather than a write 'bool'")
      Signed-off-by: default avatarJanakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      996ed22c
    • Ivan Delalande's avatar
      scripts/decodecode: fix trapping instruction formatting · e08df079
      Ivan Delalande authored
      If the trapping instruction contains a ':', for a memory access through
      segment registers for example, the sed substitution will insert the '*'
      marker in the middle of the instruction instead of the line address:
      
      	2b:   65 48 0f c7 0f          cmpxchg16b %gs:*(%rdi)          <-- trapping instruction
      
      I started to think I had forgotten some quirk of the assembly syntax
      before noticing that it was actually coming from the script.  Fix it to
      add the address marker at the right place for these instructions:
      
      	28:   49 8b 06                mov    (%r14),%rax
      	2b:*  65 48 0f c7 0f          cmpxchg16b %gs:(%rdi)           <-- trapping instruction
      	30:   0f 94 c0                sete   %al
      
      Fixes: 18ff44b1 ("scripts/decodecode: make faulting insn ptr more robust")
      Signed-off-by: default avatarIvan Delalande <colona@arista.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/20200419223653.GA31248@visorSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e08df079
    • Maciej Grochowski's avatar
    • David Hildenbrand's avatar
      mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() · e84fe99b
      David Hildenbrand authored
      Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
      e.g., while booting up.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
        Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
        RIP: __pageblock_pfn_to_page+0x134/0x1c0
        Call Trace:
         set_zone_contiguous+0x56/0x70
         page_alloc_init_late+0x166/0x176
         kernel_init_freeable+0xfa/0x255
         kernel_init+0xa/0x106
         ret_from_fork+0x35/0x40
      
      The issue becomes visible when having a lot of memory (e.g., 4TB)
      assigned to a single NUMA node - a system that can easily be created
      using QEMU.  Inside VMs on a hypervisor with quite some memory
      overcommit, this is fairly easy to trigger.
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: default avatarPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: default avatarBaoquan He <bhe@redhat.com>
      Reviewed-by: default avatarShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e84fe99b
    • Yafang Shao's avatar
      mm, memcg: fix error return value of mem_cgroup_css_alloc() · 11d67612
      Yafang Shao authored
      When I run my memcg testcase which creates lots of memcgs, I found
      there're unexpected out of memory logs while there're still enough
      available free memory.  The error log is
      
        mkdir: cannot create directory 'foo.65533': Cannot allocate memory
      
      The reason is when we try to create more than MEM_CGROUP_ID_MAX memcgs,
      an -ENOMEM errno will be set by mem_cgroup_css_alloc(), but the right
      errno should be -ENOSPC "No space left on device", which is an
      appropriate errno for userspace's failed mkdir.
      
      As the errno really misled me, we should make it right.  After this
      patch, the error log will be
      
        mkdir: cannot create directory 'foo.65533': No space left on device
      
      [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
      [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
      Fixes: 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
      Suggested-by: default avatarMatthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11d67612
    • Oleg Nesterov's avatar
      ipc/mqueue.c: change __do_notify() to bypass check_kill_permission() · b5f20061
      Oleg Nesterov authored
      Commit cc731525 ("signal: Remove kernel interal si_code magic")
      changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
      longer works if the sender doesn't have rights to send a signal.
      
      Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
      to avoid check_kill_permission().
      
      This needs the additional notify.sigev_signo != 0 check, shouldn't we
      change do_mq_notify() to deny sigev_signo == 0 ?
      
      Test-case:
      
      	#include <signal.h>
      	#include <mqueue.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	static int notified;
      
      	static void sigh(int sig)
      	{
      		notified = 1;
      	}
      
      	int main(void)
      	{
      		signal(SIGIO, sigh);
      
      		int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
      		assert(fd >= 0);
      
      		struct sigevent se = {
      			.sigev_notify	= SIGEV_SIGNAL,
      			.sigev_signo	= SIGIO,
      		};
      		assert(mq_notify(fd, &se) == 0);
      
      		if (!fork()) {
      			assert(setuid(1) == 0);
      			mq_send(fd, "",1,0);
      			return 0;
      		}
      
      		wait(NULL);
      		mq_unlink("/mq");
      		assert(notified);
      		return 0;
      	}
      
      [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
      Fixes: cc731525 ("signal: Remove kernel interal si_code magic")
      Reported-by: default avatarYoji <yoji.fujihar.min@gmail.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Markus Elfring <elfring@users.sourceforge.net>
      Cc: <1vier1@web.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5f20061
  2. 07 May, 2020 23 commits
    • Linus Torvalds's avatar
      Merge tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 192ffb75
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix bootconfig causing kernels to fail with CONFIG_BLK_DEV_RAM
         enabled
      
       - Fix allocation leaks in bootconfig tool
      
       - Fix a double initialization of a variable
      
       - Fix API bootconfig usage from kprobe boot time events
      
       - Reject NULL location for kprobes
      
       - Fix crash caused by preempt delay module not cleaning up kthread
         correctly
      
       - Add vmalloc_sync_mappings() to prevent x86_64 page faults from
         recursively faulting from tracing page faults
      
       - Fix comment in gpu/trace kerneldoc header
      
       - Fix documentation of how to create a trace event class
      
       - Make the local tracing_snapshot_instance_cond() function static
      
      * tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tools/bootconfig: Fix resource leak in apply_xbc()
        tracing: Make tracing_snapshot_instance_cond() static
        tracing: Fix doc mistakes in trace sample
        gpu/trace: Minor comment updates for gpu_mem_total tracepoint
        tracing: Add a vmalloc_sync_mappings() for safe measure
        tracing: Wait for preempt irq delay thread to finish
        tracing/kprobes: Reject new event if loc is NULL
        tracing/boottime: Fix kprobe event API usage
        tracing/kprobes: Fix a double initialization typo
        bootconfig: Fix to remove bootconfig data from initrd while boot
      192ffb75
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-5.7-rc5' of... · 9ecc4d77
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest fixes from Shuah Khan:
       "ftrace test fixes and a fix to kvm Makefile for relocatable
        native/cross builds and installs"
      
      * tag 'linux-kselftest-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests: fix kvm relocatable native/cross builds and installs
        selftests/ftrace: Make XFAIL green color
        ftrace/selftest: make unresolved cases cause failure if --fail-unresolved set
        ftrace/selftests: workaround cgroup RT scheduling issues
      9ecc4d77
    • Yunfeng Ye's avatar
      tools/bootconfig: Fix resource leak in apply_xbc() · 88426044
      Yunfeng Ye authored
      Fix the @data and @fd allocations that are leaked in the error path of
      apply_xbc().
      
      Link: http://lkml.kernel.org/r/583a49c9-c27a-931d-e6c2-6f63a4b18bea@huawei.comAcked-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarYunfeng Ye <yeyunfeng@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      88426044
    • Zou Wei's avatar
      tracing: Make tracing_snapshot_instance_cond() static · 192b7993
      Zou Wei authored
      Fix the following sparse warning:
      
      kernel/trace/trace.c:950:6: warning: symbol 'tracing_snapshot_instance_cond'
      was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/1587614905-48692-1-git-send-email-zou_wei@huawei.comReported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarZou Wei <zou_wei@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      192b7993
    • Wei Yang's avatar
      tracing: Fix doc mistakes in trace sample · f094a233
      Wei Yang authored
      As the example below shows, DECLARE_EVENT_CLASS() is used instead of
      DEFINE_EVENT_CLASS().
      
      Link: http://lkml.kernel.org/r/20200428214959.11259-1-richard.weiyang@gmail.comSigned-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      f094a233
    • Yiwei Zhang's avatar
      gpu/trace: Minor comment updates for gpu_mem_total tracepoint · 386c82a7
      Yiwei Zhang authored
      This change updates the improper comment for the 'size' attribute in the
      tracepoint definition. Most gfx drivers pre-fault in physical pages
      instead of making virtual allocations. So we drop the 'Virtual' keyword
      here and leave this to the implementations.
      
      Link: http://lkml.kernel.org/r/20200428220825.169606-1-zzyiwei@google.comSigned-off-by: default avatarYiwei Zhang <zzyiwei@google.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      386c82a7
    • Steven Rostedt (VMware)'s avatar
      tracing: Add a vmalloc_sync_mappings() for safe measure · 11f5efc3
      Steven Rostedt (VMware) authored
      x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu
      areas can be complex, to say the least. Mappings may happen at boot up, and
      if nothing synchronizes the page tables, those page mappings may not be
      synced till they are used. This causes issues for anything that might touch
      one of those mappings in the path of the page fault handler. When one of
      those unmapped mappings is touched in the page fault handler, it will cause
      another page fault, which in turn will cause a page fault, and leave us in
      a loop of page faults.
      
      Commit 763802b5 ("x86/mm: split vmalloc_sync_all()") split
      vmalloc_sync_all() into vmalloc_sync_unmappings() and
      vmalloc_sync_mappings(), as on system exit, it did not need to do a full
      sync on x86_64 (although it still needed to be done on x86_32). By chance,
      the vmalloc_sync_all() would synchronize the page mappings done at boot up
      and prevent the per cpu area from being a problem for tracing in the page
      fault handler. But when that synchronization in the exit of a task became a
      nop, it caused the problem to appear.
      
      Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home
      
      Cc: stable@vger.kernel.org
      Fixes: 737223fb ("tracing: Consolidate buffer allocation code")
      Reported-by: default avatar"Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
      Suggested-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      11f5efc3
    • Steven Rostedt (VMware)'s avatar
      tracing: Wait for preempt irq delay thread to finish · d16a8c31
      Steven Rostedt (VMware) authored
      Running on a slower machine, it is possible that the preempt delay kernel
      thread may still be executing if the module was immediately removed after
      added, and this can cause the kernel to crash as the kernel thread might be
      executing after its code has been removed.
      
      There's no reason that the caller of the code shouldn't just wait for the
      delay thread to finish, as the thread can also be created by a trigger in
      the sysfs code, which also has the same issues.
      
      Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com
      
      Cc: stable@vger.kernel.org
      Fixes: 79393723 ("lib: Add module for testing preemptoff/irqsoff latency tracers")
      Reported-by: default avatarXiao Yang <yangx.jy@cn.fujitsu.com>
      Reviewed-by: default avatarXiao Yang <yangx.jy@cn.fujitsu.com>
      Reviewed-by: default avatarJoel Fernandes <joel@joelfernandes.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      d16a8c31
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 6e7f2eac
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Avoid potential NULL dereference in huge_pte_alloc() on pmd_alloc()
        failure"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: hugetlb: avoid potential NULL dereference
      6e7f2eac
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 8c16ec94
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "Bugfixes, mostly for ARM and AMD, and more documentation.
      
        Slightly bigger than usual because I couldn't send out what was
        pending for rc4, but there is nothing worrisome going on. I have more
        fixes pending for guest debugging support (gdbstub) but I will send
        them next week"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
        KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properly
        KVM: selftests: Fix build for evmcs.h
        kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bits
        KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path
        docs/virt/kvm: Document configuring and running nested guests
        KVM: s390: Remove false WARN_ON_ONCE for the PQAP instruction
        kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts
        KVM: x86: Fixes posted interrupt check for IRQs delivery modes
        KVM: SVM: fill in kvm_run->debug.arch.dr[67]
        KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warning
        KVM: arm64: Fix 32bit PC wrap-around
        KVM: arm64: vgic-v4: Initialize GICv4.1 even in the absence of a virtual ITS
        KVM: arm64: Save/restore sp_el0 as part of __guest_enter
        KVM: arm64: Delete duplicated label in invalid_vector
        KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()
        KVM: arm64: vgic-v3: Retire all pending LPIs on vcpu destroy
        KVM: arm: vgic-v2: Only use the virtual state when userspace accesses pending bits
        KVM: arm: vgic: Only use the virtual state when userspace accesses enable bits
        KVM: arm: vgic: Synchronize the whole guest on GIC{D,R}_I{S,C}ACTIVER read
        KVM: arm64: PSCI: Forbid 64bit functions for 32bit guests
        ...
      8c16ec94
    • Linus Torvalds's avatar
      Merge tag 'configfs-for-5.7' of git://git.infradead.org/users/hch/configfs · de268ccb
      Linus Torvalds authored
      Pull configfs fix from Christoph Hellwig:
       "Fix a refcount leak in configfs_rmdir (Xiyu Yang)"
      
      * tag 'configfs-for-5.7' of git://git.infradead.org/users/hch/configfs:
        configfs: fix config_item refcnt leak in configfs_rmdir()
      de268ccb
    • Mark Rutland's avatar
      arm64: hugetlb: avoid potential NULL dereference · 027d0c71
      Mark Rutland authored
      The static analyzer in GCC 10 spotted that in huge_pte_alloc() we may
      pass a NULL pmdp into pte_alloc_map() when pmd_alloc() returns NULL:
      
      |   CC      arch/arm64/mm/pageattr.o
      |   CC      arch/arm64/mm/hugetlbpage.o
      |                  from arch/arm64/mm/hugetlbpage.c:10:
      | arch/arm64/mm/hugetlbpage.c: In function ‘huge_pte_alloc’:
      | ./arch/arm64/include/asm/pgtable-types.h:28:24: warning: dereference of NULL ‘pmdp’ [CWE-690] [-Wanalyzer-null-dereference]
      | ./arch/arm64/include/asm/pgtable.h:436:26: note: in expansion of macro ‘pmd_val’
      | arch/arm64/mm/hugetlbpage.c:242:10: note: in expansion of macro ‘pte_alloc_map’
      |     |arch/arm64/mm/hugetlbpage.c:232:10:
      |     |./arch/arm64/include/asm/pgtable-types.h:28:24:
      | ./arch/arm64/include/asm/pgtable.h:436:26: note: in expansion of macro ‘pmd_val’
      | arch/arm64/mm/hugetlbpage.c:242:10: note: in expansion of macro ‘pte_alloc_map’
      
      This can only occur when the kernel cannot allocate a page, and so is
      unlikely to happen in practice before other systems start failing.
      
      We can avoid this by bailing out if pmd_alloc() fails, as we do earlier
      in the function if pud_alloc() fails.
      
      Fixes: 66b3923a ("arm64: hugetlb: add support for PTE contiguous bit")
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Reported-by: default avatarKyrill Tkachov <kyrylo.tkachov@arm.com>
      Cc: <stable@vger.kernel.org> # 4.5.x-
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      027d0c71
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · a811c1fa
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix reference count leaks in various parts of batman-adv, from Xiyu
          Yang.
      
       2) Update NAT checksum even when it is zero, from Guillaume Nault.
      
       3) sk_psock reference count leak in tls code, also from Xiyu Yang.
      
       4) Sanity check TCA_FQ_CODEL_DROP_BATCH_SIZE netlink attribute in
          fq_codel, from Eric Dumazet.
      
       5) Fix panic in choke_reset(), also from Eric Dumazet.
      
       6) Fix VLAN accel handling in bnxt_fix_features(), from Michael Chan.
      
       7) Disallow out of range quantum values in sch_sfq, from Eric Dumazet.
      
       8) Fix crash in x25_disconnect(), from Yue Haibing.
      
       9) Don't pass pointer to local variable back to the caller in
          nf_osf_hdr_ctx_init(), from Arnd Bergmann.
      
      10) Wireguard should use the ECN decap helper functions, from Toke
          Høiland-Jørgensen.
      
      11) Fix command entry leak in mlx5 driver, from Moshe Shemesh.
      
      12) Fix uninitialized variable access in mptcp's
          subflow_syn_recv_sock(), from Paolo Abeni.
      
      13) Fix unnecessary out-of-order ingress frame ordering in macsec, from
          Scott Dial.
      
      14) IPv6 needs to use a global serial number for dst validation just
          like ipv4, from David Ahern.
      
      15) Fix up PTP_1588_CLOCK deps, from Clay McClure.
      
      16) Missing NLM_F_MULTI flag in gtp driver netlink messages, from
          Yoshiyuki Kurauchi.
      
      17) Fix a regression in that dsa user port errors should not be fatal,
          from Florian Fainelli.
      
      18) Fix iomap leak in enetc driver, from Dejin Zheng.
      
      19) Fix use after free in lec_arp_clear_vccs(), from Cong Wang.
      
      20) Initialize protocol value earlier in neigh code paths when
          generating events, from Roman Mashak.
      
      21) netdev_update_features() must be called with RTNL mutex in macsec
          driver, from Antoine Tenart.
      
      22) Validate untrusted GSO packets even more strictly, from Willem de
          Bruijn.
      
      23) Wireguard decrypt worker needs a cond_resched(), from Jason
          Donenfeld.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
        net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE
        MAINTAINERS: put DYNAMIC INTERRUPT MODERATION in proper order
        wireguard: send/receive: use explicit unlikely branch instead of implicit coalescing
        wireguard: selftests: initalize ipv6 members to NULL to squelch clang warning
        wireguard: send/receive: cond_resched() when processing worker ringbuffers
        wireguard: socket: remove errant restriction on looping to self
        wireguard: selftests: use normal kernel stack size on ppc64
        net: ethernet: ti: am65-cpsw-nuss: fix irqs type
        ionic: Use debugfs_create_bool() to export bool
        net: dsa: Do not leave DSA master with NULL netdev_ops
        net: dsa: remove duplicate assignment in dsa_slave_add_cls_matchall_mirred
        net: stricter validation of untrusted gso packets
        seg6: fix SRH processing to comply with RFC8754
        net: mscc: ocelot: ANA_AUTOAGE_AGE_PERIOD holds a value in seconds, not ms
        net: dsa: ocelot: the MAC table on Felix is twice as large
        net: dsa: sja1105: the PTP_CLK extts input reacts on both edges
        selftests: net: tcp_mmap: fix SO_RCVLOWAT setting
        net: hsr: fix incorrect type usage for protocol variable
        net: macsec: fix rtnl locking issue
        net: mvpp2: cls: Prevent buffer overflow in mvpp2_ethtool_cls_rule_del()
        ...
      a811c1fa
    • Pablo Neira Ayuso's avatar
      net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE · 16f80360
      Pablo Neira Ayuso authored
      This patch adds FLOW_ACTION_HW_STATS_DONT_CARE which tells the driver
      that the frontend does not need counters, this hw stats type request
      never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests
      the driver to disable the stats, however, if the driver cannot disable
      counters, it bails out.
      
      TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
      except by disabled which is mapped to FLOW_ACTION_HW_STATS_DISABLED
      (this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
      TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.
      
      Fixes: 319a1d19 ("flow_offload: check for basic action hw stats type")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      16f80360
    • Lukas Bulwahn's avatar
      MAINTAINERS: put DYNAMIC INTERRUPT MODERATION in proper order · b0956956
      Lukas Bulwahn authored
      Commit 9b038086 ("docs: networking: convert DIM to RST") added a new
      file entry to DYNAMIC INTERRUPT MODERATION to the end, and not following
      alphabetical order.
      
      So, ./scripts/checkpatch.pl -f MAINTAINERS complains:
      
        WARNING: Misordered MAINTAINERS entry - list file patterns in alphabetic
        order
        #5966: FILE: MAINTAINERS:5966:
        +F:      lib/dim/
        +F:      Documentation/networking/net_dim.rst
      
      Reorder the file entries to keep MAINTAINERS nicely ordered.
      Signed-off-by: default avatarLukas Bulwahn <lukas.bulwahn@gmail.com>
      Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b0956956
    • David S. Miller's avatar
      Merge branch 'wireguard-fixes' · d3f3e6ac
      David S. Miller authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 5.7-rc5
      
      With Ubuntu and Debian having backported this into their kernels, we're
      finally seeing testing from places we hadn't seen prior, which is nice.
      With that comes more fixes:
      
      1) The CI for PPC64 was running with extremely small stacks for 64-bit,
         causing spurious crashes in surprising places.
      
      2) There's was an old leftover routing loop restriction, which no longer
         makes sense given the queueing architecture, and was causing problems
         for people who really did want nested routing.
      
      3) Not yielding our kthread on CONFIG_PREEMPT_VOLUNTARY systems caused
         RCU stalls and other issues, reported by Wang Jian, with the fix
         suggested by Sultan Alsawaf.
      
      4) Clang spewed warnings in a selftest for CONFIG_IPV6=n, reported by
         Arnd Bergmann.
      
      5) A complicated if statement was simplified to an assignment while also
         making the likely/unlikely hinting more correct and simple, and
         increasing readability, suggested by Sultan.
      
      Patches (2) and (3) have Fixes: lines and are probably good candidates
      for stable.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d3f3e6ac
    • Jason A. Donenfeld's avatar
      wireguard: send/receive: use explicit unlikely branch instead of implicit coalescing · 243f2148
      Jason A. Donenfeld authored
      It's very unlikely that send will become true. It's nearly always false
      between 0 and 120 seconds of a session, and in most cases becomes true
      only between 120 and 121 seconds before becoming false again. So,
      unlikely(send) is clearly the right option here.
      
      What happened before was that we had this complex boolean expression
      with multiple likely and unlikely clauses nested. Since this is
      evaluated left-to-right anyway, the whole thing got converted to
      unlikely. So, we can clean this up to better represent what's going on.
      
      The generated code is the same.
      Suggested-by: default avatarSultan Alsawaf <sultan@kerneltoast.com>
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      243f2148
    • Jason A. Donenfeld's avatar
      wireguard: selftests: initalize ipv6 members to NULL to squelch clang warning · 4fed818e
      Jason A. Donenfeld authored
      Without setting these to NULL, clang complains in certain
      configurations that have CONFIG_IPV6=n:
      
      In file included from drivers/net/wireguard/ratelimiter.c:223:
      drivers/net/wireguard/selftest/ratelimiter.c:173:34: error: variable 'skb6' is uninitialized when used here [-Werror,-Wuninitialized]
                      ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count);
                                                     ^~~~
      drivers/net/wireguard/selftest/ratelimiter.c:123:29: note: initialize the variable 'skb6' to silence this warning
              struct sk_buff *skb4, *skb6;
                                         ^
                                          = NULL
      drivers/net/wireguard/selftest/ratelimiter.c:173:40: error: variable 'hdr6' is uninitialized when used here [-Werror,-Wuninitialized]
                      ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count);
                                                           ^~~~
      drivers/net/wireguard/selftest/ratelimiter.c:125:22: note: initialize the variable 'hdr6' to silence this warning
              struct ipv6hdr *hdr6;
                                  ^
      
      We silence this warning by setting the variables to NULL as the warning
      suggests.
      Reported-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4fed818e
    • Jason A. Donenfeld's avatar
      wireguard: send/receive: cond_resched() when processing worker ringbuffers · 4005f5c3
      Jason A. Donenfeld authored
      Users with pathological hardware reported CPU stalls on CONFIG_
      PREEMPT_VOLUNTARY=y, because the ringbuffers would stay full, meaning
      these workers would never terminate. That turned out not to be okay on
      systems without forced preemption, which Sultan observed. This commit
      adds a cond_resched() to the bottom of each loop iteration, so that
      these workers don't hog the core. Note that we don't need this on the
      napi poll worker, since that terminates after its budget is expended.
      Suggested-by: default avatarSultan Alsawaf <sultan@kerneltoast.com>
      Reported-by: default avatarWang Jian <larkwang@gmail.com>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4005f5c3
    • Jason A. Donenfeld's avatar
      wireguard: socket: remove errant restriction on looping to self · b673e24a
      Jason A. Donenfeld authored
      It's already possible to create two different interfaces and loop
      packets between them. This has always been possible with tunnels in the
      kernel, and isn't specific to wireguard. Therefore, the networking stack
      already needs to deal with that. At the very least, the packet winds up
      exceeding the MTU and is discarded at that point. So, since this is
      already something that happens, there's no need to forbid the not very
      exceptional case of routing a packet back to the same interface; this
      loop is no different than others, and we shouldn't special case it, but
      rather rely on generic handling of loops in general. This also makes it
      easier to do interesting things with wireguard such as onion routing.
      
      At the same time, we add a selftest for this, ensuring that both onion
      routing works and infinite routing loops do not crash the kernel. We
      also add a test case for wireguard interfaces nesting packets and
      sending traffic between each other, as well as the loop in this case
      too. We make sure to send some throughput-heavy traffic for this use
      case, to stress out any possible recursion issues with the locks around
      workqueues.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b673e24a
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use normal kernel stack size on ppc64 · a0fd7cc8
      Jason A. Donenfeld authored
      While at some point it might have made sense to be running these tests
      on ppc64 with 4k stacks, the kernel hasn't actually used 4k stacks on
      64-bit powerpc in a long time, and more interesting things that we test
      don't really work when we deviate from the default (16k). So, we stop
      pushing our luck in this commit, and return to the default instead of
      the minimum.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a0fd7cc8
    • Grygorii Strashko's avatar
      net: ethernet: ti: am65-cpsw-nuss: fix irqs type · 6f5c27f9
      Grygorii Strashko authored
      The K3 INTA driver, which is source TX/RX IRQs for CPSW NUSS, defines IRQs
      triggering type as EDGE by default, but triggering type for CPSW NUSS TX/RX
      IRQs has to be LEVEL as the EDGE triggering type may cause unnecessary IRQs
      triggering and NAPI scheduling for empty queues. It was discovered with
      RT-kernel.
      
      Fix it by explicitly specifying CPSW NUSS TX/RX IRQ type as
      IRQF_TRIGGER_HIGH.
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6f5c27f9
    • Geert Uytterhoeven's avatar
      ionic: Use debugfs_create_bool() to export bool · 0735ccc9
      Geert Uytterhoeven authored
      Currently bool ionic_cq.done_color is exported using
      debugfs_create_u8(), which requires a cast, preventing further compiler
      checks.
      
      Fix this by switching to debugfs_create_bool(), and dropping the cast.
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Acked-by: default avatarShannon Nelson <snelson@pensando.io>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0735ccc9