1. 08 May, 2020 10 commits
    • percpu: make pcpu_alloc() aware of current gfp context · 28307d93
      Filipe Manana authored
      Since 5.7-rc1, on btrfs we have a percpu counter initialization for
      which we always pass a GFP_KERNEL gfp_t argument (this happens since
      commit 2992df73 ("btrfs: Implement DREW lock")).
      
      That is safe in some contexts but not in others where allowing fs
      reclaim could lead to a deadlock because we are either holding some
      btrfs lock needed for a transaction commit or holding a btrfs
      transaction handle open.  Because of that we surround the call to the
      function that initializes the percpu counter with a NOFS context using
      memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
      
      However it turns out that this is not enough to prevent a possible
      deadlock because pcpu_alloc() determines if it is in an atomic context
      by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
      case) and it is not aware that a NOFS context is set.
      
      Because pcpu_alloc() thinks it is in a non-atomic context it locks the
      pcpu_alloc_mutex.  This can result in a btrfs deadlock when
      pcpu_balance_workfn() is running, has acquired that mutex and is waiting
      for reclaim, while the btrfs task that called percpu_counter_init() (and
      therefore pcpu_alloc()) is holding either the btrfs commit_root
      semaphore or a transaction handle (done in fs/btrfs/backref.c:
      iterate_extent_inodes()), which prevents reclaim from finishing as an
      attempt to commit the current btrfs transaction will deadlock.
      
      Lockdep reports this issue with the following trace:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.6.0-rc7-btrfs-next-77 #1 Not tainted
        ------------------------------------------------------
        kswapd0/91 is trying to acquire lock:
        ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}:
               fs_reclaim_acquire.part.0+0x25/0x30
               __kmalloc+0x5f/0x3a0
               pcpu_create_chunk+0x19/0x230
               pcpu_balance_workfn+0x56a/0x680
               process_one_work+0x235/0x5f0
               worker_thread+0x50/0x3b0
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        -> #3 (pcpu_alloc_mutex){+.+.}:
               __mutex_lock+0xa9/0xaf0
               pcpu_alloc+0x480/0x7c0
               __percpu_counter_init+0x50/0xd0
               btrfs_drew_lock_init+0x22/0x70 [btrfs]
               btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
               resolve_indirect_refs+0x120/0xa30 [btrfs]
               find_parent_nodes+0x50b/0xf30 [btrfs]
               btrfs_find_all_leafs+0x60/0xb0 [btrfs]
               iterate_extent_inodes+0x139/0x2f0 [btrfs]
               iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
               btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
               btrfs_ioctl+0x165a/0x3130 [btrfs]
               ksys_ioctl+0x87/0xc0
               __x64_sys_ioctl+0x16/0x20
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&fs_info->commit_root_sem){++++}:
               down_write+0x38/0x70
               btrfs_cache_block_group+0x2ec/0x500 [btrfs]
               find_free_extent+0xc6a/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               commit_cowonly_roots+0x55/0x310 [btrfs]
               btrfs_commit_transaction+0x509/0xb20 [btrfs]
               sync_filesystem+0x74/0x90
               generic_shutdown_super+0x22/0x100
               kill_anon_super+0x14/0x30
               btrfs_kill_super+0x12/0x20 [btrfs]
               deactivate_locked_super+0x31/0x70
               cleanup_mnt+0x100/0x160
               task_work_run+0x93/0xc0
               exit_to_usermode_loop+0xf9/0x100
               do_syscall_64+0x20d/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&space_info->groups_sem){++++}:
               down_read+0x3c/0x140
               find_free_extent+0xef6/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               btrfs_search_slot+0x50c/0xd60 [btrfs]
               btrfs_lookup_inode+0x3a/0xc0 [btrfs]
               __btrfs_update_delayed_inode+0x90/0x280 [btrfs]
               __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
               __btrfs_run_delayed_items+0x8e/0x180 [btrfs]
               btrfs_commit_transaction+0x31b/0xb20 [btrfs]
               iterate_supers+0x87/0xf0
               ksys_sync+0x60/0xb0
               __ia32_sys_sync+0xa/0x10
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&delayed_node->mutex){+.+.}:
               __lock_acquire+0xef0/0x1c80
               lock_acquire+0xa2/0x1d0
               __mutex_lock+0xa9/0xaf0
               __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
               btrfs_evict_inode+0x40d/0x560 [btrfs]
               evict+0xd9/0x1c0
               dispose_list+0x48/0x70
               prune_icache_sb+0x54/0x80
               super_cache_scan+0x124/0x1a0
               do_shrink_slab+0x176/0x440
               shrink_slab+0x23a/0x2c0
               shrink_node+0x188/0x6e0
               balance_pgdat+0x31d/0x7f0
               kswapd+0x238/0x550
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(fs_reclaim);
                                       lock(pcpu_alloc_mutex);
                                       lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/91:
         #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
         #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
         #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8f/0xd0
         check_noncircular+0x170/0x190
         __lock_acquire+0xef0/0x1c80
         lock_acquire+0xa2/0x1d0
         __mutex_lock+0xa9/0xaf0
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x40d/0x560 [btrfs]
         evict+0xd9/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x124/0x1a0
         do_shrink_slab+0x176/0x440
         shrink_slab+0x23a/0x2c0
         shrink_node+0x188/0x6e0
         balance_pgdat+0x31d/0x7f0
         kswapd+0x238/0x550
         kthread+0x120/0x140
         ret_from_fork+0x3a/0x50
      
      This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
      to percpu_counter_init() in contexts where it is not reclaim safe,
      however that type of approach is discouraged since
      memalloc_[nofs|noio]_save() were introduced.  Therefore this change
      makes pcpu_alloc() take an existing nofs/noio context into account
      before deciding whether it is in an atomic context or not.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
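The decision described in the percpu commit above can be modelled in userspace roughly as follows. This is only a sketch: the `sketch_*` names, the flag values and the `struct sketch_task` scope flags are hypothetical stand-ins for the kernel's gfp bits and for the per-task state set by memalloc_nofs_save() and memalloc_noio_save(); it is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits only; the kernel's real gfp_t values differ. */
#define SKETCH_GFP_IO     0x1u
#define SKETCH_GFP_FS     0x2u
#define SKETCH_GFP_KERNEL (SKETCH_GFP_IO | SKETCH_GFP_FS)

/* Hypothetical stand-in for the per-task NOFS/NOIO scope state. */
struct sketch_task {
	bool nofs_scope;
	bool noio_scope;
};

/* Strip the flags that the task's current scope forbids. */
static unsigned int sketch_gfp_context(const struct sketch_task *t,
				       unsigned int gfp)
{
	assert(t != NULL);
	if (t->noio_scope)
		gfp &= ~(SKETCH_GFP_IO | SKETCH_GFP_FS);
	else if (t->nofs_scope)
		gfp &= ~SKETCH_GFP_FS;
	return gfp;
}

/* Before the change: only the caller-supplied flags are examined. */
static bool sketch_is_atomic_before(unsigned int gfp)
{
	return (gfp & SKETCH_GFP_KERNEL) != SKETCH_GFP_KERNEL;
}

/* After the change: the scope is applied first, so a GFP_KERNEL request
 * made inside a NOFS scope is no longer treated as non-atomic. */
static bool sketch_is_atomic_after(const struct sketch_task *t,
				   unsigned int gfp)
{
	return sketch_is_atomic_before(sketch_gfp_context(t, gfp));
}
```

In this model, a caller inside a NOFS scope that passes SKETCH_GFP_KERNEL is classified as atomic only by the second variant, which is the case where the mutex would otherwise be taken.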
    • mm/slub: fix incorrect interpretation of s->offset · cbfc35a4
      Waiman Long authored
      In a couple of places in the slub memory allocator, the code uses
      "s->offset" as a check to see if the free pointer is put right after the
      object.  That check is no longer true with commit 3202fa62 ("slub:
      relocate freelist pointer to middle of object").
      
      As a result, echoing "1" into the validate sysfs file, e.g.  of dentry,
      may cause a bunch of "Freepointer corrupt" error reports like the
      following to appear with the system in panic afterwards.
      
        =============================================================================
        BUG dentry(666:pmcd.service) (Tainted: G    B): Freepointer corrupt
        -----------------------------------------------------------------------------
      
      To fix it, use the check "s->offset == s->inuse" in the new helper
      function freeptr_outside_object() instead.  Also add another helper
      function get_info_end() to return the end of info block (inuse + free
      pointer if not overlapping with object).
      
      Fixes: 3202fa62 ("slub: relocate freelist pointer to middle of object")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Vitaly Nikolenko <vnik@duasynt.com>
      Cc: Silvio Cesare <silvio.cesare@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Markus Elfring <Markus.Elfring@web.de>
      Cc: Changbin Du <changbin.du@gmail.com>
      Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
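The two helpers named in the slub commit above might look roughly like the following userspace sketch. `struct sketch_cache` is a hypothetical reduction of the cache descriptor to the two fields discussed, and the `sketch_*` function bodies are an assumption based only on the description in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the cache descriptor. */
struct sketch_cache {
	unsigned int inuse;   /* end of the object data, in bytes */
	unsigned int offset;  /* where the free pointer is stored */
};

/* Old test: any non-zero offset was read as "free pointer follows
 * the object", which is wrong once the pointer sits mid-object. */
static bool sketch_old_check(const struct sketch_cache *s)
{
	return s->offset != 0;
}

/* New test: the pointer is outside only if it starts where the
 * object data ends. */
static bool sketch_freeptr_outside_object(const struct sketch_cache *s)
{
	assert(s != NULL);
	return s->offset == s->inuse;
}

/* End of the info block: object data plus the free pointer, unless
 * the pointer overlaps the object. */
static unsigned int sketch_get_info_end(const struct sketch_cache *s)
{
	if (sketch_freeptr_outside_object(s))
		return s->inuse + (unsigned int)sizeof(void *);
	return s->inuse;
}
```

With a free pointer in the middle of a 64-byte object (offset 32), the old check still reports "outside", while the new helper does not.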
    • scripts/gdb: repair rb_first() and rb_last() · 50e36be1
      Aymeric Agon-Rambosson authored
      The current implementations of the rb_first() and rb_last() gdb
      functions have a variable that references itself in its instantiation,
      which causes the function to throw an error if a specific condition on
      the argument is met.  The original author evidently intended to
      reference the argument and made a typo.  Referring to the argument
      instead makes the function work as intended.

      Signed-off-by: Aymeric Agon-Rambosson <aymeric.agon@yandex.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Stephen Boyd <swboyd@chromium.org>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Kieran Bingham <kbingham@kernel.org>
      Cc: Douglas Anderson <dianders@chromium.org>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: Jackie Liu <liuyun01@kylinos.cn>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Link: http://lkml.kernel.org/r/20200427051029.354840-1-aymeric.agon@yandex.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • eventpoll: fix missing wakeup for ovflist in ep_poll_callback · 0c54a6a4
      Khazhismel Kumykov authored
      In the event that we add to ovflist, before commit 339ddb53
      ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
      woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
      
      With that wakeup removed, if we add to ovflist here, we may never wake
      up.  Rather than adding back the ep_scan_ready_list wakeup (which was
      resulting in unnecessary wakeups), trigger a wake-up in ep_poll_callback.
      
      We noticed that one of our workloads was missing wakeups starting with
      339ddb53 and upon manual inspection, this wakeup seemed missing to me.
      With this patch added, we no longer see missing wakeups.  I haven't yet
      tried to make a small reproducer, but the existing kselftests in
      filesystem/epoll passed for me with this patch.
      
      [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
        Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
      Fixes: 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested epoll")
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory() · 996ed22c
      Janakarajan Natarajan authored
      When trying to lock read-only pages, sev_pin_memory() fails because
      FOLL_WRITE is used as the flag for get_user_pages_fast().
      
      Commit 73b0140b ("mm/gup: change GUP fast to use flags rather than a
      write 'bool'") updated the get_user_pages_fast() call sites to use
      flags, but incorrectly updated the call in sev_pin_memory().  As the
      original coding of this call was correct, revert the change made by that
      commit.
      
      Fixes: 73b0140b ("mm/gup: change GUP fast to use flags rather than a write 'bool'")
      Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
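The failure mode in the sev_pin_memory() commit above comes down to always requesting write access, even for read-only pages. A minimal userspace model of that idea, assuming a hypothetical flag value and hypothetical `sketch_*` helpers (none of this is the kernel's get_user_pages_fast() API), could be:

```c
#include <assert.h>

/* Hypothetical flag value; not the kernel's FOLL_WRITE definition. */
#define SKETCH_FOLL_WRITE 0x01u

/* Derive the flags from the caller's intent instead of hard-coding
 * write access. */
static unsigned int sketch_gup_flags(int write)
{
	return write ? SKETCH_FOLL_WRITE : 0u;
}

/* Toy pinning model: asking for write access to a read-only page
 * fails; otherwise all requested pages are pinned. */
static int sketch_pin(int page_is_writable, unsigned int flags, int npages)
{
	assert(npages > 0);
	if ((flags & SKETCH_FOLL_WRITE) && !page_is_writable)
		return -1;
	return npages;
}
```

In this model, pinning a read-only page fails with the hard-coded write flag and succeeds once the flag follows the caller's `write` argument.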
    • scripts/decodecode: fix trapping instruction formatting · e08df079
      Ivan Delalande authored
      If the trapping instruction contains a ':', for a memory access through
      segment registers for example, the sed substitution will insert the '*'
      marker in the middle of the instruction instead of the line address:
      
      	2b:   65 48 0f c7 0f          cmpxchg16b %gs:*(%rdi)          <-- trapping instruction
      
      I started to think I had forgotten some quirk of the assembly syntax
      before noticing that it was actually coming from the script.  Fix it to
      add the address marker at the right place for these instructions:
      
      	28:   49 8b 06                mov    (%r14),%rax
      	2b:*  65 48 0f c7 0f          cmpxchg16b %gs:(%rdi)           <-- trapping instruction
      	30:   0f 94 c0                sete   %al
      
      Fixes: 18ff44b1 ("scripts/decodecode: make faulting insn ptr more robust")
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/20200419223653.GA31248@visor
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() · e84fe99b
      David Hildenbrand authored
      Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
      e.g., while booting up.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
        Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
        RIP: __pageblock_pfn_to_page+0x134/0x1c0
        Call Trace:
         set_zone_contiguous+0x56/0x70
         page_alloc_init_late+0x166/0x176
         kernel_init_freeable+0xfa/0x255
         kernel_init+0xa/0x106
         ret_from_fork+0x35/0x40
      
      The issue becomes visible when having a lot of memory (e.g., 4TB)
      assigned to a single NUMA node - a system that can easily be created
      using QEMU.  Inside VMs on a hypervisor with quite some memory
      overcommit, this is fairly easy to trigger.

      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: fix error return value of mem_cgroup_css_alloc() · 11d67612
      Yafang Shao authored
      When I ran my memcg testcase, which creates lots of memcgs, I found
      unexpected out of memory logs even though there was still enough
      free memory available.  The error log is
      
        mkdir: cannot create directory 'foo.65533': Cannot allocate memory
      
      The reason is when we try to create more than MEM_CGROUP_ID_MAX memcgs,
      an -ENOMEM errno will be set by mem_cgroup_css_alloc(), but the right
      errno should be -ENOSPC "No space left on device", which is an
      appropriate errno for userspace's failed mkdir.
      
      As the errno really misled me, we should make it right.  After this
      patch, the error log will be
      
        mkdir: cannot create directory 'foo.65533': No space left on device
      
      [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
      Fixes: 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
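The errno choice in the memcg commit above can be illustrated with a toy ID allocator. This is a sketch under assumptions: `SKETCH_ID_MAX` is a tiny stand-in for MEM_CGROUP_ID_MAX and `sketch_alloc_id()` is hypothetical, not the kernel's allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Tiny illustrative limit; stands in for MEM_CGROUP_ID_MAX. */
#define SKETCH_ID_MAX 3

struct sketch_id_pool {
	int next_id;  /* next unused ID, starting at 1 */
};

/* Returns a fresh ID, or -ENOSPC once the ID space is exhausted.
 * Exhausting a bounded ID space is not a memory shortage, so
 * -ENOMEM would mislead the caller. */
static int sketch_alloc_id(struct sketch_id_pool *pool)
{
	assert(pool != NULL && pool->next_id >= 1);
	if (pool->next_id > SKETCH_ID_MAX)
		return -ENOSPC;
	return pool->next_id++;
}
```

A caller that maps the negative return value to errno would then report "No space left on device" for the exhausted case, as in the mkdir output quoted above.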
    • ipc/mqueue.c: change __do_notify() to bypass check_kill_permission() · b5f20061
      Oleg Nesterov authored
      Commit cc731525 ("signal: Remove kernel interal si_code magic")
      changed the value of SI_FROMUSER(SI_MESGQ); this means that mq_notify()
      no longer works if the sender doesn't have rights to send a signal.
      
      Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
      to avoid check_kill_permission().
      
      This needs the additional notify.sigev_signo != 0 check.  Shouldn't we
      change do_mq_notify() to deny sigev_signo == 0?
      
      Test-case:
      
      	#include <signal.h>
      	#include <mqueue.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	static volatile sig_atomic_t notified;
      
      	static void sigh(int sig)
      	{
      		notified = 1;
      	}
      
      	int main(void)
      	{
      		signal(SIGIO, sigh);
      
      		int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
      		assert(fd >= 0);
      
      		struct sigevent se = {
      			.sigev_notify	= SIGEV_SIGNAL,
      			.sigev_signo	= SIGIO,
      		};
      		assert(mq_notify(fd, &se) == 0);
      
      		if (!fork()) {
      			assert(setuid(1) == 0);
      			mq_send(fd, "",1,0);
      			return 0;
      		}
      
      		wait(NULL);
      		mq_unlink("/mq");
      		assert(notified);
      		return 0;
      	}
      
      [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
      Fixes: cc731525 ("signal: Remove kernel interal si_code magic")
      Reported-by: Yoji <yoji.fujihar.min@gmail.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Markus Elfring <elfring@users.sourceforge.net>
      Cc: <1vier1@web.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 May, 2020 18 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · a811c1fa
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix reference count leaks in various parts of batman-adv, from Xiyu
          Yang.
      
       2) Update NAT checksum even when it is zero, from Guillaume Nault.
      
       3) sk_psock reference count leak in tls code, also from Xiyu Yang.
      
       4) Sanity check TCA_FQ_CODEL_DROP_BATCH_SIZE netlink attribute in
          fq_codel, from Eric Dumazet.
      
       5) Fix panic in choke_reset(), also from Eric Dumazet.
      
       6) Fix VLAN accel handling in bnxt_fix_features(), from Michael Chan.
      
       7) Disallow out of range quantum values in sch_sfq, from Eric Dumazet.
      
       8) Fix crash in x25_disconnect(), from Yue Haibing.
      
       9) Don't pass pointer to local variable back to the caller in
          nf_osf_hdr_ctx_init(), from Arnd Bergmann.
      
      10) Wireguard should use the ECN decap helper functions, from Toke
          Høiland-Jørgensen.
      
      11) Fix command entry leak in mlx5 driver, from Moshe Shemesh.
      
      12) Fix uninitialized variable access in mptcp's
          subflow_syn_recv_sock(), from Paolo Abeni.
      
      13) Fix unnecessary out-of-order ingress frame ordering in macsec, from
          Scott Dial.
      
      14) IPv6 needs to use a global serial number for dst validation just
          like ipv4, from David Ahern.
      
      15) Fix up PTP_1588_CLOCK deps, from Clay McClure.
      
      16) Missing NLM_F_MULTI flag in gtp driver netlink messages, from
          Yoshiyuki Kurauchi.
      
      17) Fix a regression in that dsa user port errors should not be fatal,
          from Florian Fainelli.
      
      18) Fix iomap leak in enetc driver, from Dejin Zheng.
      
      19) Fix use after free in lec_arp_clear_vccs(), from Cong Wang.
      
      20) Initialize protocol value earlier in neigh code paths when
          generating events, from Roman Mashak.
      
      21) netdev_update_features() must be called with RTNL mutex in macsec
          driver, from Antoine Tenart.
      
      22) Validate untrusted GSO packets even more strictly, from Willem de
          Bruijn.
      
      23) Wireguard decrypt worker needs a cond_resched(), from Jason
          Donenfeld.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
        net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE
        MAINTAINERS: put DYNAMIC INTERRUPT MODERATION in proper order
        wireguard: send/receive: use explicit unlikely branch instead of implicit coalescing
        wireguard: selftests: initalize ipv6 members to NULL to squelch clang warning
        wireguard: send/receive: cond_resched() when processing worker ringbuffers
        wireguard: socket: remove errant restriction on looping to self
        wireguard: selftests: use normal kernel stack size on ppc64
        net: ethernet: ti: am65-cpsw-nuss: fix irqs type
        ionic: Use debugfs_create_bool() to export bool
        net: dsa: Do not leave DSA master with NULL netdev_ops
        net: dsa: remove duplicate assignment in dsa_slave_add_cls_matchall_mirred
        net: stricter validation of untrusted gso packets
        seg6: fix SRH processing to comply with RFC8754
        net: mscc: ocelot: ANA_AUTOAGE_AGE_PERIOD holds a value in seconds, not ms
        net: dsa: ocelot: the MAC table on Felix is twice as large
        net: dsa: sja1105: the PTP_CLK extts input reacts on both edges
        selftests: net: tcp_mmap: fix SO_RCVLOWAT setting
        net: hsr: fix incorrect type usage for protocol variable
        net: macsec: fix rtnl locking issue
        net: mvpp2: cls: Prevent buffer overflow in mvpp2_ethtool_cls_rule_del()
        ...
    • net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE · 16f80360
      Pablo Neira Ayuso authored
      This patch adds FLOW_ACTION_HW_STATS_DONT_CARE, which tells the driver
      that the frontend does not need counters; this hw stats type request
      never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests
      the driver to disable the stats; however, if the driver cannot disable
      counters, it bails out.
      
      TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
      except for disabled, which is mapped to FLOW_ACTION_HW_STATS_DISABLED
      (this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
      TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.
      
      Fixes: 319a1d19 ("flow_offload: check for basic action hw stats type")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
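The mapping described in the flow_offload commit above could be sketched as below. Only "disabled is 0 in tc" and the otherwise 1:1 mapping come from the message; every numeric value and every `sketch_*` / `SKETCH_*` name here is an assumption made for illustration, not the kernel's definitions.

```c
#include <assert.h>

/* tc side (illustrative values; "disabled" is 0 per the message). */
enum sketch_tca_hw_stats {
	SKETCH_TCA_DISABLED  = 0,
	SKETCH_TCA_IMMEDIATE = 1 << 0,
	SKETCH_TCA_DELAYED   = 1 << 1,
};

/* flow_offload side (illustrative values, assumed for this sketch). */
enum sketch_flow_hw_stats {
	SKETCH_FLOW_DONT_CARE = 0,
	SKETCH_FLOW_IMMEDIATE = 1 << 0,
	SKETCH_FLOW_DELAYED   = 1 << 1,
	SKETCH_FLOW_DISABLED  = 1 << 2,
};

/* Map a tc hw-stats request to the flow_offload one: 1:1, except that
 * tc's "disabled" (0) gets its own explicit bit so that it cannot be
 * confused with "don't care". */
static unsigned int sketch_tc_act_hw_stats(unsigned int tca)
{
	assert((tca & ~(unsigned int)(SKETCH_TCA_IMMEDIATE |
				      SKETCH_TCA_DELAYED)) == 0);
	if (tca == SKETCH_TCA_DISABLED)
		return SKETCH_FLOW_DISABLED;
	return tca;
}
```

The point of the separate "disabled" bit in this sketch is that a driver can tell an explicit request to turn counters off apart from a frontend that simply has no preference.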
    • MAINTAINERS: put DYNAMIC INTERRUPT MODERATION in proper order · b0956956
      Lukas Bulwahn authored
      Commit 9b038086 ("docs: networking: convert DIM to RST") added a new
      file entry to the end of DYNAMIC INTERRUPT MODERATION, not following
      alphabetical order.
      
      So, ./scripts/checkpatch.pl -f MAINTAINERS complains:
      
        WARNING: Misordered MAINTAINERS entry - list file patterns in alphabetic
        order
        #5966: FILE: MAINTAINERS:5966:
        +F:      lib/dim/
        +F:      Documentation/networking/net_dim.rst
      
      Reorder the file entries to keep MAINTAINERS nicely ordered.

      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'wireguard-fixes' · d3f3e6ac
      David S. Miller authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 5.7-rc5
      
      With Ubuntu and Debian having backported this into their kernels, we're
      finally seeing testing from places we hadn't seen prior, which is nice.
      With that comes more fixes:
      
      1) The CI for PPC64 was running with extremely small stacks for 64-bit,
         causing spurious crashes in surprising places.
      
      2) There was an old leftover routing loop restriction, which no longer
         makes sense given the queueing architecture, and was causing problems
         for people who really did want nested routing.
      
      3) Not yielding our kthread on CONFIG_PREEMPT_VOLUNTARY systems caused
         RCU stalls and other issues, reported by Wang Jian, with the fix
         suggested by Sultan Alsawaf.
      
      4) Clang spewed warnings in a selftest for CONFIG_IPV6=n, reported by
         Arnd Bergmann.
      
      5) A complicated if statement was simplified to an assignment while also
         making the likely/unlikely hinting more correct and simple, and
         increasing readability, suggested by Sultan.
      
      Patches (2) and (3) have Fixes: lines and are probably good candidates
      for stable.
      ====================

      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: send/receive: use explicit unlikely branch instead of implicit coalescing · 243f2148
      Jason A. Donenfeld authored
      It's very unlikely that send will become true. It's nearly always false
      between 0 and 120 seconds of a session, and in most cases becomes true
      only between 120 and 121 seconds before becoming false again. So,
      unlikely(send) is clearly the right option here.
      
      What happened before was that we had this complex boolean expression
      with multiple likely and unlikely clauses nested. Since this is
      evaluated left-to-right anyway, the whole thing got converted to
      unlikely. So, we can clean this up to better represent what's going on.
      
      The generated code is the same.
      Suggested-by: default avatarSultan Alsawaf <sultan@kerneltoast.com>
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      243f2148
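The single explicit hint described in the commit above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the `likely()`/`unlikely()` macros are re-created here on top of the GCC/Clang `__builtin_expect` builtin, and `REKEY_AFTER_SECONDS`, `start_rekey()` and `maybe_rekey()` are hypothetical names, not the driver's code. The sketch also simplifies the condition to a plain threshold rather than the short window the message describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's branch hints (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical threshold: the session becomes due for action at 120 s. */
#define REKEY_AFTER_SECONDS 120u

static unsigned int rekeys_started;

/* Hypothetical helper standing in for whatever the rare path does. */
static void start_rekey(void)
{
	rekeys_started++;
}

/* Returns true if the rare path was taken for a session of this age. */
static bool maybe_rekey(uint32_t session_age_seconds)
{
	/* Compute the flag first, with no hints inside the expression... */
	bool send = session_age_seconds >= REKEY_AFTER_SECONDS;

	/* ...then annotate the one rare branch explicitly. */
	if (unlikely(send)) {
		start_rekey();
		return true;
	}
	return false;
}
```

Keeping the hint on the final `if` rather than scattering hints through a compound boolean expression makes the intended common case visible at a glance.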
    • Jason A. Donenfeld's avatar
      wireguard: selftests: initalize ipv6 members to NULL to squelch clang warning · 4fed818e
      Jason A. Donenfeld authored
      Without setting these to NULL, clang complains in certain
      configurations that have CONFIG_IPV6=n:
      
      In file included from drivers/net/wireguard/ratelimiter.c:223:
      drivers/net/wireguard/selftest/ratelimiter.c:173:34: error: variable 'skb6' is uninitialized when used here [-Werror,-Wuninitialized]
                      ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count);
                                                     ^~~~
      drivers/net/wireguard/selftest/ratelimiter.c:123:29: note: initialize the variable 'skb6' to silence this warning
              struct sk_buff *skb4, *skb6;
                                         ^
                                          = NULL
      drivers/net/wireguard/selftest/ratelimiter.c:173:40: error: variable 'hdr6' is uninitialized when used here [-Werror,-Wuninitialized]
                      ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count);
                                                           ^~~~
      drivers/net/wireguard/selftest/ratelimiter.c:125:22: note: initialize the variable 'hdr6' to silence this warning
              struct ipv6hdr *hdr6;
                                  ^
      
      We silence this warning by setting the variables to NULL as the warning
      suggests.
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fed818e
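The pattern behind the warning above, and the fix of initializing the pointers to NULL, can be sketched in plain C. Everything here is an assumption made for illustration: `struct packet`, the simplified two-argument `timings_test()`, `run_selftest()` and the `DEMO_IPV6` macro (which plays the role of CONFIG_IPV6) are hypothetical and are not the selftest's actual code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the packet buffers used by a selftest. */
struct packet {
	int family;
};

/* Hypothetical check: a NULL IPv6 packet means "skip the IPv6 half". */
static int timings_test(const struct packet *p4, const struct packet *p6)
{
	int tested = 0;

	if (p4)
		tested++;
	if (p6)
		tested++;
	return tested;
}

/* Returns how many address families were exercised. */
static int run_selftest(void)
{
	static struct packet v4 = { 4 };
	struct packet *skb4 = &v4;
	/* Initialized up front, so it is defined even when IPv6 is compiled out. */
	struct packet *skb6 = NULL;

#ifdef DEMO_IPV6 /* hypothetical switch standing in for CONFIG_IPV6 */
	static struct packet v6 = { 6 };
	skb6 = &v6;
#endif
	return timings_test(skb4, skb6);
}
```

Without the `= NULL` initializer, a build with the IPv6 branch compiled out would pass an indeterminate pointer to `timings_test()`, which is the situation the compiler flags.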
    • Jason A. Donenfeld's avatar
      wireguard: send/receive: cond_resched() when processing worker ringbuffers · 4005f5c3
      Jason A. Donenfeld authored
      Users with pathological hardware reported CPU stalls on
      CONFIG_PREEMPT_VOLUNTARY=y, because the ringbuffers would stay full, meaning
      these workers would never terminate. That turned out not to be okay on
      systems without forced preemption, which Sultan observed. This commit
      adds a cond_resched() to the bottom of each loop iteration, so that
      these workers don't hog the core. Note that we don't need this on the
      napi poll worker, since that terminates after its budget is expended.
      Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
      Reported-by: Wang Jian <larkwang@gmail.com>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4005f5c3
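The shape of the fix above, yielding at the bottom of each loop iteration, can be illustrated with a userspace analogue. This is a sketch under assumptions: `cond_resched()` is a kernel-only call, so POSIX `sched_yield()` is swapped in here (it always yields, whereas `cond_resched()` only reschedules when needed), and `struct ring`, `ring_pop()` and `drain_ring()` are hypothetical names.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical ring of pending work items. */
struct ring {
	const int *items;
	size_t head;
	size_t len;
};

/* Pops one item; returns 0 when the ring is empty. */
static int ring_pop(struct ring *r, int *out)
{
	if (r->head == r->len)
		return 0;
	*out = r->items[r->head++];
	return 1;
}

/* Drains the ring, giving up the CPU after every processed item. */
static long drain_ring(struct ring *r)
{
	long sum = 0;
	int item;

	while (ring_pop(r, &item)) {
		sum += item;    /* stand-in for per-packet work */
		sched_yield();  /* userspace analogue of cond_resched() */
	}
	return sum;
}
```

If the producer keeps the ring full, the loop never exits on its own, so the yield at the end of each iteration is what lets other tasks run.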
    • Jason A. Donenfeld's avatar
      wireguard: socket: remove errant restriction on looping to self · b673e24a
      Jason A. Donenfeld authored
      It's already possible to create two different interfaces and loop
      packets between them. This has always been possible with tunnels in the
      kernel, and isn't specific to wireguard. Therefore, the networking stack
      already needs to deal with that. At the very least, the packet winds up
      exceeding the MTU and is discarded at that point. So, since this is
      already something that happens, there's no need to forbid the not very
      exceptional case of routing a packet back to the same interface; this
      loop is no different than others, and we shouldn't special case it, but
      rather rely on generic handling of loops in general. This also makes it
      easier to do interesting things with wireguard such as onion routing.
      
      At the same time, we add a selftest for this, ensuring that both onion
      routing works and infinite routing loops do not crash the kernel. We
      also add a test case for wireguard interfaces nesting packets and
      sending traffic between each other, as well as the loop in this case
      too. We make sure to send some throughput-heavy traffic for this use
      case, to stress out any possible recursion issues with the locks around
      workqueues.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b673e24a
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use normal kernel stack size on ppc64 · a0fd7cc8
      Jason A. Donenfeld authored
      While at some point it might have made sense to be running these tests
      on ppc64 with 4k stacks, the kernel hasn't actually used 4k stacks on
      64-bit powerpc in a long time, and more interesting things that we test
      don't really work when we deviate from the default (16k). So, we stop
      pushing our luck in this commit, and return to the default instead of
      the minimum.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a0fd7cc8
    • Grygorii Strashko's avatar
      net: ethernet: ti: am65-cpsw-nuss: fix irqs type · 6f5c27f9
      Grygorii Strashko authored
      The K3 INTA driver, which is the source of TX/RX IRQs for CPSW NUSS, defines
      the IRQ triggering type as EDGE by default, but the triggering type for CPSW
      NUSS TX/RX IRQs has to be LEVEL, as the EDGE triggering type may cause
      unnecessary IRQ triggering and NAPI scheduling for empty queues. It was
      discovered with the RT kernel.
      
      Fix it by explicitly specifying CPSW NUSS TX/RX IRQ type as
      IRQF_TRIGGER_HIGH.
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f5c27f9
    • Geert Uytterhoeven's avatar
      ionic: Use debugfs_create_bool() to export bool · 0735ccc9
      Geert Uytterhoeven authored
      Currently bool ionic_cq.done_color is exported using
      debugfs_create_u8(), which requires a cast, preventing further compiler
      checks.
      
      Fix this by switching to debugfs_create_bool(), and dropping the cast.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Acked-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0735ccc9
    • Florian Fainelli's avatar
      net: dsa: Do not leave DSA master with NULL netdev_ops · 050569fc
      Florian Fainelli authored
      When ndo_get_phys_port_name() for the CPU port was added we introduced
      an early check for when the DSA master network device in
      dsa_master_ndo_setup() already implements ndo_get_phys_port_name(). When
      we perform the teardown operation in dsa_master_ndo_teardown() we would
      not be checking that cpu_dp->orig_ndo_ops was successfully allocated and
      non-NULL initialized.
      
      With network device drivers such as virtio_net, this leads to a NULL
      pointer dereference (NPD) as
      soon as the DSA switch hanging off of it gets torn down because we are
      now assigning the virtio_net device's netdev_ops a NULL pointer.
      
      Fixes: da7b9e9b ("net: dsa: Add ndo_get_phys_port_name() for CPU port")
      Reported-by: Allen Pais <allen.pais@oracle.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Allen Pais <allen.pais@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      050569fc
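The failure mode above (setup returns early without saving the original ops, teardown restores them anyway) can be modelled in userspace. This is a simplified sketch and not the DSA code: `struct ops`, `struct netdev`, `struct cpu_port`, `master_setup()` and `master_teardown()` are hypothetical stand-ins, and the guard shown is just one way to avoid restoring a pointer that was never saved.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced models of the structures involved. */
struct ops {
	int (*get_port_name)(void);
};

struct netdev {
	const struct ops *ops;
};

struct cpu_port {
	struct netdev *master;
	const struct ops *orig_ops; /* only set when setup replaced the ops */
};

static int port_name(void)
{
	return 0;
}

static int master_setup(struct cpu_port *cp)
{
	struct ops *new_ops;

	/* Early exit: the master already has the callback; orig_ops stays NULL. */
	if (cp->master->ops->get_port_name)
		return 0;

	new_ops = malloc(sizeof(*new_ops));
	if (!new_ops)
		return -1;

	*new_ops = *cp->master->ops;
	new_ops->get_port_name = port_name;
	cp->orig_ops = cp->master->ops;
	cp->master->ops = new_ops;
	return 0;
}

static void master_teardown(struct cpu_port *cp)
{
	/* Guard: restore only if setup really saved the original ops. */
	if (!cp->orig_ops)
		return;

	free((void *)cp->master->ops);
	cp->master->ops = cp->orig_ops;
	cp->orig_ops = NULL;
}
```

Without the guard in `master_teardown()`, the early-exit case would assign NULL to the master's ops, which matches the crash described for virtio_net.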
    • Vladimir Oltean's avatar
      net: dsa: remove duplicate assignment in dsa_slave_add_cls_matchall_mirred · 65722159
      Vladimir Oltean authored
      This was caused by a poor merge conflict resolution on my side. The
      "act = &cls->rule->action.entries[0];" assignment was already present in
      the code prior to the patch mentioned below.
      
      Fixes: e13c2075 ("net: dsa: refactor matchall mirred action to separate function")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65722159
    • Willem de Bruijn's avatar
      net: stricter validation of untrusted gso packets · 9274124f
      Willem de Bruijn authored
      Syzkaller again found a path to a kernel crash through bad gso input:
      a packet with transport header extending beyond skb_headlen(skb).
      
      Tighten validation at kernel entry:
      
      - Verify that the transport header lies within the linear section.
      
          To avoid pulling linux/tcp.h, verify just sizeof tcphdr.
          tcp_gso_segment will call pskb_may_pull (th->doff * 4) before use.
      
      - Match the gso_type against the ip_proto found by the flow dissector.
      
      Fixes: bfd5f4a3 ("packet: Add GSO/csum offload support.")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9274124f
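The two checks listed in the commit above can be sketched on a reduced packet model. This is an illustration only: `struct pkt`, `enum gso_kind` and both functions are hypothetical, the 20-byte constant is the size of a TCP header without options, and 6 and 17 are the IANA protocol numbers for TCP and UDP.

```c
#include <stdbool.h>
#include <stddef.h>

#define TCP_MIN_HDR_LEN 20u /* TCP header without options */
#define PROTO_TCP 6         /* IANA protocol number for TCP */
#define PROTO_UDP 17        /* IANA protocol number for UDP */

/* Hypothetical, simplified view of a packet buffer. */
struct pkt {
	size_t headlen;          /* bytes available in the linear section */
	size_t transport_offset; /* where the transport header starts */
};

/* Hypothetical segmentation kinds requested by the untrusted sender. */
enum gso_kind { GSO_KIND_TCP, GSO_KIND_UDP };

/* Check 1: a minimal transport header must fit in the linear section. */
static bool transport_header_in_linear(const struct pkt *p)
{
	if (p->transport_offset > p->headlen)
		return false;
	return p->headlen - p->transport_offset >= TCP_MIN_HDR_LEN;
}

/* Check 2: the requested GSO kind must agree with the dissected protocol. */
static bool gso_kind_matches_proto(enum gso_kind kind, int ip_proto)
{
	switch (kind) {
	case GSO_KIND_TCP:
		return ip_proto == PROTO_TCP;
	case GSO_KIND_UDP:
		return ip_proto == PROTO_UDP;
	}
	return false;
}
```

Comparing `headlen - transport_offset` only after confirming `transport_offset <= headlen` avoids unsigned underflow in the subtraction.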
    • Ahmed Abdelsalam's avatar
      seg6: fix SRH processing to comply with RFC8754 · 0cb7498f
      Ahmed Abdelsalam authored
      The Segment Routing Header (SRH) which defines the SRv6 dataplane is defined
      in RFC8754.
      
      RFC8754 (section 4.1) defines the SR source node behavior which encapsulates
      packets into an outer IPv6 header and SRH. The SR source node encodes the
      full list of Segments that defines the packet path in the SRH. Then, the
      first segment from the list of Segments is copied into the Destination address
      of the outer IPv6 header and the packet is sent to the first hop in its path
      towards the destination.
      
      If the Segment list has only one segment, the SR source node can omit the SRH
      as the only segment is added in the destination address.
      
      RFC8754 (section 4.1.1) defines the Reduced SRH, when a source does not
      require the entire SID list to be preserved in the SRH. A reduced SRH does
      not contain the first segment of the related SR Policy (the first segment is
      the one already in the DA of the IPv6 header), and the Last Entry field is
      set to n-2, where n is the number of elements in the SR Policy.
      
      RFC8754 (section 4.3.1.1) defines the SRH processing and the logic to
      validate the SRH (S09, S10, S11) which works for both reduced and
      non-reduced behaviors.
      
      This patch updates seg6_validate_srh() to validate the SRH as per RFC8754.
      Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cb7498f
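The S09 and S10 checks from RFC 8754 section 4.3.1.1 referenced above can be sketched as a standalone function. This is a sketch under assumptions, not the body of seg6_validate_srh(): `struct srh_fields` and `srh_fields_valid()` are hypothetical, and only the three header fields those steps use are modelled. Hdr Ext Len counts 8-octet units and each segment is 16 octets, hence the division by two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the SRH fields needed for the check. */
struct srh_fields {
	uint8_t hdr_ext_len;   /* 8-octet units, excluding the first 8 octets */
	uint8_t segments_left;
	uint8_t last_entry;    /* index of the last Segment List element */
};

static bool srh_fields_valid(const struct srh_fields *h)
{
	/* S09: largest Last Entry that fits in the header. */
	int max_last_entry = (h->hdr_ext_len / 2) - 1;

	/* S10, first half: Last Entry must point inside the header. */
	if (h->last_entry > max_last_entry)
		return false;

	/* S10, second half: the "+ 1" is what admits a reduced SRH. */
	if (h->segments_left > h->last_entry + 1)
		return false;

	return true;
}
```

For an SR Policy with n = 3 segments, a reduced SRH carries two segments (Hdr Ext Len 4) with Last Entry n-2 = 1 and Segments Left 2, which passes; the non-reduced form has Hdr Ext Len 6, Last Entry 2 and Segments Left 2, which also passes.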
    • David S. Miller's avatar
      Merge branch 'FDB-fixes-for-Felix-and-Ocelot-switches' · 6e0ddb65
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      FDB fixes for Felix and Ocelot switches
      
      This series fixes the following problems:
      - Dynamically learnt addresses never expiring (neither for Ocelot nor
        for Felix)
      - Half of the FDB not visible in 'bridge fdb show' (for Felix only)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e0ddb65
    • Vladimir Oltean's avatar
      net: mscc: ocelot: ANA_AUTOAGE_AGE_PERIOD holds a value in seconds, not ms · c0d7eccb
      Vladimir Oltean authored
      One may notice that automatically-learnt entries 'never' expire, even
      though the bridge configures the address age period at 300 seconds.
      
      Actually the value written to hardware corresponds to a time interval
      1000 times higher than intended, i.e. 83 hours.
      
      Fixes: a556c76a ("net: mscc: Add initial Ocelot switch support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0d7eccb
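The factor-of-1000 error above is easy to check with a small worked example. The function names are hypothetical, and the sketch covers only the unit conversion itself; the real driver may apply additional scaling that the commit message does not describe.

```c
/* Bridge ageing time arrives in milliseconds; the register holds seconds. */
static unsigned int age_period_seconds(unsigned int ageing_time_ms)
{
	return ageing_time_ms / 1000u;
}

/*
 * What happens when the millisecond value is written unconverted and the
 * hardware reads it as seconds: the resulting interval in whole hours.
 */
static unsigned int unconverted_interval_hours(unsigned int ageing_time_ms)
{
	return ageing_time_ms / 3600u;
}
```

With the 300-second ageing time from the message, the bridge passes 300000 ms; converted, that is 300 seconds, while the unconverted value read as seconds is 300000 s, roughly 83 hours.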
    • Vladimir Oltean's avatar
      net: dsa: ocelot: the MAC table on Felix is twice as large · 21ce7f3e
      Vladimir Oltean authored
      When running 'bridge fdb dump' on Felix, sometimes learnt and static MAC
      addresses would appear, sometimes they wouldn't.
      
      Turns out, the MAC table has 4096 entries on VSC7514 (Ocelot) and 8192
      entries on VSC9959 (Felix), so the existing code from the Ocelot common
      library only dumped half of Felix's MAC table. They are both organized
      as a 4-way set-associative TCAM, so we just need a single variable
      indicating the correct number of rows.
      
      Fixes: 56051948 ("net: dsa: ocelot: add driver for Felix switch family")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21ce7f3e
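The row arithmetic behind the fix above can be sketched as follows. The names `MAC_TABLE_WAYS`, `mac_table_rows()` and `entries_visited()` are hypothetical; only the entry counts (4096 and 8192) and the 4-way organization come from the commit message.

```c
#define MAC_TABLE_WAYS 4u /* both tables are 4-way set-associative */

/* Number of rows for a table with the given total entry count. */
static unsigned int mac_table_rows(unsigned int entries)
{
	return entries / MAC_TABLE_WAYS;
}

/* Entries a dump loop touches when it walks num_rows rows. */
static unsigned int entries_visited(unsigned int num_rows)
{
	unsigned int row, way, visited = 0;

	for (row = 0; row < num_rows; row++)
		for (way = 0; way < MAC_TABLE_WAYS; way++)
			visited++;
	return visited;
}
```

A dump loop hard-coded to the 1024 rows of a 4096-entry table visits only 4096 of the 8192 entries on the larger table, which is why a per-switch row count (2048 for the larger table) is enough to fix the dump.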
  3. 06 May, 2020 9 commits
  4. 05 May, 2020 3 commits