1. 19 Aug, 2018 3 commits
    • r8169: don't use MSI-X on RTL8106e · 7bb05b85
      Jian-Hong Pan authored
      Found that the ethernet network on the ASUS X441UAR doesn't come back on
      resume from suspend when using MSI-X.  The chip is RTL8106e - version 39.
      
      [   21.848357] libphy: r8169: probed
      [   21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127
      [   22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0
      [   29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
      [   63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
      [  124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
      
      Here is the ethernet controller in detail:
      
      02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07)
      	Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f]
      	Flags: bus master, fast devsel, latency 0, IRQ 16
      	I/O ports at e000 [size=256]
      	Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
      	Memory at e0000000 (64-bit, prefetchable) [size=16K]
      	Capabilities: <access denied>
      	Kernel driver in use: r8169
      	Kernel modules: r8169
      
      Falling back to MSI fixes the issue.
      
      Fixes: 6c6aa15f ("r8169: improve interrupt handling")
      Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: lan743x_ptp: convert to ktime_get_clocktai_ts64 · 0b3e776e
      Arnd Bergmann authored
      timekeeping_clocktai64() has been renamed to ktime_get_clocktai_ts64()
      for consistency with the other ktime_get_* access functions.
      
      Rename the new caller that has come up as well.
      
      Question: this is the only ptp driver that sets the hardware time
      to the current system time in TAI. Why does it do that?
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: always disable bh when taking tcf_lock · 653cd284
      Vlad Buslov authored
      Recently, ops->init() and ops->dump() of all actions were modified to
      always obtain tcf_lock when accessing private action state. Actions that
      don't depend on tcf_lock for synchronization with their data path use
      non-bh locking API. However, tcf_lock is also used to protect rate
      estimator stats in softirq context by timer callback.
      
      Change ops->init() and ops->dump() of all actions to disable bh when using
      tcf_lock to prevent deadlock reported by following lockdep warning:
      
      [  105.470398] ================================
      [  105.475014] WARNING: inconsistent lock state
      [  105.479628] 4.18.0-rc8+ #664 Not tainted
      [  105.483897] --------------------------------
      [  105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [  105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      [  105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0
      [  105.509696] {SOFTIRQ-ON-W} state was registered at:
      [  105.514925]   _raw_spin_lock+0x2c/0x40
      [  105.519022]   tcf_bpf_init+0x579/0x820 [act_bpf]
      [  105.523990]   tcf_action_init_1+0x4e4/0x660
      [  105.528518]   tcf_action_init+0x1ce/0x2d0
      [  105.532880]   tcf_exts_validate+0x1d8/0x200
      [  105.537416]   fl_change+0x55a/0x268b [cls_flower]
      [  105.542469]   tc_new_tfilter+0x748/0xa20
      [  105.546738]   rtnetlink_rcv_msg+0x56a/0x6d0
      [  105.551268]   netlink_rcv_skb+0x18d/0x200
      [  105.555628]   netlink_unicast+0x2d0/0x370
      [  105.559990]   netlink_sendmsg+0x3b9/0x6a0
      [  105.564349]   sock_sendmsg+0x6b/0x80
      [  105.568271]   ___sys_sendmsg+0x4a1/0x520
      [  105.572547]   __sys_sendmsg+0xd7/0x150
      [  105.576655]   do_syscall_64+0x72/0x2c0
      [  105.580757]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  105.586243] irq event stamp: 489296
      [  105.590084] hardirqs last  enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40
      [  105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50
      [  105.609277] softirqs last  enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0
      [  105.618001] softirqs last disabled at (489293): [<ffffffffb413a800>] irq_exit+0x140/0x190
      [  105.626813]
                     other info that might help us debug this:
      [  105.633976]  Possible unsafe locking scenario:
      
      [  105.640526]        CPU0
      [  105.643325]        ----
      [  105.646125]   lock(&(&p->tcfa_lock)->rlock);
      [  105.650747]   <Interrupt>
      [  105.653717]     lock(&(&p->tcfa_lock)->rlock);
      [  105.658514]
                      *** DEADLOCK ***
      
      [  105.665349] 1 lock held by swapper/16/0:
      [  105.669629]  #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550
      [  105.678200]
                     stack backtrace:
      [  105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664
      [  105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [  105.698626] Call Trace:
      [  105.701421]  <IRQ>
      [  105.703791]  dump_stack+0x92/0xeb
      [  105.707461]  print_usage_bug+0x336/0x34c
      [  105.711744]  mark_lock+0x7c9/0x980
      [  105.715500]  ? print_shortest_lock_dependencies+0x2e0/0x2e0
      [  105.721424]  ? check_usage_forwards+0x230/0x230
      [  105.726315]  __lock_acquire+0x923/0x26f0
      [  105.730597]  ? debug_show_all_locks+0x240/0x240
      [  105.735478]  ? mark_lock+0x493/0x980
      [  105.739412]  ? check_chain_key+0x140/0x1f0
      [  105.743861]  ? __lock_acquire+0x836/0x26f0
      [  105.748323]  ? lock_acquire+0x12e/0x290
      [  105.752516]  lock_acquire+0x12e/0x290
      [  105.756539]  ? est_fetch_counters+0x3c/0xa0
      [  105.761084]  _raw_spin_lock+0x2c/0x40
      [  105.765099]  ? est_fetch_counters+0x3c/0xa0
      [  105.769633]  est_fetch_counters+0x3c/0xa0
      [  105.773995]  est_timer+0x87/0x390
      [  105.777670]  ? est_fetch_counters+0xa0/0xa0
      [  105.782210]  ? lock_acquire+0x12e/0x290
      [  105.786410]  call_timer_fn+0x161/0x550
      [  105.790512]  ? est_fetch_counters+0xa0/0xa0
      [  105.795055]  ? del_timer_sync+0xd0/0xd0
      [  105.799249]  ? __lock_is_held+0x93/0x110
      [  105.803531]  ? mark_held_locks+0x20/0xe0
      [  105.807813]  ? _raw_spin_unlock_irq+0x29/0x40
      [  105.812525]  ? est_fetch_counters+0xa0/0xa0
      [  105.817069]  ? est_fetch_counters+0xa0/0xa0
      [  105.821610]  run_timer_softirq+0x3c4/0x9f0
      [  105.826064]  ? lock_acquire+0x12e/0x290
      [  105.830257]  ? __bpf_trace_timer_class+0x10/0x10
      [  105.835237]  ? __lock_is_held+0x25/0x110
      [  105.839517]  __do_softirq+0x11d/0x7bf
      [  105.843542]  irq_exit+0x140/0x190
      [  105.847208]  smp_apic_timer_interrupt+0xac/0x3b0
      [  105.852182]  apic_timer_interrupt+0xf/0x20
      [  105.856628]  </IRQ>
      [  105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0
      [  105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53
      [  105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
      [  105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1
      [  105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac
      [  105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000
      [  105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
      [  105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b
      [  105.929988]  ? trace_hardirqs_on_caller+0x141/0x2d0
      [  105.935232]  do_idle+0x28e/0x320
      [  105.938817]  ? arch_cpu_idle_exit+0x40/0x40
      [  105.943361]  ? mark_lock+0x8c1/0x980
      [  105.947295]  ? _raw_spin_unlock_irqrestore+0x32/0x60
      [  105.952619]  cpu_startup_entry+0xc2/0xd0
      [  105.956900]  ? cpu_in_idle+0x20/0x20
      [  105.960830]  ? _raw_spin_unlock_irqrestore+0x32/0x60
      [  105.966146]  ? trace_hardirqs_on_caller+0x141/0x2d0
      [  105.971391]  start_secondary+0x2b5/0x360
      [  105.975669]  ? set_cpu_sibling_map+0x1330/0x1330
      [  105.980654]  secondary_startup_64+0xa5/0xb0
      
      Taking tcf_lock in sample action with bh disabled causes lockdep to issue a
      warning regarding possible irq lock inversion dependency between tcf_lock,
      and psample_groups_lock that is taken when holding tcf_lock in sample init:
      
      [  162.108959]  Possible interrupt unsafe locking scenario:
      
      [  162.116386]        CPU0                    CPU1
      [  162.121277]        ----                    ----
      [  162.126162]   lock(psample_groups_lock);
      [  162.130447]                                local_irq_disable();
      [  162.136772]                                lock(&(&p->tcfa_lock)->rlock);
      [  162.143957]                                lock(psample_groups_lock);
      [  162.150813]   <Interrupt>
      [  162.153808]     lock(&(&p->tcfa_lock)->rlock);
      [  162.158608]
                      *** DEADLOCK ***
      
      In order to prevent potential lock inversion dependency between tcf_lock
      and psample_groups_lock, extract call to psample_group_get() from tcf_lock
      protected section in sample action init function.
      
      Fixes: 4e232818 ("net: sched: act_mirred: remove dependency on rtnl lock")
      Fixes: 764e9a24 ("net: sched: act_vlan: remove dependency on rtnl lock")
      Fixes: 729e0126 ("net: sched: act_tunnel_key: remove dependency on rtnl lock")
      Fixes: d7728495 ("net: sched: act_sample: remove dependency on rtnl lock")
      Fixes: e8917f43 ("net: sched: act_gact: remove dependency on rtnl lock")
      Fixes: b6a2b971 ("net: sched: act_csum: remove dependency on rtnl lock")
      Fixes: 2142236b ("net: sched: act_bpf: remove dependency on rtnl lock")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 18 Aug, 2018 3 commits
    • ip6_vti: simplify stats handling in vti6_xmit · bb107456
      Haishuang Yan authored
      Same as ip_vti, use iptunnel_xmit_stats to update stats in tunnel xmit
      code path.
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 6e3bf9b0
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-08-18
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit
         restrictions, from Yonghong.
      
      2) Fix a suspicious RCU rcu_dereference_check() warning triggered
         from removing a device's XDP memory allocator by using the correct
         rhashtable lookup function, from Tariq.
      
      3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races
         as well as enforcing module aliases for ULPs. Another fix for BPF
         map redirect to make them work again with tail calls, from Daniel.
      
      4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 3fe49d69
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS fixes for net
      
      The following patchset contains Netfilter/IPVS fixes for your net tree:
      
      1) Infinite loop in IPVS when net namespace is released, from
         Tan Hu.
      
      2) Do not show negative timeouts in ip_vs_conn by using the new
         jiffies_delta_to_msecs(), patches from Matteo Croce.
      
      3) Set F_IFACE flag for linklocal addresses in ip6t_rpfilter,
         from Florian Westphal.
      
      4) Fix overflow in set size allocation, from Taehee Yoo.
      
      5) Use netlink_dump_start() from ctnetlink to fix memleak from
         the error path, again from Florian.
      
      6) Register nfnetlink_subsys in last place, otherwise the netns
         init path may lose the race and see net->nft uninitialized data.
         This also reverts the previous attempt to fix this by increasing
         the netns refcount, patches from Florian.
      
      7) Remove conntrack entries on layer 4 protocol tracker module
         removal, from Florian.
      
      8) Use GFP_KERNEL_ACCOUNT for xtables blob allocation, from
         Michal Hocko.
      
      9) Get tproxy documentation in sync with existing codebase,
         from Mate Eckl.
      
      10) Honor preset layer 3 protocol via ctx->family in the new nft_ct
          timeout infrastructure, from Harsha Sharma.
      
      11) Let uapi nfnetlink_osf.h compile standalone with no errors,
          from Dmitry V. Levin.
      
      12) Missing braces compilation warning in nft_tproxy, patch from
          Mate Eckl.
      
      13) Disregard bogus check to bail out on non-anonymous sets from
          the dynamic set update extension.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 Aug, 2018 3 commits
    • bpf: fix redirect to map under tail calls · f6069b9a
      Daniel Borkmann authored
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to ensure that buggy programs
      using the bpf_redirect_map() helper call do not leave stale maps behind.
      Idea was to add a map_owner cookie into the per CPU struct redirect_info
      which was set to prog->aux by the prog making the helper call as a
      proof that the map is not stale since the prog is implicitly holding
      a reference to it. This owner cookie could later on get compared with
      the program calling into BPF whether they match and therefore the
      redirect could proceed with processing the map safely.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved since the original caller's prog->aux pointer does not have
      to match the one from one of the progs out of the tail call chain,
      and therefore the xdp buffer will be dropped instead of redirected.
      A way around that would be to fix the issue differently (which also
      allows removing related work in the fast path at the same time): once
      the lifetime of a redirect map has come to its end we use its map
      free callback where we need to wait on synchronize_rcu() for current
      outstanding xdp buffers and remove such a map pointer from the
      redirect info if found to be present. At that time no program is
      using this map anymore so we simply invalidate the map pointers to
      NULL iff they previously pointed to that instance while making sure
      that the redirect path only reads out the map once.
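
The invalidation step described above can be pictured with a small user-space sketch. It is not the kernel code: the demo_ names, the fixed CPU count and the use of C11 atomics in place of per-CPU data and cmpxchg() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4              /* hypothetical CPU count for the sketch */

struct demo_map { int id; };        /* stand-in for a redirect map */

/* One "redirect info" slot per CPU, each caching a map pointer. */
_Atomic(struct demo_map *) demo_ri_map[DEMO_NR_CPUS];

/* Called when `map` is being freed: clear only the slots still pointing at it. */
void demo_clear_redirect_map(struct demo_map *map)
{
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
        struct demo_map *expected = map;

        /* Swap to NULL iff the slot currently holds `map`; other maps stay. */
        atomic_compare_exchange_strong(&demo_ri_map[cpu], &expected, NULL);
    }
}

/* Redirect path: read the slot exactly once and work on the local copy. */
struct demo_map *demo_redirect_lookup(int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    return atomic_load(&demo_ri_map[cpu]);
}
```

A caller that gets NULL back from demo_redirect_lookup() would simply drop the buffer; because the pointer is read only once, it cannot observe the map and then see it vanish between two reads.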
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • r8169: add missing Kconfig dependency · bfdd19ad
      Heiner Kallweit authored
      Now that we switched the r8169 driver to use phylib, there's a
      dependency on the Realtek PHY drivers. This dependency was missing
      in Kconfig.
      Reported-by: Jouni Mettälä <jtmettala@gmail.com>
      Fixes: f1e911d5 ("r8169: add basic phylib support")
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tools/bpf: fix bpf selftest test_cgroup_storage failure · a85da34e
      Yonghong Song authored
      The bpf selftest test_cgroup_storage failed in one of
      our production test servers.
        # sudo ./test_cgroup_storage
        Failed to create map: Operation not permitted
      
      It turns out this is due to insufficient locked memory,
      with the system default being 16KB.
      
      Similar to other self tests, let us arm the process
      with unlimited locked memory. With this change,
      the test passed.
        # sudo ./test_cgroup_storage
        test_cgroup_storage:PASS
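
For readers unfamiliar with the rlimit involved, the following is a minimal sketch of lifting the locked-memory limit from user space. The function name is hypothetical and this is not the selftest's code; only setrlimit()/getrlimit() and RLIMIT_MEMLOCK are standard.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Try to lift the locked-memory limit so that BPF map creation is not
 * rejected for lack of lockable memory. Returns 0 on success and -1
 * (errno set by setrlimit) on failure, e.g. when the process lacks the
 * privilege to raise the hard limit.
 */
int demo_raise_memlock_limit(void)
{
    struct rlimit lim = { RLIM_INFINITY, RLIM_INFINITY };
    int rc;

    if (setrlimit(RLIMIT_MEMLOCK, &lim) != 0)
        return -1;

    /* Sanity check: the kernel should now report the new soft limit. */
    rc = getrlimit(RLIMIT_MEMLOCK, &lim);
    assert(rc == 0 && lim.rlim_cur == RLIM_INFINITY);
    (void)rc;
    return 0;
}
```

A test program would call this once at startup, before creating any maps, and bail out with a clear message if it returns -1.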
      
      Fixes: 68cfa3ac ("selftests/bpf: add a cgroup storage test")
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 16 Aug, 2018 31 commits
    • Merge branch 'sockmap-ulp-fixes' · cbb2fb13
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Batch of various fixes related to BPF sockmap and ULP, including
      adding module alias to restrict module requests, races and memory
      leaks in sockmap code. For details please refer to the individual
      patches. Thanks!
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist · 585f5a62
      Daniel Borkmann authored
      The current code in sock_map_ctx_update_elem() allows for BPF_EXIST
      and BPF_NOEXIST map update flags. While on array-like maps this approach
      is rather uncommon, e.g. bpf_fd_array_map_update_elem() and others
      enforce map update flags to be BPF_ANY such that xchg() can be used
      directly, the current implementation in sock map does not guarantee
      that such operation with BPF_EXIST / BPF_NOEXIST is atomic.
      
      The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the
      socket from the slot which is then tested for NULL / non-NULL. However
      later after __sock_map_ctx_update_elem(), the actual update is done
      through osock = xchg(&stab->sock_map[i], sock). Problem is that in
      the meantime a different CPU could have updated / deleted a socket
      on that specific slot and thus flag constraints won't hold anymore.
      
      I've been thinking whether best would be to just break UAPI and do
      an enforcement of BPF_ANY to check if someone actually complains,
      however trouble is that already in BPF kselftest we use BPF_NOEXIST
      for the map update, and therefore it might have been copied into
      applications already. The fix to keep the current behavior intact
      would be to add a map lock similar to the sock hash bucket lock only
      for covering the whole map.
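
The idea of a single lock covering both the flag check and the slot update can be sketched in user space as follows. This is an illustration under assumptions, not the sockmap code: the demo_ names and the pthread mutex are stand-ins, and only the flag values and error codes mirror the BPF map update semantics.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Values mirror the BPF UAPI flags; the DEMO_ names are local to this sketch. */
enum { DEMO_ANY = 0, DEMO_NOEXIST = 1, DEMO_EXIST = 2 };

#define DEMO_SLOTS 8

struct demo_sock_map {
    pthread_mutex_t lock;           /* covers the whole map */
    void *slot[DEMO_SLOTS];
};

/*
 * Check the flag constraint and publish the new entry under one lock, so
 * no other updater can change the slot between the test and the store.
 * On success the previous entry (possibly NULL) is returned via *old.
 */
int demo_map_update(struct demo_sock_map *m, size_t i, void *sk,
                    int flags, void **old)
{
    int err = 0;

    assert(i < DEMO_SLOTS);
    pthread_mutex_lock(&m->lock);
    if (flags == DEMO_NOEXIST && m->slot[i] != NULL) {
        err = -EEXIST;
    } else if (flags == DEMO_EXIST && m->slot[i] == NULL) {
        err = -ENOENT;
    } else {
        *old = m->slot[i];
        m->slot[i] = sk;
    }
    pthread_mutex_unlock(&m->lock);
    return err;
}
```

Without the lock, two updaters using DEMO_NOEXIST could both see an empty slot and both succeed, which is the race the commit message describes.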
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, sockmap: fix map elem deletion race with smap_stop_sock · 166ab6f0
      Daniel Borkmann authored
      The smap_start_sock() and smap_stop_sock() are each protected under
      the sock->sk_callback_lock from their call-sites except in the case
      of sock_map_delete_elem() where we drop the old socket from the map
      slot. This is racy because the same sock could be part of multiple
      sock maps, so we run smap_stop_sock() in parallel, and given at that
      point psock->strp_enabled might be true on both CPUs, we might for
      example wrongly restore the sk->sk_data_ready / sk->sk_write_space.
      Therefore, hold the sock->sk_callback_lock as well on delete. Looks
      like 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add
      multi-map support") had this right, but later on e9db4ef6 ("bpf:
      sockhash fix omitted bucket lock in sock_close") removed it again
      from delete leaving this smap_stop_sock() instance unprotected.
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, sockmap: fix leakage of smap_psock_map_entry · d40b0116
      Daniel Borkmann authored
      While working on sockmap I noticed that we do not always kfree the
      struct smap_psock_map_entry list elements which track psocks attached
      to maps. In the case of sock_hash_ctx_update_elem(), these map entries
      are allocated outside of __sock_map_ctx_update_elem() with their
      linkage to the socket hash table filled. In the case of sock array,
      the map entries are allocated inside of __sock_map_ctx_update_elem()
      and added with their linkage to the psock->maps. Both additions are
      under psock->maps_lock each.
      
      Now, we drop these elements from their psock->maps list in a few
      occasions: i) in sock array via smap_list_map_remove() when an entry
      is either deleted from the map from user space, or updated via
      user space or BPF program where we drop the old socket at that map
      slot, or the sock array is freed via sock_map_free() and drops all
      its elements; ii) for sock hash via smap_list_hash_remove() in exactly
      the same occasions as just described for sock array; iii) in the
      bpf_tcp_close() where we remove the elements from the list via
      psock_map_pop() and iterate over them dropping themselves from either
      sock array or sock hash; and last but not least iv) once again in
      smap_gc_work() which is a callback for deferring the work once the
      psock refcount hit zero and thus the socket is being destroyed.
      
      Problem is that the only case where we kfree() the list entry is
      in case iv), which at that point should have an empty list in
      normal cases. So in cases i) to iii) we unlink the elements
      without freeing them, after which they are out of our reach. Hence the
      fix is to properly kfree() them as well to stop the leakage. Given these
      are all handled under psock->maps_lock there is no need for deferred
      RCU freeing.
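
The leak pattern (unlinking a list element without freeing it) and its fix can be shown with a plain singly linked list. This is a user-space sketch with hypothetical demo_ names, using malloc()/free() where the kernel uses kmalloc()/kfree(); locking is left out.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_entry {
    struct demo_entry *next;
    int map_id;                     /* which map this entry links the socket to */
};

/* Push a new entry at the head; returns 0 on success, -1 if out of memory. */
int demo_list_add(struct demo_entry **head, int map_id)
{
    struct demo_entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->map_id = map_id;
    e->next = *head;
    *head = e;
    return 0;
}

/* Unlink every entry for map_id and free it; returns the number freed. */
int demo_list_remove(struct demo_entry **head, int map_id)
{
    int freed = 0;

    assert(head != NULL);
    while (*head) {
        struct demo_entry *e = *head;

        if (e->map_id == map_id) {
            *head = e->next;        /* unlink ... */
            free(e);                /* ... and release; unlinking alone leaks */
            freed++;
        } else {
            head = &e->next;
        }
    }
    return freed;
}
```

Dropping the free(e) line reproduces the bug class: the list looks correct afterwards, but each removed entry stays allocated with nothing referencing it.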
      
      I later also ran with kmemleak detector and it confirmed the finding
      as well where in the state before the fix the object goes unreferenced
      while after the patch no kmemleak report related to BPF showed up.
      
        [...]
        unreferenced object 0xffff880378eadae0 (size 64):
          comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
          hex dump (first 32 bytes):
            00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
            50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  PMu]............
          backtrace:
            [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
            [<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60
            [<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960
            [<000000002ef89e83>] 0xffffffffffffffff
        unreferenced object 0xffff880378ead240 (size 64):
          comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
          hex dump (first 32 bytes):
            00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
            00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  .Du]............
          backtrace:
            [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
            [<0000000030e37a3a>] sock_map_update_elem+0x125/0x240
            [<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0
            [<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360
            [<0000000000763660>] do_syscall_64+0x9a/0x300
            [<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
            [<000000002ef89e83>] 0xffffffffffffffff
        [...]
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Fixes: 54fedb42 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps")
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach · 90545cdc
      Daniel Borkmann authored
      I found that in BPF sockmap programs, once we either delete a socket
      from the map or update a map slot such that the old socket is purged
      from the map, these sockets can never get reattached into a map
      even though their related psock has been dropped entirely at that
      point.
      
      Reason is that tcp_cleanup_ulp() leaves the old icsk->icsk_ulp_ops
      intact, so that on the next tcp_set_ulp_id() the kernel returns an
      -EEXIST thinking there is still some active ULP attached.
      
      BPF sockmap is the only one that has this issue as the other user,
      kTLS, only calls tcp_cleanup_ulp() from tcp_v4_destroy_sock() whereas
      sockmap semantics allow dropping the socket from the map with all
      related psock state being cleaned up.
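
The effect of a leftover ops pointer can be modelled in a few lines. The structures and function names below are hypothetical stand-ins rather than the kernel's; the sketch only shows why cleanup has to reset the pointer for a later attach to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_ulp_ops { const char *name; };

struct demo_sock {
    const struct demo_ulp_ops *ulp_ops;   /* NULL means no ULP attached */
};

/* Attach a ULP; refuse if one is (still) recorded on the socket. */
int demo_set_ulp(struct demo_sock *sk, const struct demo_ulp_ops *ops)
{
    assert(sk != NULL && ops != NULL);
    if (sk->ulp_ops)
        return -EEXIST;
    sk->ulp_ops = ops;
    return 0;
}

/* Detach the ULP. Resetting the pointer is what allows a later re-attach. */
void demo_cleanup_ulp(struct demo_sock *sk)
{
    assert(sk != NULL);
    sk->ulp_ops = NULL;
}
```

If demo_cleanup_ulp() left ulp_ops untouched, every subsequent demo_set_ulp() on that socket would return -EEXIST, which matches the symptom described above.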
      
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tcp, ulp: add alias for all ulp modules · 037b0b86
      Daniel Borkmann authored
      Let's not turn the TCP ULP lookup into an arbitrary module loader as
      we only intend to load ULP modules through this mechanism, not other
      unrelated kernel modules:
      
        [root@bar]# cat foo.c
        #include <sys/types.h>
        #include <sys/socket.h>
        #include <linux/tcp.h>
        #include <linux/in.h>
      
        int main(void)
        {
            int sock = socket(PF_INET, SOCK_STREAM, 0);
            setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
            return 0;
        }
      
        [root@bar]# gcc foo.c -O2 -Wall
        [root@bar]# lsmod | grep sctp
        [root@bar]# ./a.out
        [root@bar]# lsmod | grep sctp
        sctp                 1077248  4
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
        [root@bar]#
      
      Fix it by adding module alias to TCP ULP modules, so probing module
      via request_module() will be limited to tcp-ulp-[name]. The existing
      modules like kTLS will load fine given tcp-ulp-tls alias, but others
      will fail to load:
      
        [root@bar]# lsmod | grep sctp
        [root@bar]# ./a.out
        [root@bar]# lsmod | grep sctp
        [root@bar]#
      
      Sockmap is not affected by this since it's either built-in or not.
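
To make the tcp-ulp-[name] restriction concrete, here is a small user-space sketch of how a requested ULP name maps to a prefixed alias. The helper names are hypothetical; the kernel performs the equivalent formatting when it calls request_module().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_ULP_PREFIX "tcp-ulp-"

/*
 * Build the alias a ULP lookup would request for `name`, e.g. "tls"
 * becomes "tcp-ulp-tls". Returns 0 on success, -1 if it does not fit.
 */
int demo_ulp_module_alias(char *buf, size_t len, const char *name)
{
    int n;

    assert(buf != NULL && name != NULL);
    n = snprintf(buf, len, DEMO_ULP_PREFIX "%s", name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* True if `alias` carries the ULP prefix. */
int demo_is_ulp_alias(const char *alias)
{
    return strncmp(alias, DEMO_ULP_PREFIX, sizeof(DEMO_ULP_PREFIX) - 1) == 0;
}
```

With this scheme a request for "sctp" turns into "tcp-ulp-sctp", which no module declares as an alias, so nothing is loaded; a request for "tls" turns into "tcp-ulp-tls", which the kTLS module does declare.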
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix a rcu usage warning in bpf_prog_array_copy_core() · 965931e3
      Yonghong Song authored
      Commit 394e40a2 ("bpf: extend bpf_prog_array to store pointers
      to the cgroup storage") refactored the bpf_prog_array_copy_core()
      to accommodate new structure bpf_prog_array_item which contains
      bpf_prog array itself.
      
      In the old code, we had
         perf_event_query_prog_array():
           mutex_lock(...)
           bpf_prog_array_copy_call():
             prog = rcu_dereference_check(array, 1)->progs
             bpf_prog_array_copy_core(prog, ...)
           mutex_unlock(...)
      
      With the above commit, we had
         perf_event_query_prog_array():
           mutex_lock(...)
           bpf_prog_array_copy_call():
             bpf_prog_array_copy_core(array, ...):
               item = rcu_dereference(array)->items;
               ...
           mutex_unlock(...)
      
      The new code will trigger a lockdep rcu checking warning.
      The fix is to change rcu_dereference() to rcu_dereference_check()
      to prevent such a warning.
      
      Reported-by: syzbot+6e72317008eef84a216b@syzkaller.appspotmail.com
      Fixes: 394e40a2 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage")
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM · 817b89be
      Jesper Dangaard Brouer authored
      It is common XDP practice to unload/detach the XDP bpf program,
      when the XDP sample program is Ctrl-C interrupted (SIGINT) or
      killed (SIGTERM).
      
      The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info,
      forgot to trap signal SIGTERM (which is the default signal used
      by the kill command).
      
      This was discovered by Red Hat QA, whose automated scripts depend
      on killing the XDP sample program after a timeout period.
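
Trapping both signals takes one extra call. The sketch below is an assumption-laden illustration, not the samples' code: the demo_ names are hypothetical, and the handler only sets a flag where a real sample would detach its XDP program and exit.

```c
#include <assert.h>
#include <signal.h>

/* Set by the handler; a real sample would detach its XDP program here. */
volatile sig_atomic_t demo_stop_requested;

static void demo_on_signal(int signo)
{
    (void)signo;
    demo_stop_requested = 1;
}

/* Install the same handler for Ctrl-C (SIGINT) and kill's default (SIGTERM). */
int demo_install_exit_handlers(void)
{
    if (signal(SIGINT, demo_on_signal) == SIG_ERR)
        return -1;
    if (signal(SIGTERM, demo_on_signal) == SIG_ERR)
        return -1;
    return 0;
}
```

The main loop of a sample would then poll demo_stop_requested and run its cleanup path when the flag is set, regardless of which of the two signals arrived.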
      
      Fixes: fad3917e ("samples/bpf: add cpumap sample program xdp_redirect_cpu")
      Fixes: 0fca931a ("samples/bpf: program demonstrating access to xdp_rxq_info")
      Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/xdp: Fix suspicious RCU usage warning · 21b172ee
      Tariq Toukan authored
      Fix the warning below by calling rhashtable_lookup_fast.
      Also, make some code movements for better quality and human
      readability.
      
      [  342.450870] WARNING: suspicious RCU usage
      [  342.455856] 4.18.0-rc2+ #17 Tainted: G           O
      [  342.462210] -----------------------------
      [  342.467202] ./include/linux/rhashtable.h:481 suspicious rcu_dereference_check() usage!
      [  342.476568]
      [  342.476568] other info that might help us debug this:
      [  342.476568]
      [  342.486978]
      [  342.486978] rcu_scheduler_active = 2, debug_locks = 1
      [  342.495211] 4 locks held by modprobe/3934:
      [  342.500265]  #0: 00000000e23116b2 (mlx5_intf_mutex){+.+.}, at:
      mlx5_unregister_interface+0x18/0x90 [mlx5_core]
      [  342.511953]  #1: 00000000ca16db96 (rtnl_mutex){+.+.}, at: unregister_netdev+0xe/0x20
      [  342.521109]  #2: 00000000a46e2c4b (&priv->state_lock){+.+.}, at: mlx5e_close+0x29/0x60
      [mlx5_core]
      [  342.531642]  #3: 0000000060c5bde3 (mem_id_lock){+.+.}, at: xdp_rxq_info_unreg+0x93/0x6b0
      [  342.541206]
      [  342.541206] stack backtrace:
      [  342.547075] CPU: 12 PID: 3934 Comm: modprobe Tainted: G           O      4.18.0-rc2+ #17
      [  342.556621] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
      [  342.565606] Call Trace:
      [  342.568861]  dump_stack+0x78/0xb3
      [  342.573086]  xdp_rxq_info_unreg+0x3f5/0x6b0
      [  342.578285]  ? __call_rcu+0x220/0x300
      [  342.582911]  mlx5e_free_rq+0x38/0xc0 [mlx5_core]
      [  342.588602]  mlx5e_close_channel+0x20/0x120 [mlx5_core]
      [  342.594976]  mlx5e_close_channels+0x26/0x40 [mlx5_core]
      [  342.601345]  mlx5e_close_locked+0x44/0x50 [mlx5_core]
      [  342.607519]  mlx5e_close+0x42/0x60 [mlx5_core]
      [  342.613005]  __dev_close_many+0xb1/0x120
      [  342.617911]  dev_close_many+0xa2/0x170
      [  342.622622]  rollback_registered_many+0x148/0x460
      [  342.628401]  ? __lock_acquire+0x48d/0x11b0
      [  342.633498]  ? unregister_netdev+0xe/0x20
      [  342.638495]  rollback_registered+0x56/0x90
      [  342.643588]  unregister_netdevice_queue+0x7e/0x100
      [  342.649461]  unregister_netdev+0x18/0x20
      [  342.654362]  mlx5e_remove+0x2a/0x50 [mlx5_core]
      [  342.659944]  mlx5_remove_device+0xe5/0x110 [mlx5_core]
      [  342.666208]  mlx5_unregister_interface+0x39/0x90 [mlx5_core]
      [  342.673038]  cleanup+0x5/0xbfc [mlx5_core]
      [  342.678094]  __x64_sys_delete_module+0x16b/0x240
      [  342.683725]  ? do_syscall_64+0x1c/0x210
      [  342.688476]  do_syscall_64+0x5a/0x210
      [  342.693025]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 8d5d8852 ("xdp: rhashtable with allocator ID to pointer mapping")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      21b172ee
    • Yuval Shaia's avatar
      net/mlx5e: Delete unneeded function argument · 54c73f86
      Yuval Shaia authored
      priv argument is not used by the function, delete it.
      
      Fixes: a8984281 ("net/mlx5e: Merge per priority stats groups")
      Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54c73f86
    • Ivan Khoronzhuk's avatar
      Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb · 70fd8036
      Ivan Khoronzhuk authored
      If CBS parameters calculated for 1000Mb are used on a 100Mb port
      w/o h/w offload (for cpsw offload it doesn't matter), CBS works
      incorrectly. According to the example and the testing board, the second
      port is a 100Mb interface. Replace them with values recalculated for a
      100Mb interface. This allows the same command to be used for the CBS
      software implementation on the board in the example.
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70fd8036
    • Kees Cook's avatar
      isdn: Disable IIOCDBGVAR · 5e22002a
      Kees Cook authored
      It was possible to directly leak the kernel address where the isdn_dev
      structure pointer was stored. This is a kernel ASLR bypass for anyone
      with access to the ioctl. The code had been present since the beginning
      of git history, though this shouldn't ever be needed for normal operation,
      therefore remove it.
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Karsten Keil <isdn@linux-pingi.de>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e22002a
    • Lad, Prabhakar's avatar
      net: dsa: add support for ksz9897 ethernet switch · 45316818
      Lad, Prabhakar authored
      ksz9477 is a superset of the ksz9xx series, so the driver just works
      out of the box for the ksz9897 chip with this patch.
      Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45316818
    • Toshiaki Makita's avatar
      veth: Free queues on link delete · 7797b93b
      Toshiaki Makita authored
      David Ahern reported a memory leak in veth.
      
      =======================================================================
      $ cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff8800354d5c00 (size 1024):
        comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<(____ptrval____)>] kmemleak_alloc+0x70/0x94
          [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52
          [<(____ptrval____)>] __kmalloc+0x101/0x142
          [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
          [<(____ptrval____)>] veth_newlink+0x147/0x3ac [veth]
          ...
      unreferenced object 0xffff88002e009c00 (size 1024):
        comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<(____ptrval____)>] kmemleak_alloc+0x70/0x94
          [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52
          [<(____ptrval____)>] __kmalloc+0x101/0x142
          [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
          [<(____ptrval____)>] veth_newlink+0x219/0x3ac [veth]
      =======================================================================
      
      veth_rq allocated in veth_newlink() was not freed on dellink.
      
      We need to free them after veth_close() so that no packets can
      reference the queues afterwards. Thus free them in veth_dev_free() in
      the same way as freeing stats structure (vstats).
      
      Also move queues allocation to veth_dev_init() to be in line with stats
      allocation.
      
      Fixes: 638264dc ("veth: Support per queue XDP ring")
      Reported-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7797b93b
    • Cong Wang's avatar
      ila: make lockdep happy again · ff93bca7
      Cong Wang authored
      Previously, alloc_ila_locks() and bucket_table_alloc() call
      spin_lock_init() separately, therefore they have two different
      lock names and lock class keys. However, after commit b8932817
      ("ila: Call library function alloc_bucket_locks") they both call
      helper alloc_bucket_spinlocks() which now only has one lock
      name and lock class key. This causes a few bogus lockdep warnings
      as reported by syzbot.
      
      Fix this by making alloc_bucket_locks() a macro and pass declaration
      name as lock name and a static lock class key inside the macro.
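
      The macro idea can be illustrated with a small userspace sketch. This
      is not the kernel code: struct class_key stands in for a lock class
      key, and init_locks() and INIT_LOCKS() are hypothetical names. The
      point is only that each macro expansion declares its own static key:

```c
#include <stddef.h>

/* Stand-in for a lock class key: only its address matters. */
struct class_key {
	char dummy;
};

struct bucket_lock {
	const char *name;
	const struct class_key *key;
};

/* Shared helper: every lock in the array gets the caller's name and key. */
static void init_locks(struct bucket_lock *locks, size_t n,
		       const char *name, const struct class_key *key)
{
	for (size_t i = 0; i < n; i++) {
		locks[i].name = name;
		locks[i].key = key;
	}
}

/*
 * Each expansion declares its own static key and passes the declaration
 * name as the lock name, so two call sites never share a class.
 */
#define INIT_LOCKS(locks, n)					\
	do {							\
		static struct class_key key_;			\
		init_locks((locks), (n), #locks, &key_);	\
	} while (0)
```

      If init_locks() declared the key itself, every caller would end up
      with the same key address, which corresponds to the single shared
      lock class described above.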
      
      Fixes: b8932817 ("ila: Call library function alloc_bucket_locks")
      Reported-by: <syzbot+b66a5a554991a8ed027c@syzkaller.appspotmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff93bca7
    • Vlad Buslov's avatar
      net: sched: act_ife: always release ife action on init error · 32039eac
      Vlad Buslov authored
      Action init API was changed to always take reference to action, even when
      overwriting existing action. Substitute conditional action release, which
      was executed only if action is newly created, with unconditional release in
      tcf_ife_init() error handling code to prevent double free or memory leak in
      case of overwrite.
      
      Fixes: 4e8ddd7f ("net: sched: don't release reference on action overwrite")
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32039eac
    • Fabrizio Castro's avatar
      5f34f69e
    • Hangbin Liu's avatar
      cls_matchall: fix tcf_unbind_filter missing · a51c76b4
      Hangbin Liu authored
      Fix tcf_unbind_filter missing in cls_matchall as this will trigger
      WARN_ON() in cbq_destroy_class().
      
      Fixes: fd62d9f5 ("net/sched: matchall: Fix configuration race")
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a51c76b4
    • Pablo Neira Ayuso's avatar
      netfilter: nft_dynset: allow dynamic updates of non-anonymous set · feb9f55c
      Pablo Neira Ayuso authored
      This check is superfluous since it breaks valid configurations, remove it.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      feb9f55c
    • Máté Eckl's avatar
      netfilter: nft_tproxy: Fix missing-braces warning · 90d827f0
      Máté Eckl authored
      This patch fixes a warning reported by the kbuild test robot (from linux-next
      tree):
         net/netfilter/nft_tproxy.c: In function 'nft_tproxy_eval_v6':
      >> net/netfilter/nft_tproxy.c:85:9: warning: missing braces around initializer [-Wmissing-braces]
           struct in6_addr taddr = {0};
                  ^
         net/netfilter/nft_tproxy.c:85:9: warning: (near initialization for 'taddr.in6_u') [-Wmissing-braces]
      
      This warning is actually caused by a gcc bug already resolved in newer
      versions (kbuild used 4.9) so this kind of initialization is omitted and
      memset is used instead.
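
      A minimal userspace sketch of the memset-based form, assuming the
      POSIX definition of struct in6_addr from <netinet/in.h>. The helper
      name zero_in6_addr() is hypothetical:

```c
#include <netinet/in.h>
#include <string.h>

/*
 * Return an all-zero IPv6 address without a brace initializer, so
 * compilers that warn on "= {0}" for nested aggregates stay quiet.
 */
static struct in6_addr zero_in6_addr(void)
{
	struct in6_addr taddr;

	memset(&taddr, 0, sizeof(taddr));
	return taddr;
}
```

      Both forms produce the same all-zero address; only the warning
      behaviour of older compilers differs.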
      
      Fixes: 4ed8eb65 ("netfilter: nf_tables: Add native tproxy support")
      Signed-off-by: Máté Eckl <ecklm94@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      90d827f0
    • Dmitry V. Levin's avatar
      netfilter: uapi: fix linux/netfilter/nf_osf.h userspace compilation errors · cdb2f401
      Dmitry V. Levin authored
      Move inclusion of <linux/ip.h> and <linux/tcp.h> from
      linux/netfilter/xt_osf.h to linux/netfilter/nf_osf.h to fix
      the following linux/netfilter/nf_osf.h userspace compilation errors:
      
      /usr/include/linux/netfilter/nf_osf.h:59:24: error: 'MAX_IPOPTLEN' undeclared here (not in a function)
        struct nf_osf_opt opt[MAX_IPOPTLEN];
      /usr/include/linux/netfilter/nf_osf.h:64:17: error: field 'ip' has incomplete type
        struct iphdr   ip;
      /usr/include/linux/netfilter/nf_osf.h:65:18: error: field 'tcp' has incomplete type
        struct tcphdr   tcp;
      
      Fixes: bfb15f2a ("netfilter: extract Passive OS fingerprint infrastructure from xt_osf")
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cdb2f401
    • Harsha Sharma's avatar
      netfilter: nft_ct: make l3 protocol field optional for timeout object · 3206c516
      Harsha Sharma authored
      If l3 protocol value is not specified for ct timeout object then use the
      value from nft_ctx protocol family.
      Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      3206c516
    • Máté Eckl's avatar
      netfilter: doc: Add nf_tables part in tproxy.txt · 1bfc2bc7
      Máté Eckl authored
      Recently, transparent proxy support has been added to nf_tables so that
      this document should be updated with the new information.
      
      - Nft commands are added as alternatives to iptables ones.
      - The link for a patched iptables is removed as it is already part of
        the mainline iptables implementation (and the link is dead).
      - tcprdr is added as an example implementation of a transparent proxy
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: KOVACS Krisztian <hidden@sch.bme.hu>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Máté Eckl <ecklm94@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      1bfc2bc7
    • Michal Hocko's avatar
      netfilter: x_tables: do not fail xt_alloc_table_info too easilly · a148ce15
      Michal Hocko authored
      eacd86ca ("net/netfilter/x_tables.c: use kvmalloc()
      in xt_alloc_table_info()") has unintentionally fortified
      xt_alloc_table_info allocation when __GFP_NORETRY has been dropped from
      the vmalloc fallback. Later on there was a syzbot report that this
      can lead to OOM killer invocations when tables are too large and
      0537250f ("netfilter: x_tables: make allocation less aggressive")
      has been merged to restore the original behavior. Georgi Nikolov however
      noticed that he is not able to install his iptables anymore so this can
      be seen as a regression.
      
      The primary argument for 0537250f was that this allocation path
      shouldn't really trigger the OOM killer and kill innocent tasks. On the
      other hand the interface requires root and as such should allow what the
      admin asks for. Root inside a namespace makes this more complicated
      because those might not be trusted in general. If they are not then such
      namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
      and replace it by __GFP_ACCOUNT to enforce memcg constraints on it.
      
      Fixes: 0537250f ("netfilter: x_tables: make allocation less aggressive")
      Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a148ce15
    • Florian Westphal's avatar
      netfilter: conntrack: fix removal of conntrack entries when l4tracker is removed · 1c117d3b
      Florian Westphal authored
      nf_ct_l4proto_unregister_one() leaves conntracks added by
      to-be-removed tracker behind, nf_ct_l4proto_unregister has to iterate
      for each protocol to be removed.
      
      v2: call nf_ct_iterate_destroy without holding nf_ct_proto_mutex.
      
      Fixes: 2c41f33c ("netfilter: move table iteration out of netns exit paths")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      1c117d3b
    • Florian Westphal's avatar
      netfilter: nf_tables: don't prevent event handler from device cleanup on netns exit · 6a48de01
      Florian Westphal authored
      When a network namespace exits, the nf_tables pernet_ops will remove all rules.
      However, there is one caveat:
      
      Base chains that register ingress hooks will cause use-after-free:
      device is already gone at that point.
      
      The device event handlers prevent this from happening:
      netns exit synthesizes unregister events for all devices.
      
      However, an improper fix for a race condition made the notifiers a no-op
      in case they get called from netns exit path, so revert that part.
      
      This is safe now as the previous patch fixed nf_tables pernet ops
      and device notifier initialisation ordering.
      
      Fixes: 0a2cf5ee ("netfilter: nf_tables: close race between netns exit and rmmod")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6a48de01
    • Florian Westphal's avatar
      netfilter: nf_tables: fix register ordering · d209df3e
      Florian Westphal authored
      We must register nfnetlink ops last, as that exposes nf_tables to
      userspace.  Without this, we could theoretically get nfnetlink request
      before net->nft state has been initialized.
      
      Fixes: 99633ab2 ("netfilter: nf_tables: complete net namespace support")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d209df3e
    • Florian Westphal's avatar
      netfilter: fix memory leaks on netlink_dump_start error · 3e673b23
      Florian Westphal authored
      Shaochun Chen points out we leak dumper filter state allocations
      stored in dump_control->data in case there is an error before netlink sets
      cb_running (after which ->done will be called at some point).
      
      In order to fix this, add .start functions and move allocations there.
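
      The ownership rule can be modelled with a toy userspace sketch. This
      is a simplified stand-in, not the netlink API: struct dump_cb,
      dump_start(), filter_start() and filter_done() are hypothetical, and
      the early_error flag simulates a failure before the dump is running:

```c
#include <stdlib.h>

/* Toy model of a dump control block with start/done callbacks. */
struct dump_cb {
	int (*start)(struct dump_cb *cb);
	void (*done)(struct dump_cb *cb);
	void *data;
};

/* Allocate the filter state only once the dump is really starting. */
static int filter_start(struct dump_cb *cb)
{
	cb->data = calloc(1, 64);
	return cb->data ? 0 : -1;
}

/* Release the filter state; paired with a successful filter_start(). */
static void filter_done(struct dump_cb *cb)
{
	free(cb->data);
	cb->data = NULL;
}

/*
 * Toy dump_start(): an early error returns before ->start runs, so there
 * is nothing allocated yet and nothing for the caller to leak.
 */
static int dump_start(struct dump_cb *cb, int early_error)
{
	if (early_error)
		return -1;
	if (cb->start && cb->start(cb) < 0)
		return -1;
	/* ... the dump itself would run here ... */
	if (cb->done)
		cb->done(cb);
	return 0;
}
```

      Had the caller allocated cb->data before calling dump_start(), the
      early-error path would return without anyone freeing it.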
      
      Same pattern as used in commit 90fd131a
      ("netfilter: nf_tables: move dumper state allocation into ->start").
      Reported-by: shaochun chen <cscnull@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      3e673b23
    • Taehee Yoo's avatar
      netfilter: nft_set: fix allocation size overflow in privsize callback. · 4ef360dd
      Taehee Yoo authored
      To determine the allocation size of a set, ->privsize is invoked.
      It uses both desc->size and the size of each data structure of the set.
      desc->size is the number of elements given by the user. It is of type
      u32, so the upper limit on set elements is 4294967295. The return type
      of ->privsize is also u32, hence an overflow can occur.
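
      The wrap-around can be reproduced in plain C. This is a sketch with
      hypothetical helper names and an assumed per-element size of 16 bytes,
      not the nf_tables code; widening the result to 64 bits is shown as one
      way to avoid the overflow:

```c
#include <stdint.h>

/* 32-bit result: the product is truncated modulo 2^32. */
static uint32_t privsize_u32(uint32_t nelems, uint32_t elem_size)
{
	return nelems * elem_size;
}

/* 64-bit result: widen one operand first so the product cannot wrap. */
static uint64_t privsize_u64(uint32_t nelems, uint32_t elem_size)
{
	return (uint64_t)nelems * elem_size;
}
```

      With 4294967295 elements of 16 bytes, the 32-bit version yields
      4294967280 while the true size is 68719476720 bytes, so the
      allocation would be far too small for the requested set.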
      
      test commands:
         %nft add table ip filter
         %nft add set ip filter hash1 { type ipv4_addr \; size 4294967295 \; }
         %nft list ruleset
      
      splat looks like:
      [ 1239.202910] kasan: CONFIG_KASAN_INLINE enabled
      [ 1239.208788] kasan: GPF could be caused by NULL-ptr deref or user memory access
      [ 1239.217625] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [ 1239.219329] CPU: 0 PID: 1603 Comm: nft Not tainted 4.18.0-rc5+ #7
      [ 1239.229091] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
      [ 1239.229091] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
      [ 1239.229091] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
      [ 1239.229091] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
      [ 1239.229091] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
      [ 1239.229091] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
      [ 1239.229091] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
      [ 1239.229091] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
      [ 1239.229091] FS:  00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
      [ 1239.229091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1239.229091] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
      [ 1239.229091] Call Trace:
      [ 1239.229091]  ? nft_hash_remove+0xf0/0xf0 [nf_tables_set]
      [ 1239.229091]  ? memset+0x1f/0x40
      [ 1239.229091]  ? __nla_reserve+0x9f/0xb0
      [ 1239.229091]  ? memcpy+0x34/0x50
      [ 1239.229091]  nf_tables_dump_set+0x9a1/0xda0 [nf_tables]
      [ 1239.229091]  ? __kmalloc_reserve.isra.29+0x2e/0xa0
      [ 1239.229091]  ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
      [ 1239.229091]  ? nf_tables_commit+0x2c60/0x2c60 [nf_tables]
      [ 1239.229091]  netlink_dump+0x470/0xa20
      [ 1239.229091]  __netlink_dump_start+0x5ae/0x690
      [ 1239.229091]  nft_netlink_dump_start_rcu+0xd1/0x160 [nf_tables]
      [ 1239.229091]  nf_tables_getsetelem+0x2e5/0x4b0 [nf_tables]
      [ 1239.229091]  ? nft_get_set_elem+0x440/0x440 [nf_tables]
      [ 1239.229091]  ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
      [ 1239.229091]  ? nf_tables_dump_obj_done+0x70/0x70 [nf_tables]
      [ 1239.229091]  ? nla_parse+0xab/0x230
      [ 1239.229091]  ? nft_get_set_elem+0x440/0x440 [nf_tables]
      [ 1239.229091]  nfnetlink_rcv_msg+0x7f0/0xab0 [nfnetlink]
      [ 1239.229091]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
      [ 1239.229091]  ? debug_show_all_locks+0x290/0x290
      [ 1239.229091]  ? sched_clock_cpu+0x132/0x170
      [ 1239.229091]  ? find_held_lock+0x39/0x1b0
      [ 1239.229091]  ? sched_clock_local+0x10d/0x130
      [ 1239.229091]  netlink_rcv_skb+0x211/0x320
      [ 1239.229091]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
      [ 1239.229091]  ? netlink_ack+0x7b0/0x7b0
      [ 1239.229091]  ? ns_capable_common+0x6e/0x110
      [ 1239.229091]  nfnetlink_rcv+0x2d1/0x310 [nfnetlink]
      [ 1239.229091]  ? nfnetlink_rcv_batch+0x10f0/0x10f0 [nfnetlink]
      [ 1239.229091]  ? netlink_deliver_tap+0x829/0x930
      [ 1239.229091]  ? lock_acquire+0x265/0x2e0
      [ 1239.229091]  netlink_unicast+0x406/0x520
      [ 1239.509725]  ? netlink_attachskb+0x5b0/0x5b0
      [ 1239.509725]  ? find_held_lock+0x39/0x1b0
      [ 1239.509725]  netlink_sendmsg+0x987/0xa20
      [ 1239.509725]  ? netlink_unicast+0x520/0x520
      [ 1239.509725]  ? _copy_from_user+0xa9/0xc0
      [ 1239.509725]  __sys_sendto+0x21a/0x2c0
      [ 1239.509725]  ? __ia32_sys_getpeername+0xa0/0xa0
      [ 1239.509725]  ? retint_kernel+0x10/0x10
      [ 1239.509725]  ? sched_clock_cpu+0x132/0x170
      [ 1239.509725]  ? find_held_lock+0x39/0x1b0
      [ 1239.509725]  ? lock_downgrade+0x540/0x540
      [ 1239.509725]  ? up_read+0x1c/0x100
      [ 1239.509725]  ? __do_page_fault+0x763/0x970
      [ 1239.509725]  ? retint_user+0x18/0x18
      [ 1239.509725]  __x64_sys_sendto+0x177/0x180
      [ 1239.509725]  do_syscall_64+0xaa/0x360
      [ 1239.509725]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1239.509725] RIP: 0033:0x7f5a8f468e03
      [ 1239.509725] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb d0 0f 1f 84 00 00 00 00 00 83 3d 49 c9 2b 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8
      [ 1239.509725] RSP: 002b:00007ffd78d0b778 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [ 1239.509725] RAX: ffffffffffffffda RBX: 00007ffd78d0c890 RCX: 00007f5a8f468e03
      [ 1239.509725] RDX: 0000000000000034 RSI: 00007ffd78d0b7e0 RDI: 0000000000000003
      [ 1239.509725] RBP: 00007ffd78d0b7d0 R08: 00007f5a8f15c160 R09: 000000000000000c
      [ 1239.509725] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd78d0b7e0
      [ 1239.509725] R13: 0000000000000034 R14: 00007f5a8f9aff60 R15: 00005648040094b0
      [ 1239.509725] Modules linked in: nf_tables_set nf_tables nfnetlink ip_tables x_tables
      [ 1239.670713] ---[ end trace 39375adcda140f11 ]---
      [ 1239.676016] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
      [ 1239.682834] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
      [ 1239.705108] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
      [ 1239.711115] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
      [ 1239.719269] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
      [ 1239.727401] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
      [ 1239.735530] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
      [ 1239.743658] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
      [ 1239.751785] FS:  00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
      [ 1239.760993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1239.767560] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
      [ 1239.775679] Kernel panic - not syncing: Fatal exception
      [ 1239.776630] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [ 1239.776630] Rebooting in 5 seconds..
      
      Fixes: 20a69341 ("netfilter: nf_tables: add netlink set API")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      4ef360dd
    • Florian Westphal's avatar
      netfilter: ip6t_rpfilter: set F_IFACE for linklocal addresses · da786717
      Florian Westphal authored
      Roman reports that DHCPv6 client no longer sees replies from server
      due to
      
      ip6tables -t raw -A PREROUTING -m rpfilter --invert -j DROP
      
      rule.  We need to set the F_IFACE flag for linklocal addresses, they
      are scoped per-device.
      
      Fixes: 47b7e7f8 ("netfilter: don't set F_IFACE on ipv6 fib lookups")
      Reported-by: Roman Mamedov <rm@romanrm.net>
      Tested-by: Roman Mamedov <rm@romanrm.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      da786717
    • Matteo Croce's avatar
      ipvs: don't show negative times in ip_vs_conn · b71ed54d
      Matteo Croce authored
      Since commit 500462a9 ("timers: Switch to a non-cascading wheel"),
      a timer can fire up to 12.5% later than the scheduled interval.

      IPVS has two handlers, /proc/net/ip_vs_conn and /proc/net/ip_vs_conn_sync,
      which show the remaining time before a connection expires.
      The default expire time for a connection is 60 seconds, and the
      expiration timer can fire even 4 seconds later than the scheduled time.
      The remaining time is calculated by subtracting jiffies from the scheduled
      expiration time, and it's shown as a huge number when the timer fires late,
      since both values are unsigned.
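
      The effect of the unsigned subtraction can be sketched in userspace C.
      The helper names are hypothetical, and clamping at zero is shown as
      one possible remedy rather than as the actual patch. The signed cast
      assumes the usual two's complement conversion done by gcc and clang:

```c
/* Remaining ticks as a plain unsigned difference: wraps once now > expires. */
static unsigned long remaining_unsigned(unsigned long expires,
					unsigned long now)
{
	return expires - now;
}

/* Interpret the difference as signed and never report less than zero. */
static long remaining_clamped(unsigned long expires, unsigned long now)
{
	long delta = (long)(expires - now);

	return delta < 0 ? 0 : delta;
}
```

      A timer that fires 4 ticks late makes remaining_unsigned() return a
      value just below the maximum unsigned long, while remaining_clamped()
      reports 0.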
      
      This can confuse scripts and tools which rely on it, like ipvsadm:
      
          root@mcroce-redhat:~# while ipvsadm -lc |grep SYN_RECV; do sleep 1 ; done
          TCP 00:05  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 00:04  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 00:03  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 00:02  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 00:01  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 00:00  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 68719476:44 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 68719476:43 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 68719476:42 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 68719476:41 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 68719476:40 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
          TCP 68719476:39 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Acked-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b71ed54d