1. 30 Jun, 2020 18 commits
  2. 28 Jun, 2020 4 commits
    • Alexei Starovoitov's avatar
      Merge branch 'fix-sockmap' · 2bdeb3ed
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      Fix a splat introduced by recent changes to avoid skipping ingress policy
      when kTLS is enabled. The RCU splat was introduced because in the non-TLS
      case the caller is wrapped in an rcu_read_lock/unlock. But, in the TLS
      case we have a reference to the psock and the caller did not wrap its
      call in rcu_read_lock/unlock.
      
      To fix extend the RCU section to include the redirect case which was
      missed. From v1->v2 I changed the location a bit to simplify the code
      some. See patch 1.
      
      But, then Martin asked why it was not needed in the non-TLS case. The
      answer for patch 1 was, as stated above, because the caller has the
      rcu read lock. However, there was still a missing case where a BPF
      user could in-theory line up a set of parameters to hit a case
      where the code was entered from strparser side from a different context
      then the initial caller. To hit this user would need a parser program
      to return value greater than skb->len then an ENOMEM error could happen
      in the strparser codepath triggering strparser to retry from a workqueue
      and without rcu_read_lock original caller used. See patch 2 for details.
      
      Finally, we don't actually have any selftests for parser returning a
      value geater than skb->len so add one in patch 3. This is especially
      needed because at least I don't have any code that uses the parser
      to return value greater than skb->len. So I wouldn't have caught any
      errors here in my own testing.
      
      Thanks, John
      
      v1->v2: simplify code in patch 1 some and add patches 2 and 3.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2bdeb3ed
    • John Fastabend's avatar
    • John Fastabend's avatar
      bpf, sockmap: RCU dereferenced psock may be used outside RCU block · 8025751d
      John Fastabend authored
      If an ingress verdict program specifies message sizes greater than
      skb->len and there is an ENOMEM error due to memory pressure we
      may call the rcv_msg handler outside the strp_data_ready() caller
      context. This is because on an ENOMEM error the strparser will
      retry from a workqueue. The caller currently protects the use of
      psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
      block.
      
      But, in above workqueue error case the psock is accessed outside
      the read_lock/unlock block of the caller. So instead of using
      psock directly we must do a look up against the sk again to
      ensure the psock is available.
      
      There is an an ugly piece here where we must handle
      the case where we paused the strp and removed the psock. On
      psock removal we first pause the strparser and then remove
      the psock. If the strparser is paused while an skb is
      scheduled on the workqueue the skb will be dropped on the
      flow and kfree_skb() is called. If the workqueue manages
      to get called before we pause the strparser but runs the rcvmsg
      callback after the psock is removed we will hit the unlikely
      case where we run the sockmap rcvmsg handler but do not have
      a psock. For now we will follow strparser logic and drop the
      skb on the floor with skb_kfree(). This is ugly because the
      data is dropped. To date this has not caused problems in practice
      because either the application controlling the sockmap is
      coordinating with the datapath so that skbs are "flushed"
      before removal or we simply wait for the sock to be closed before
      removing it.
      
      This patch fixes the describe RCU bug and dropping the skb doesn't
      make things worse. Future patches will improve this by allowing
      the normal case where skbs are not merged to skip the strparser
      altogether. In practice many (most?) use cases have no need to
      merge skbs so its both a code complexity hit as seen above and
      a performance issue. For example, in the Cilium case we always
      set the strparser up to return sbks 1:1 without any merging and
      have avoided above issues.
      
      Fixes: e91de6af ("bpf: Fix running sk_skb program types with ktls")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
      8025751d
    • John Fastabend's avatar
      bpf, sockmap: RCU splat with redirect and strparser error or TLS · 93dd5f18
      John Fastabend authored
      There are two paths to generate the below RCU splat the first and
      most obvious is the result of the BPF verdict program issuing a
      redirect on a TLS socket (This is the splat shown below). Unlike
      the non-TLS case the caller of the *strp_read() hooks does not
      wrap the call in a rcu_read_lock/unlock. Then if the BPF program
      issues a redirect action we hit the RCU splat.
      
      However, in the non-TLS socket case the splat appears to be
      relatively rare, because the skmsg caller into the strp_data_ready()
      is wrapped in a rcu_read_lock/unlock. Shown here,
      
       static void sk_psock_strp_data_ready(struct sock *sk)
       {
      	struct sk_psock *psock;
      
      	rcu_read_lock();
      	psock = sk_psock(sk);
      	if (likely(psock)) {
      		if (tls_sw_has_ctx_rx(sk)) {
      			psock->parser.saved_data_ready(sk);
      		} else {
      			write_lock_bh(&sk->sk_callback_lock);
      			strp_data_ready(&psock->parser.strp);
      			write_unlock_bh(&sk->sk_callback_lock);
      		}
      	}
      	rcu_read_unlock();
       }
      
      If the above was the only way to run the verdict program we
      would be safe. But, there is a case where the strparser may throw an
      ENOMEM error while parsing the skb. This is a result of a failed
      skb_clone, or alloc_skb_for_msg while building a new merged skb when
      the msg length needed spans multiple skbs. This will in turn put the
      skb on the strp_wrk workqueue in the strparser code. The skb will
      later be dequeued and verdict programs run, but now from a
      different context without the rcu_read_lock()/unlock() critical
      section in sk_psock_strp_data_ready() shown above. In practice
      I have not seen this yet, because as far as I know most users of the
      verdict programs are also only working on single skbs. In this case no
      merge happens which could trigger the above ENOMEM errors. In addition
      the system would need to be under memory pressure. For example, we
      can't hit the above case in selftests because we missed having tests
      to merge skbs. (Added in later patch)
      
      To fix the below splat extend the rcu_read_lock/unnlock block to
      include the call to sk_psock_tls_verdict_apply(). This will fix both
      TLS redirect case and non-TLS redirect+error case. Also remove
      psock from the sk_psock_tls_verdict_apply() function signature its
      not used there.
      
      [ 1095.937597] WARNING: suspicious RCU usage
      [ 1095.940964] 5.7.0-rc7-02911-g463bac5f #1 Tainted: G        W
      [ 1095.944363] -----------------------------
      [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
      [ 1095.950866]
      [ 1095.950866] other info that might help us debug this:
      [ 1095.950866]
      [ 1095.957146]
      [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
      [ 1095.961482] 1 lock held by test_sockmap/15970:
      [ 1095.964501]  #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
      [ 1095.968568]
      [ 1095.968568] stack backtrace:
      [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G        W         5.7.0-rc7-02911-g463bac5f #1
      [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [ 1095.980519] Call Trace:
      [ 1095.982191]  dump_stack+0x8f/0xd0
      [ 1095.984040]  sk_psock_skb_redirect+0xa6/0xf0
      [ 1095.986073]  sk_psock_tls_strp_read+0x1d8/0x250
      [ 1095.988095]  tls_sw_recvmsg+0x714/0x840 [tls]
      
      v2: Improve commit message to identify non-TLS redirect plus error case
          condition as well as more common TLS case. In the process I decided
          doing the rcu_read_unlock followed by the lock/unlock inside branches
          was unnecessarily complex. We can just extend the current rcu block
          and get the same effeective without the shuffling and branching.
          Thanks Martin!
      
      Fixes: e91de6af ("bpf: Fix running sk_skb program types with ktls")
      Reported-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
      93dd5f18
  3. 25 Jun, 2020 2 commits
  4. 24 Jun, 2020 4 commits
  5. 22 Jun, 2020 2 commits
  6. 21 Jun, 2020 7 commits
    • Rob Gill's avatar
      net: Add MODULE_DESCRIPTION entries to network modules · 67c20de3
      Rob Gill authored
      The user tool modinfo is used to get information on kernel modules, including a
      description where it is available.
      
      This patch adds a brief MODULE_DESCRIPTION to the following modules:
      
      9p
      drop_monitor
      esp4_offload
      esp6_offload
      fou
      fou6
      ila
      sch_fq
      sch_fq_codel
      sch_hhf
      Signed-off-by: default avatarRob Gill <rrobgill@protonmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      67c20de3
    • David Howells's avatar
      rxrpc: Fix notification call on completion of discarded calls · 0041cd5a
      David Howells authored
      When preallocated service calls are being discarded, they're passed to
      ->discard_new_call() to have the caller clean up any attached higher-layer
      preallocated pieces before being marked completed.  However, the act of
      marking them completed now invokes the call's notification function - which
      causes a problem because that function might assume that the previously
      freed pieces of memory are still there.
      
      Fix this by setting a dummy notification function on the socket after
      calling ->discard_new_call().
      
      This results in the following kasan message when the kafs module is
      removed.
      
      ==================================================================
      BUG: KASAN: use-after-free in afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707
      Write of size 1 at addr ffff8880946c39e4 by task kworker/u4:1/21
      
      CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.8.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: netns cleanup_net
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x18f/0x20d lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:383
       __kasan_report mm/kasan/report.c:513 [inline]
       kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
       afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707
       rxrpc_notify_socket+0x1db/0x5d0 net/rxrpc/recvmsg.c:40
       __rxrpc_set_call_completion.part.0+0x172/0x410 net/rxrpc/recvmsg.c:76
       __rxrpc_call_completed net/rxrpc/recvmsg.c:112 [inline]
       rxrpc_call_completed+0xca/0xf0 net/rxrpc/recvmsg.c:111
       rxrpc_discard_prealloc+0x781/0xab0 net/rxrpc/call_accept.c:233
       rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245
       afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110
       afs_net_exit+0x1bc/0x310 fs/afs/main.c:155
       ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
       cleanup_net+0x511/0xa50 net/core/net_namespace.c:603
       process_one_work+0x965/0x1690 kernel/workqueue.c:2269
       worker_thread+0x96/0xe10 kernel/workqueue.c:2415
       kthread+0x3b5/0x4a0 kernel/kthread.c:291
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293
      
      Allocated by task 6820:
       save_stack+0x1b/0x40 mm/kasan/common.c:48
       set_track mm/kasan/common.c:56 [inline]
       __kasan_kmalloc mm/kasan/common.c:494 [inline]
       __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:467
       kmem_cache_alloc_trace+0x153/0x7d0 mm/slab.c:3551
       kmalloc include/linux/slab.h:555 [inline]
       kzalloc include/linux/slab.h:669 [inline]
       afs_alloc_call+0x55/0x630 fs/afs/rxrpc.c:141
       afs_charge_preallocation+0xe9/0x2d0 fs/afs/rxrpc.c:757
       afs_open_socket+0x292/0x360 fs/afs/rxrpc.c:92
       afs_net_init+0xa6c/0xe30 fs/afs/main.c:125
       ops_init+0xaf/0x420 net/core/net_namespace.c:151
       setup_net+0x2de/0x860 net/core/net_namespace.c:341
       copy_net_ns+0x293/0x590 net/core/net_namespace.c:482
       create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:231
       ksys_unshare+0x43d/0x8e0 kernel/fork.c:2983
       __do_sys_unshare kernel/fork.c:3051 [inline]
       __se_sys_unshare kernel/fork.c:3049 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3049
       do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Freed by task 21:
       save_stack+0x1b/0x40 mm/kasan/common.c:48
       set_track mm/kasan/common.c:56 [inline]
       kasan_set_free_info mm/kasan/common.c:316 [inline]
       __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:455
       __cache_free mm/slab.c:3426 [inline]
       kfree+0x109/0x2b0 mm/slab.c:3757
       afs_put_call+0x585/0xa40 fs/afs/rxrpc.c:190
       rxrpc_discard_prealloc+0x764/0xab0 net/rxrpc/call_accept.c:230
       rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245
       afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110
       afs_net_exit+0x1bc/0x310 fs/afs/main.c:155
       ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
       cleanup_net+0x511/0xa50 net/core/net_namespace.c:603
       process_one_work+0x965/0x1690 kernel/workqueue.c:2269
       worker_thread+0x96/0xe10 kernel/workqueue.c:2415
       kthread+0x3b5/0x4a0 kernel/kthread.c:291
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293
      
      The buggy address belongs to the object at ffff8880946c3800
       which belongs to the cache kmalloc-1k of size 1024
      The buggy address is located 484 bytes inside of
       1024-byte region [ffff8880946c3800, ffff8880946c3c00)
      The buggy address belongs to the page:
      page:ffffea000251b0c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0xfffe0000000200(slab)
      raw: 00fffe0000000200 ffffea0002546508 ffffea00024fa248 ffff8880aa000c40
      raw: 0000000000000000 ffff8880946c3000 0000000100000002 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880946c3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880946c3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8880946c3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                             ^
       ffff8880946c3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880946c3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ==================================================================
      
      Reported-by: syzbot+d3eccef36ddbd02713e9@syzkaller.appspotmail.com
      Fixes: 5ac0d622 ("rxrpc: Fix missing notification")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0041cd5a
    • David S. Miller's avatar
      Merge tag 'ieee802154-for-davem-2020-06-19' of... · 7fcaf731
      David S. Miller authored
      Merge tag 'ieee802154-for-davem-2020-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
      
      Stefan Schmidt says:
      
      ====================
      pull-request: ieee802154 for net 2020-06-19
      
      An update from ieee802154 for your *net* tree.
      
      Just two small maintenance fixes to update references to the new project
      homepage.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7fcaf731
    • Hangbin Liu's avatar
      tc-testing: update geneve options match in tunnel_key unit tests · 54eeea0d
      Hangbin Liu authored
      Since iproute2 commit f72c3ad00f3b ("tc: m_tunnel_key: add options
      support for vxlan"), the geneve opt output use key word "geneve_opts"
      instead of "geneve_opt". To make compatibility for both old and new
      iproute2, let's accept both "geneve_opt" and "geneve_opts".
      Suggested-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@netronome.com>
      Tested-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      54eeea0d
    • Heiner Kallweit's avatar
      r8169: fix firmware not resetting tp->ocp_base · 89fbd26c
      Heiner Kallweit authored
      Typically the firmware takes care that tp->ocp_base is reset to its
      default value. That's not the case (at least) for RTL8117.
      As a result subsequent PHY access reads/writes the wrong page and
      the link is broken. Fix this be resetting tp->ocp_base explicitly.
      
      Fixes: 229c1e0d ("r8169: load firmware for RTL8168fp/RTL8117")
      Reported-by: default avatarAaron Ma <mapengyu@gmail.com>
      Tested-by: default avatarAaron Ma <mapengyu@gmail.com>
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89fbd26c
    • Dany Madden's avatar
      ibmvnic: continue to init in CRQ reset returns H_CLOSED · 8b40eb73
      Dany Madden authored
      Continue the reset path when partner adapter is not ready or H_CLOSED is
      returned from reset crq. This patch allows the CRQ init to proceed to
      establish a valid CRQ for traffic to flow after reset.
      Signed-off-by: default avatarDany Madden <drt@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8b40eb73
    • Shannon Nelson's avatar
      ionic: tame the watchdog timer on reconfig · b59eabd2
      Shannon Nelson authored
      Even with moving netif_tx_disable() to an earlier point when
      taking down the queues for a reconfiguration, we still end
      up with the occasional netdev watchdog Tx Timeout complaint.
      The old method of using netif_trans_update() works fine for
      queue 0, but has no effect on the remaining queues.  Using
      netif_device_detach() allows us to signal to the watchdog to
      ignore us for the moment.
      
      Fixes: beead698 ("ionic: Add the basic NDO callbacks for netdev support")
      Signed-off-by: default avatarShannon Nelson <snelson@pensando.io>
      Acked-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b59eabd2
  7. 20 Jun, 2020 3 commits
    • Willem de Bruijn's avatar
      selftests/net: report etf errors correctly · ca882609
      Willem de Bruijn authored
      The ETF qdisc can queue skbs that it could not pace on the errqueue.
      
      Address a few issues in the selftest
      
      - recv buffer size was too small, and incorrectly calculated
      - compared errno to ee_code instead of ee_errno
      - missed invalid request error type
      
      v2:
        - fix a few checkpatch --strict indentation warnings
      
      Fixes: ea6a5476 ("selftests/net: make so_txtime more robust to timer variance")
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ca882609
    • Thomas Falcon's avatar
      ibmveth: Fix max MTU limit · 5948378b
      Thomas Falcon authored
      The max MTU limit defined for ibmveth is not accounting for
      virtual ethernet buffer overhead, which is twenty-two additional
      bytes set aside for the ethernet header and eight additional bytes
      of an opaque handle reserved for use by the hypervisor. Update the
      max MTU to reflect this overhead.
      
      Fixes: d894be57 ("ethernet: use net core MTU range checking in more drivers")
      Fixes: 110447f8 ("ethernet: fix min/max MTU typos")
      Signed-off-by: default avatarThomas Falcon <tlfalcon@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5948378b
    • David S. Miller's avatar
      Merge branch 'several-fixes-for-indirect-flow_blocks-offload' · 95dcd892
      David S. Miller authored
      wenxu says:
      
      ====================
      several fixes for indirect flow_blocks offload
      
      v2:
      patch2: store the cb_priv of representor to the flow_block_cb->indr.cb_priv
      in the driver. And make the correct check with the statments
      this->indr.cb_priv == cb_priv
      
      patch4: del the driver list only in the indriect cleanup callbacks
      
      v3:
      add the cover letter and changlogs.
      
      v4:
      collapsed 1/4, 2/4, 4/4 in v3 to one fix
      Add the prepare patch 1 and 2
      
      v5:
      patch1: place flow_indr_block_cb_alloc() right before
      flow_indr_dev_setup_offload() to avoid moving flow_block_indr_init()
      
      This series fixes commit 1fac52da ("net: flow_offload: consolidate
      indirect flow_block infrastructure") that revists the flow_block
      infrastructure.
      
      patch #1 #2: prepare for fix patch #3
      add and use flow_indr_block_cb_alloc/remove function
      
      patch #3: fix flow_indr_dev_unregister path
      If the representor is removed, then identify the indirect flow_blocks
      that need to be removed by the release callback and the port representor
      structure. To identify the port representor structure, a new
      indr.cb_priv field needs to be introduced. The flow_block also needs to
      be removed from the driver list from the cleanup path
      
      patch#4 fix block->nooffloaddevcnt warning dmesg log.
      When a indr device add in offload success. The block->nooffloaddevcnt
      should be 0. After the representor go away. When the dir device go away
      the flow_block UNBIND operation with -EOPNOTSUPP which lead the warning
      demesg log.
      The block->nooffloaddevcnt should always count for indr block.
      even the indr block offload successful. The representor maybe
      gone away and the ingress qdisc can work in software mode.
      ====================
      Reviewed-by: default avatarSimon Horman <simon.horman@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      95dcd892