1. 04 Sep, 2020 3 commits
    • Alexei Starovoitov's avatar
      Merge branch 'hashmap_iter_bucket_lock_fix' · e6135df4
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Currently, the bpf hashmap iterator takes a bucket_lock, a spin_lock,
      before visiting each element in the bucket. This will cause a deadlock
      if a map update/delete operates on an element with the same
      bucket id of the visited map.
      
      To avoid the deadlock, let us just use rcu_read_lock instead of
      bucket_lock. This may result in visiting stale elements, missing some elements,
      or repeating some elements, if concurrent map delete/update happens for the
      same map. I think using rcu_read_lock is a reasonable compromise.
      For users caring stale/missing/repeating element issues, bpf map batch
      access syscall interface can be used.
      
      Note that another approach is during bpf_iter link stage, we check
      whether the iter program might be able to do update/delete to the visited
      map. If it is, reject the link_create. Verifier needs to record whether
      an update/delete operation happens for each map for this approach.
      I just feel this checking is too specialized, hence still prefer
      rcu_read_lock approach.
      
      Patch #1 has the kernel implementation and Patch #2 added a selftest
      which can trigger deadlock without Patch #1.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e6135df4
    • Yonghong Song's avatar
      selftests/bpf: Add bpf_{update, delete}_map_elem in hashmap iter program · 4daab713
      Yonghong Song authored
      Added bpf_{updata,delete}_map_elem to the very map element the
      iter program is visiting. Due to rcu protection, the visited map
      elements, although stale, should still contain correct values.
        $ ./test_progs -n 4/18
        #4/18 bpf_hash_map:OK
        #4 bpf_iter:OK
        Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200902235341.2001534-1-yhs@fb.com
      4daab713
    • Yonghong Song's avatar
      bpf: Do not use bucket_lock for hashmap iterator · dc0988bb
      Yonghong Song authored
      Currently, for hashmap, the bpf iterator will grab a bucket lock, a
      spinlock, before traversing the elements in the bucket. This can ensure
      all bpf visted elements are valid. But this mechanism may cause
      deadlock if update/deletion happens to the same bucket of the
      visited map in the program. For example, if we added bpf_map_update_elem()
      call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
      we will have the following deadlock:
      
        ============================================
        WARNING: possible recursive locking detected
        5.9.0-rc1+ #841 Not tainted
        --------------------------------------------
        test_progs/1750 is trying to acquire lock:
        ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410
      
        but task is already holding lock:
        ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&htab->buckets[i].raw_lock);
          lock(&htab->buckets[i].raw_lock);
      
         *** DEADLOCK ***
         ...
        Call Trace:
         dump_stack+0x78/0xa0
         __lock_acquire.cold.74+0x209/0x2e3
         lock_acquire+0xba/0x380
         ? htab_map_update_elem+0x1cf/0x410
         ? __lock_acquire+0x639/0x20c0
         _raw_spin_lock_irqsave+0x3b/0x80
         ? htab_map_update_elem+0x1cf/0x410
         htab_map_update_elem+0x1cf/0x410
         ? lock_acquire+0xba/0x380
         bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
         ? find_held_lock+0x34/0xa0
         bpf_iter_run_prog+0x81/0x16e
         __bpf_hash_map_seq_show+0x145/0x180
         bpf_seq_read+0xff/0x3d0
         vfs_read+0xad/0x1c0
         ksys_read+0x5f/0xe0
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ...
      
      The bucket_lock first grabbed in seq_ops->next() called by bpf_seq_read(),
      and then grabbed again in htab_map_update_elem() in the bpf program, causing
      deadlocks.
      
      Actually, we do not need bucket_lock here, we can just use rcu_read_lock()
      similar to netlink iterator where the rcu_read_{lock,unlock} likes below:
       seq_ops->start():
           rcu_read_lock();
       seq_ops->next():
           rcu_read_unlock();
           /* next element */
           rcu_read_lock();
       seq_ops->stop();
           rcu_read_unlock();
      
      Compared to old bucket_lock mechanism, if concurrent updata/delete happens,
      we may visit stale elements, miss some elements, or repeat some elements.
      I think this is a reasonable compromise. For users wanting to avoid
      stale, missing/repeated accesses, bpf_map batch access syscall interface
      can be used.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200902235340.2001375-1-yhs@fb.com
      dc0988bb
  2. 03 Sep, 2020 1 commit
  3. 01 Sep, 2020 1 commit
  4. 31 Aug, 2020 1 commit
  5. 28 Aug, 2020 6 commits
  6. 27 Aug, 2020 12 commits
  7. 26 Aug, 2020 16 commits
    • David S. Miller's avatar
      Merge branch 'net-fix-netpoll-crash-with-bnxt' · 5875568a
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      net: fix netpoll crash with bnxt
      
      Rob run into crashes when using XDP on bnxt. Upon investigation
      it turns out that during driver reconfig irq core produces
      a warning message when IRQs are requested. This triggers netpoll,
      which in turn accesses uninitialized driver state. Same crash can
      also be triggered on this platform by changing the number of rings.
      
      Looks like we have two missing pieces here, netif_napi_add() has
      to make sure we start out with netpoll blocked. The driver also
      has to be more careful about when napi gets enabled.
      
      Tested XDP and channel count changes, the warning message no longer
      causes a crash. Not sure if the memory barriers added in patch 1
      are necessary, but it seems we should have them.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5875568a
    • Jakub Kicinski's avatar
      bnxt: don't enable NAPI until rings are ready · 96ecdcc9
      Jakub Kicinski authored
      Netpoll can try to poll napi as soon as napi_enable() is called.
      It crashes trying to access a doorbell which is still NULL:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       CPU: 59 PID: 6039 Comm: ethtool Kdump: loaded Tainted: G S                5.9.0-rc1-00469-g5fd99b5d-dirty #26
       RIP: 0010:bnxt_poll+0x121/0x1c0
       Code: c4 20 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 86 a0 01 00 00 41 23 85 18 01 00 00 49 8b 96 a8 01 00 00 0d 00 00 00 24 <89> 02
      41 f6 45 77 02 74 cb 49 8b ae d8 01 00 00 31 c0 c7 44 24 1a
        netpoll_poll_dev+0xbd/0x1a0
        __netpoll_send_skb+0x1b2/0x210
        netpoll_send_udp+0x2c9/0x406
        write_ext_msg+0x1d7/0x1f0
        console_unlock+0x23c/0x520
        vprintk_emit+0xe0/0x1d0
        printk+0x58/0x6f
        x86_vector_activate.cold+0xf/0x46
        __irq_domain_activate_irq+0x50/0x80
        __irq_domain_activate_irq+0x32/0x80
        __irq_domain_activate_irq+0x32/0x80
        irq_domain_activate_irq+0x25/0x40
        __setup_irq+0x2d2/0x700
        request_threaded_irq+0xfb/0x160
        __bnxt_open_nic+0x3b1/0x750
        bnxt_open_nic+0x19/0x30
        ethtool_set_channels+0x1ac/0x220
        dev_ethtool+0x11ba/0x2240
        dev_ioctl+0x1cf/0x390
        sock_do_ioctl+0x95/0x130
      Reported-by: default avatarRob Sherwood <rsher@fb.com>
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      96ecdcc9
    • Jakub Kicinski's avatar
      net: disable netpoll on fresh napis · 96e97bc0
      Jakub Kicinski authored
      napi_disable() makes sure to set the NAPI_STATE_NPSVC bit to prevent
      netpoll from accessing rings before init is complete. However, the
      same is not done for fresh napi instances in netif_napi_add(),
      even though we expect NAPI instances to be added as disabled.
      
      This causes crashes during driver reconfiguration (enabling XDP,
      changing the channel count) - if there is any printk() after
      netif_napi_add() but before napi_enable().
      
      To ensure memory ordering is correct we need to use RCU accessors.
      Reported-by: default avatarRob Sherwood <rsher@fb.com>
      Fixes: 2d8bff12 ("netpoll: Close race condition between poll_one_napi and napi_disable")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      96e97bc0
    • Ido Schimmel's avatar
      ipv4: Silence suspicious RCU usage warning · 7f6f32bb
      Ido Schimmel authored
      fib_info_notify_update() is always called with RTNL held, but not from
      an RCU read-side critical section. This leads to the following warning
      [1] when the FIB table list is traversed with
      hlist_for_each_entry_rcu(), but without a proper lockdep expression.
      
      Since modification of the list is protected by RTNL, silence the warning
      by adding a lockdep expression which verifies RTNL is held.
      
      [1]
       =============================
       WARNING: suspicious RCU usage
       5.9.0-rc1-custom-14233-g2f26e122d62f #129 Not tainted
       -----------------------------
       net/ipv4/fib_trie.c:2124 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       1 lock held by ip/834:
        #0: ffffffff85a3b6b0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0
      
       stack backtrace:
       CPU: 0 PID: 834 Comm: ip Not tainted 5.9.0-rc1-custom-14233-g2f26e122d62f #129
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
       Call Trace:
        dump_stack+0x100/0x184
        lockdep_rcu_suspicious+0x143/0x14d
        fib_info_notify_update+0x8d1/0xa60
        __nexthop_replace_notify+0xd2/0x290
        rtm_new_nexthop+0x35e2/0x5946
        rtnetlink_rcv_msg+0x4f7/0xbd0
        netlink_rcv_skb+0x17a/0x480
        rtnetlink_rcv+0x22/0x30
        netlink_unicast+0x5ae/0x890
        netlink_sendmsg+0x98a/0xf40
        ____sys_sendmsg+0x879/0xa00
        ___sys_sendmsg+0x122/0x190
        __sys_sendmsg+0x103/0x1d0
        __x64_sys_sendmsg+0x7d/0xb0
        do_syscall_64+0x32/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fde28c3be57
       Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51
      c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      RSP: 002b:00007ffc09330028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fde28c3be57
      RDX: 0000000000000000 RSI: 00007ffc09330090 RDI: 0000000000000003
      RBP: 000000005f45f911 R08: 0000000000000001 R09: 00007ffc0933012c
      R10: 0000000000000076 R11: 0000000000000246 R12: 0000000000000001
      R13: 00007ffc09330290 R14: 00007ffc09330eee R15: 00005610e48ed020
      
      Fixes: 1bff1a0c ("ipv4: Add function to send route updates")
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f6f32bb
    • Xie He's avatar
      drivers/net/wan/lapbether: Set network_header before transmitting · 91244d10
      Xie He authored
      Set the skb's network_header before it is passed to the underlying
      Ethernet device for transmission.
      
      This patch fixes the following issue:
      
      When we use this driver with AF_PACKET sockets, there would be error
      messages of:
         protocol 0805 is buggy, dev (Ethernet interface name)
      printed in the system "dmesg" log.
      
      This is because skbs passed down to the Ethernet device for transmission
      don't have their network_header properly set, and the dev_queue_xmit_nit
      function in net/core/dev.c complains about this.
      
      Reason of setting the network_header to this place (at the end of the
      Ethernet header, and at the beginning of the Ethernet payload):
      
      Because when this driver receives an skb from the Ethernet device, the
      network_header is also set at this place.
      
      Cc: Martin Schiller <ms@dev.tdt.de>
      Signed-off-by: default avatarXie He <xie.he.0141@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91244d10
    • Florian Westphal's avatar
      mptcp: free acked data before waiting for more memory · 1cec170d
      Florian Westphal authored
      After subflow lock is dropped, more wmem might have been made available.
      
      This fixes a deadlock in mptcp_connect.sh 'mmap' mode: wmem is exhausted.
      But as the mptcp socket holds on to already-acked data (for retransmit)
      no wakeup will occur.
      
      Using 'goto restart' calls mptcp_clean_una(sk) which will free pages
      that have been acked completely in the mean time.
      
      Fixes: fb529e62 ("mptcp: break and restart in case mptcp sndbuf is full")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1cec170d
    • Vinicius Costa Gomes's avatar
      taprio: Fix using wrong queues in gate mask · 09e31cf0
      Vinicius Costa Gomes authored
      Since commit 9c66d156 ("taprio: Add support for hardware
      offloading") there's a bit of inconsistency when offloading schedules
      to the hardware:
      
      In software mode, the gate masks are specified in terms of traffic
      classes, so if say "sched-entry S 03 20000", it means that the traffic
      classes 0 and 1 are open for 20us; when taprio is offloaded to
      hardware, the gate masks are specified in terms of hardware queues.
      
      The idea here is to fix hardware offloading, so schedules in hardware
      and software mode have the same behavior. What's needed to do is to
      map traffic classes to queues when applying the offload to the driver.
      
      Fixes: 9c66d156 ("taprio: Add support for hardware offloading")
      Signed-off-by: default avatarVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      09e31cf0
    • YueHaibing's avatar
      net: cdc_ncm: Fix build error · 5fd99b5d
      YueHaibing authored
      If USB_NET_CDC_NCM is y and USB_NET_CDCETHER is m, build fails:
      
      drivers/net/usb/cdc_ncm.o:(.rodata+0x1d8): undefined reference to `usbnet_cdc_update_filter'
      
      Select USB_NET_CDCETHER for USB_NET_CDC_NCM to fix this.
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Fixes: e10dcb1b ("net: cdc_ncm: hook into set_rx_mode to admit multicast traffic")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5fd99b5d
    • Yi Li's avatar
      net: hns3: Fix for geneve tx checksum bug · a156998f
      Yi Li authored
      when skb->encapsulation is 0, skb->ip_summed is CHECKSUM_PARTIAL
      and it is udp packet, which has a dest port as the IANA assigned.
      the hardware is expected to do the checksum offload, but the
      hardware will not do the checksum offload when udp dest port is
      6081.
      
      This patch fixes it by doing the checksum in software.
      Reported-by: default avatarLi Bing <libing@winhong.com>
      Signed-off-by: default avatarYi Li <yili@winhong.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a156998f
    • David S. Miller's avatar
      Merge branch 'bnxt_en-Bug-fixes' · 0a3445b8
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Bug fixes.
      
      This set of driver patches include bug fixes for ethtool get channels,
      ethtool statistics, ethtool NVRAM, AER recovery, a firmware reset issue
      that could potentially crash, hwmon temperature reporting issue on VF,
      and 2 fixes for regressions introduced by the recent user-defined RSS
      map feature.
      
      Please queue patches 1 to 6 for -stable.  Thanks.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a3445b8
    • Michael Chan's avatar
      bnxt_en: Setup default RSS map in all scenarios. · b43b9f53
      Michael Chan authored
      The recent changes to support user-defined RSS map assume that RX
      rings are always reserved and the default RSS map is set after the
      RX rings are successfully reserved.  If the firmware spec is older
      than 1.6.1, no ring reservations are required and the default RSS
      map is not setup at all.  In another scenario where the fw Resource
      Manager is older, RX rings are not reserved and we also end up with
      no valid RSS map.
      
      Fix both issues in bnxt_need_reserve_rings().  In both scenarios
      described above, we don't need to reserve RX rings so we need to
      call this new function bnxt_check_rss_map_no_rmgr() to setup the
      default RSS map when needed.
      
      Without valid RSS map, the NIC won't receive packets properly.
      
      Fixes: 1667cbf6 ("bnxt_en: Add logical RSS indirection table structure.")
      Reviewed-by: default avatarVasundhara Volam <vasundhara-v.volam@broadcom.com>
      Reviewed-by: default avatarEdwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b43b9f53
    • Edwin Peer's avatar
      bnxt_en: init RSS table for Minimal-Static VF reservation · 5fa65524
      Edwin Peer authored
      There are no VF rings available during probe when the device is configured
      using the Minimal-Static reservation strategy. In this case, the RSS
      indirection table can only be initialized later, during bnxt_open_nic().
      However, this was not happening because the rings will already have been
      reserved via bnxt_init_dflt_ring_mode(), causing bnxt_need_reserve_rings()
      to return false in bnxt_reserve_rings() and bypass the RSS table init.
      
      Solve this by pushing the call to bnxt_set_dflt_rss_indir_tbl() into
      __bnxt_reserve_rings(), which is common to both paths and is called
      whenever ring configuration is changed. After doing this, the RSS table
      init that must be called from bnxt_init_one() happens implicitly via
      bnxt_set_default_rings(), necessitating doing the allocation earlier in
      order to avoid a null pointer dereference.
      
      Fixes: bd3191b5 ("bnxt_en: Implement ethtool -X to set indirection table.")
      Signed-off-by: default avatarEdwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5fa65524
    • Edwin Peer's avatar
      bnxt_en: fix HWRM error when querying VF temperature · 12cce90b
      Edwin Peer authored
      Firmware returns RESOURCE_ACCESS_DENIED for HWRM_TEMP_MONITORY_QUERY for
      VFs. This produces unpleasing error messages in the log when temp1_input
      is queried via the hwmon sysfs interface from a VF.
      
      The error is harmless and expected, so silence it and return unknown as
      the value. Since the device temperature is not particularly sensitive
      information, provide flexibility to change this policy in future by
      silencing the error rather than avoiding the HWRM call entirely for VFs.
      
      Fixes: cde49a42 ("bnxt_en: Add hwmon sysfs support to read temperature")
      Cc: Marc Smith <msmith626@gmail.com>
      Reported-by: default avatarMarc Smith <msmith626@gmail.com>
      Signed-off-by: default avatarEdwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      12cce90b
    • Michael Chan's avatar
      bnxt_en: Fix possible crash in bnxt_fw_reset_task(). · b148bb23
      Michael Chan authored
      bnxt_fw_reset_task() is run from a delayed workqueue.  The current
      code is not cancelling the workqueue in the driver's .remove()
      method and it can potentially crash if the device is removed with
      the workqueue still pending.
      
      The fix is to clear the BNXT_STATE_IN_FW_RESET flag and then cancel
      the delayed workqueue in bnxt_remove_one().  bnxt_queue_fw_reset_work()
      also needs to check that this flag is set before scheduling.  This
      will guarantee that no rescheduling will be done after it is cancelled.
      
      Fixes: 230d1f0d ("bnxt_en: Handle firmware reset.")
      Reviewed-by: default avatarVasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b148bb23
    • Vasundhara Volam's avatar
      bnxt_en: Fix PCI AER error recovery flow · df3875ec
      Vasundhara Volam authored
      When a PCI error is detected the PCI state could be corrupt, save
      the PCI state after initialization and restore it after the slot
      reset.
      
      Fixes: 6316ea6d ("bnxt_en: Enable AER support.")
      Signed-off-by: default avatarVasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df3875ec
    • Michael Chan's avatar
      bnxt_en: Fix ethtool -S statitics with XDP or TCs enabled. · 7de65149
      Michael Chan authored
      We are returning the wrong count for ETH_SS_STATS in get_sset_count()
      when XDP or TCs are enabled.  In a recent commit, we got rid of
      irrelevant counters when the ring is RX only or TX only, but we
      did not make the proper adjustments for the count.  As a result,
      when we have XDP or TCs enabled, we are returning an excess count
      because some of the rings are TX only.  This causes ethtool -S to
      display extra counters with no counter names.
      
      Fix bnxt_get_num_ring_stats() by not assuming that all rings will
      always have RX and TX counters in combined mode.
      
      Fixes: 125592fb ("bnxt_en: show only relevant ethtool stats for a TX or RX ring")
      Reviewed-by: default avatarVasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7de65149