    enic: prevent waking up stopped tx queues over watchdog reset · 0f905225
    Firo Yang authored
    In recent months, our customer reported several kernel crashes, all
    preceded by the following message:
    NETDEV WATCHDOG: eth2 (enic): transmit queue 0 timed out
    Error message of one of those crashes:
    BUG: unable to handle kernel paging request at ffffffffa007e090
    
    After analyzing several vmcores, I found that most of the crashes were
    caused by memory corruption, and that all of the corrupted memory areas
    had been overwritten with network packet data. I also found that the
    tx queues were enabled during the watchdog reset.
    
    After going through the source code, I found that in enic_stop(),
    the tx queues stopped by netif_tx_disable() could be woken up during
    the small time window between netif_tx_disable() and
    napi_disable() by the following code path:
    napi_poll->
      enic_poll_msix_wq->
         vnic_cq_service->
            enic_wq_service->
               netif_wake_subqueue(enic->netdev, q_number)->
                  test_and_clear_bit(__QUEUE_STATE_DRV_XOFF, &txq->state)
    In turn, the upper network stack could queue skbs to the ENIC NIC through
    enic_hard_start_xmit(), which might introduce a race condition.
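
    The window described above can be illustrated with a small user-space
    model. This is only a sketch: struct model_txq and the model_*()
    helpers are hypothetical stand-ins, not kernel APIs, and the
    "NAPI first" ordering is shown as one plausible way to close the
    window, not as a description of the actual diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not the kernel's real types. */
struct model_txq {
    bool drv_xoff;     /* stands in for __QUEUE_STATE_DRV_XOFF */
    bool napi_enabled; /* whether the completion poll may still run */
};

/* Models netif_tx_disable(): mark the queue as stopped. */
static void model_tx_disable(struct model_txq *q)
{
    assert(q != NULL);
    q->drv_xoff = true;
}

/* Models napi_disable(): afterwards the poll routine can no longer run. */
static void model_napi_disable(struct model_txq *q)
{
    assert(q != NULL);
    q->napi_enabled = false;
}

/* Models the wq completion poll ending in a wake of the sub-queue. */
static void model_poll(struct model_txq *q)
{
    assert(q != NULL);
    if (!q->napi_enabled)
        return;              /* poll is disabled, nothing can wake the queue */
    if (q->drv_xoff)
        q->drv_xoff = false; /* like test_and_clear_bit() on the XOFF bit */
}

/* Returns true if the queue ends up awake after the modeled stop sequence. */
bool queue_awake_after_stop(bool napi_disable_first)
{
    struct model_txq q = { .drv_xoff = false, .napi_enabled = true };

    if (napi_disable_first) {
        model_napi_disable(&q);
        model_tx_disable(&q);
        model_poll(&q);      /* a late poll attempt is a no-op */
    } else {
        model_tx_disable(&q);
        model_poll(&q);      /* poll fires inside the window */
        model_napi_disable(&q);
    }
    return !q.drv_xoff;
}
```

    In this model, queue_awake_after_stop(false) leaves the queue awake
    even though it was stopped, while queue_awake_after_stop(true) leaves
    it stopped.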
    
    Our customer confirmed that this kind of kernel crash has not occurred
    for over 90 days since they applied this patch.
    Signed-off-by: Firo Yang <firo.yang@suse.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
enic_main.c 76.5 KB