    mlxsw: pci: Recycle received packet upon allocation failure · 75963576
    Ido Schimmel authored
    When the driver fails to allocate a new Rx buffer, it passes an empty Rx
    descriptor (contains zero address and size) to the device and marks it
    as invalid by setting the skb pointer in the descriptor's metadata to
    NULL.
    
    After processing enough Rx descriptors, the driver will try to process
    the invalid descriptor, but will return immediately seeing that the skb
    pointer is NULL. Since the driver no longer passes new Rx descriptors to
    the device, the Rx queue will eventually become full and the device will
    start to drop packets.
    
    Fix this by recycling the received packet if allocation of the new
    packet failed. This means that allocation is no longer performed at the
    end of the Rx routine, but at the start, before tearing down the DMA
    mapping of the received packet.
    
    Remove the comment about the descriptor being zeroed as it is no longer
    correct. This is OK because we either use the descriptor as-is (when
    recycling) or overwrite its address and size fields with those of the
    newly allocated Rx buffer.
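    
    The allocate-first, recycle-on-failure ordering described above can be
    sketched in user space as follows. This is only an illustration under
    stated assumptions, not the driver's code: struct rx_buf, struct rx_desc,
    rx_buf_alloc(), rx_process(), rx_ring_demo() and the fail_allocs counter
    are hypothetical stand-ins for the skb, the Rx descriptor, the allocation
    helper, the Rx routine and failslab, and DMA mapping is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4

/* Hypothetical stand-in for an skb; not the kernel's struct sk_buff. */
struct rx_buf {
	unsigned char data[64];
};

/* Hypothetical stand-in for an Rx descriptor and its metadata. */
struct rx_desc {
	struct rx_buf *buf;	/* plays the role of the skb pointer */
	size_t size;		/* plays the role of the size field */
};

/* Number of upcoming allocations to fail, loosely mimicking failslab. */
static int fail_allocs;

static struct rx_buf *rx_buf_alloc(void)
{
	if (fail_allocs > 0) {
		fail_allocs--;
		return NULL;
	}
	return calloc(1, sizeof(struct rx_buf));
}

/*
 * Process one completed descriptor. Returns the received buffer for the
 * caller to consume, or NULL if the packet was dropped and its buffer
 * was recycled.
 */
static struct rx_buf *rx_process(struct rx_desc *desc)
{
	struct rx_buf *received = desc->buf;
	struct rx_buf *fresh;

	/* Invariant the fix maintains: a descriptor always owns a buffer. */
	assert(received != NULL);

	/* Allocate first, before giving up the received buffer. */
	fresh = rx_buf_alloc();
	if (fresh == NULL)
		return NULL;	/* recycle: the descriptor is reused as-is */

	/* Success: overwrite the descriptor with the new buffer. */
	desc->buf = fresh;
	desc->size = sizeof(fresh->data);
	return received;
}

/*
 * Make `rounds` passes over a small ring while failing the first
 * `failures` replacement allocations. Stores the number of packets
 * handed to the caller in *delivered and returns how many descriptors
 * still hold a buffer at the end.
 */
static int rx_ring_demo(int rounds, int failures, int *delivered)
{
	struct rx_desc ring[RING_SIZE];
	int valid = 0;
	int i;

	*delivered = 0;
	fail_allocs = 0;
	for (i = 0; i < RING_SIZE; i++) {
		ring[i].buf = rx_buf_alloc();
		if (ring[i].buf == NULL)
			abort();
		ring[i].size = sizeof(ring[i].buf->data);
	}

	fail_allocs = failures;
	for (i = 0; i < rounds * RING_SIZE; i++) {
		struct rx_buf *pkt = rx_process(&ring[i % RING_SIZE]);

		if (pkt != NULL) {
			(*delivered)++;
			free(pkt);
		}
	}

	for (i = 0; i < RING_SIZE; i++) {
		if (ring[i].buf != NULL)
			valid++;
		free(ring[i].buf);
	}
	return valid;
}
```

    In this sketch a failed allocation costs one dropped packet, but the
    descriptor keeps its buffer, so the ring never accumulates invalid
    entries and keeps accepting packets once allocations succeed again.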
    
    The issue was discovered when a process ("perf") consumed too much
    memory and put the system under memory pressure. It can be reproduced by
    injecting slab allocation failures [1]. After the fix, the Rx queue no
    longer comes to a halt.
    
    [1]
     # echo 10 > /sys/kernel/debug/failslab/times
     # echo 1000 > /sys/kernel/debug/failslab/interval
     # echo 100 > /sys/kernel/debug/failslab/probability
    
     FAULT_INJECTION: forcing a failure.
     name failslab, interval 1000, probability 100, space 0, times 8
     [...]
     Call Trace:
      <IRQ>
      dump_stack_lvl+0x34/0x44
      should_fail.cold+0x32/0x37
      should_failslab+0x5/0x10
      kmem_cache_alloc_node+0x23/0x190
      __alloc_skb+0x1f9/0x280
      __netdev_alloc_skb+0x3a/0x150
      mlxsw_pci_rdq_skb_alloc+0x24/0x90
      mlxsw_pci_cq_tasklet+0x3dc/0x1200
      tasklet_action_common.constprop.0+0x9f/0x100
      __do_softirq+0xb5/0x252
      irq_exit_rcu+0x7a/0xa0
      common_interrupt+0x83/0xa0
      </IRQ>
      asm_common_interrupt+0x1e/0x40
     RIP: 0010:cpuidle_enter_state+0xc8/0x340
     [...]
     mlxsw_spectrum2 0000:06:00.0: Failed to alloc skb for RDQ
    
    Fixes: eda6500a ("mlxsw: Add PCI bus implementation")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Link: https://lore.kernel.org/r/20211024064014.1060919-1-idosch@idosch.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>