    net: dqs: add NIC stall detector based on BQL · 6025b913
    Jakub Kicinski authored
    softnet_data->time_squeeze is sometimes used as a proxy for
    host overload or indication of scheduling problems. In practice
    this statistic is very noisy and has hard-to-grasp units -
    e.g. is 10 squeezes a second to be expected, or high?
    
    Delaying network (NAPI) processing leads to drops on NIC queues
    but also RTT bloat, impacting pacing and CA decisions.
    Stalls are a little hard to detect on the Rx side, because
    there may simply not have been any packets received in a given
    period of time. Packet timestamps help a little bit, but
    again we don't know if packets are stale because we're
    not keeping up or because someone (*cough* cgroups)
    disabled IRQs for a long time.
    
    We can, however, use Tx as a proxy for Rx stalls. Most drivers
    use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
    On the Tx side we know exactly when packets get queued,
    and completed, so there is no uncertainty.
    
    This patch adds stall checks to BQL. Why BQL? Because
    it's a convenient place to add such checks, already
    called by most drivers, and it has copious free space
    in its structures (this patch adds no extra cache
    references or dirtying to the fast path).
    
    The algorithm takes one parameter, the max delay (AKA stall
    threshold), and increments a counter whenever NAPI got delayed
    for at least that amount of time. It also records the length
    of the longest stall.
    
    To be precise, every time NAPI has not polled for at least
    stall_thrs, we check if there were any Tx packets queued
    between the last NAPI run and now - stall_thrs/2. If there
    were, the delay counts as a stall.
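
The check described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from the patch: the struct and the sd_* helpers are hypothetical names, time is an abstract tick counter that never wraps, and the stall length is assumed to be measured from the first Tx packet queued since the previous poll. The real detector keeps its state inside the BQL structures instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for the sketch; none of these names come from the patch. */
struct stall_detector {
	unsigned long stall_thrs;   /* threshold in ticks, 0 disables the check */
	unsigned long last_poll;    /* tick of the previous NAPI poll */
	unsigned long first_queued; /* tick of the first Tx queued since that poll */
	bool pending;               /* was anything queued since that poll? */
	unsigned long stall_cnt;    /* number of stalls detected */
	unsigned long stall_max;    /* longest stall seen, in ticks */
};

static void sd_init(struct stall_detector *sd, unsigned long stall_thrs,
		    unsigned long now)
{
	sd->stall_thrs = stall_thrs;
	sd->last_poll = now;
	sd->first_queued = 0;
	sd->pending = false;
	sd->stall_cnt = 0;
	sd->stall_max = 0;
}

/* Called when a Tx packet is queued; remembers only the oldest one. */
static void sd_tx_queued(struct stall_detector *sd, unsigned long now)
{
	if (!sd->pending) {
		sd->pending = true;
		sd->first_queued = now;
	}
}

/* Called when NAPI polls and reaps Tx completions. */
static void sd_napi_poll(struct stall_detector *sd, unsigned long now)
{
	/* The sketch assumes monotonic ticks and ignores counter wraparound. */
	assert(now >= sd->last_poll);

	/*
	 * A stall needs both: no poll for at least stall_thrs, and a Tx
	 * packet queued between the last poll and now - stall_thrs/2.
	 */
	if (sd->stall_thrs &&
	    now - sd->last_poll >= sd->stall_thrs &&
	    sd->pending &&
	    sd->first_queued <= now - sd->stall_thrs / 2) {
		unsigned long len = now - sd->first_queued;

		sd->stall_cnt++;
		if (len > sd->stall_max)
			sd->stall_max = len;
	}

	sd->pending = false;
	sd->last_poll = now;
}
```

With a threshold of 8 ticks, a packet queued at tick 1 and first polled at tick 10 counts as one stall of length 9. A long gap with nothing queued does not count, and neither does a packet queued only 3 ticks before a late poll.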
    
    Unlike the classic Tx watchdog this mechanism does not
    ignore stalls caused by Tx being disabled, or loss of link.
    I don't think the check is worth the complexity, and
    a stall is a stall, whether due to host overload, flow
    control, link down... doesn't matter much to the application.
    
    We have been running this detector in production at Meta
    for 2 years, with a threshold of 8ms. It's the lowest
    value where false positives become rare. There's still
    a constant stream of reported stalls (especially without
    the ksoftirqd deferral patches reverted); those who like
    their stall metrics to be 0 may prefer a higher value.
    
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
net-sysfs.c 51.5 KB