• Vladimir Oltean's avatar
    net: enetc: keep RX ring consumer index in sync with hardware · 3a5d12c9
    Vladimir Oltean authored
    The RX rings have a producer index owned by hardware, where newly
    received frame buffers are placed, and a consumer index owned by
    software, where newly allocated buffers are placed, in expectation of
    hardware being able to place frame data in them.
    
    Hardware increments the producer index when a frame is received, however
    it is not allowed to increment the producer index to match the consumer
    index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
    BDs. Whenever the producer index matches the value of the consumer
    index, the ring has no unprocessed received frames and all BDs in the
    ring have been initialized/prepared by software, i.e. hardware owns all
    BDs in the ring.
    
    The code uses the next_to_clean variable to keep track of the producer
    index, and the next_to_use variable to keep track of the consumer index.
    
    The RX rings are seeded from enetc_refill_rx_ring, which is called from
    two places:
    
    1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
       i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:
    
    .ndo_open
    -> enetc_open
       -> enetc_setup_bdrs
          -> enetc_setup_rxbdr
             -> enetc_refill_rx_ring
    
    2. then during the data path processing, it is refilled with 16 buffers
       at a time:
    
    enetc_msix
    -> napi_schedule
       -> enetc_poll
          -> enetc_clean_rx_ring
             -> enetc_refill_rx_ring
    
    There is just one problem: the initial seeding done during .ndo_open
    updates just the producer index (ENETC_RBPIR) with 0, and the software
    next_to_clean and next_to_use variables. Notably, it will not update the
    consumer index to make the hardware aware of the newly added buffers.
    
    Wait, what? So how does it work?
    
    Well, the reset values of the producer index and of the consumer index
    of a ring are both zero. As per the description in the second paragraph,
    it means that the ring is full of buffers waiting for hardware to put
    frames in them, which by coincidence is almost true, because we have in
    fact seeded 511 buffers into the ring.
    
    But will the hardware attempt to access the 512th entry of the ring,
    which has an invalid BD in it? Well, no, because in order to do that, it
    would have to first populate the first 511 entries, and the NAPI
    enetc_poll will kick in by then. Eventually, after 16 processed slots
    have become available in the RX ring, enetc_clean_rx_ring will call
    enetc_refill_rx_ring and then will [ finally ] update the consumer index
    with the new software next_to_use variable. From now on, the
    next_to_clean and next_to_use variables are in sync with the producer
    and consumer ring indices.
    
    So the day is saved, right? Well, not quite. Freeing the memory
    allocated for the rings is done in:
    
    enetc_close
    -> enetc_clear_bdrs
       -> enetc_clear_rxbdr
          -> this just disables the ring
    -> enetc_free_rxtx_rings
       -> enetc_free_rx_ring
          -> sets next_to_clean and next_to_use to 0
    
    but again, nothing is committed to the hardware producer and consumer
    indices (yay!). The assumption is that the ring is disabled, so the
    indices don't matter anyway, and it's the responsibility of the "open"
    code path to set those up.
    
    .. Except that the "open" code path does not set those up properly.
    
    While initially, things almost work, during subsequent enetc_close ->
    enetc_open sequences, we have problems. To be precise, the enetc_open
    that is subsequent to enetc_close will again refill the ring with 511
    entries, but it will leave the consumer index untouched. Untouched
    means, of course, equal to the value it had before disabling the ring
    and draining the old buffers in enetc_close.
    
    But as mentioned, enetc_setup_rxbdr will at least update the producer
    index though, through this line of code:
    
    	enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);
    
    so at this stage we'll have:
    
    next_to_clean=0 (in hardware 0)
    next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)
    
    Again, the next_to_clean and producer index are in sync and set to
    correct values, so the driver manages to limp on. Eventually, 16 ring
    entries will be consumed by enetc_poll, and the savior
    enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
    update the hardware consumer ring based upon the new next_to_use.
    
    So.. it works?
    Well, by coincidence, it almost does, but there's a circumstance where
    enetc_clean_rx_ring won't be there to save us. If the previous value of
    the consumer index was 15, there's a problem, because the NAPI poll
    sequence will only issue a refill when 16 or more buffers have been
    consumed.
    
    It's easiest to illustrate this with an example:
    
    ip link set eno0 up
    ip addr add 192.168.100.1/24 dev eno0
    ping 192.168.100.1 -c 20 # ping this port from another board
    ip link set eno0 down
    ip link set eno0 up
    ping 192.168.100.1 -c 20 # ping it again from the same other board
    
    One by one:
    
    1. ip link set eno0 up
    -> calls enetc_setup_rxbdr:
       -> calls enetc_refill_rx_ring(511 buffers)
       -> next_to_clean=0 (in hw 0)
       -> next_to_use=511 (in hw 0)
    
    2. ping 192.168.100.1 -c 20 # ping this port from another board
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
    enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)
    
    20 packets transmitted, 20 packets received, 0% packet loss
    
    3. ip link set eno0 down
    enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)
    
    4. ip link set eno0 up
    -> calls enetc_setup_rxbdr:
       -> calls enetc_refill_rx_ring(511 buffers)
       -> next_to_clean=0 (in hw 0)
       -> next_to_use=511 (in hw 15)
    
    5. ping 192.168.100.1 -c 20 # ping it again from the same other board
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
    enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)
    
    20 packets transmitted, 12 packets received, 40% packet loss
    
    And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
    to 15 for that to happen), no nothing. The hardware enters the condition where
    the producer (14) + 1 is equal to the consumer (15) index, which makes it
    believe it has no more free buffers to put packets in, so it starts discarding
    them:
    
    ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
    NIC statistics:
         Rx ring  0 discarded frames: 8
    
    Summarized, if the interface receives between 16 and 32 (mod 512) frames
    and then there is a link flap, then the port will eventually die with no
    way to recover. If it receives less than 16 (mod 512) frames, then the
    initial NAPI poll [ before the link flap ] will not update the consumer
    index in hardware (it will remain zero) which will be ok when the buffers
    are later reinitialized. If more than 32 (mod 512) frames are received,
    the initial NAPI poll has the chance to refill the ring twice, updating
    the consumer index to at least 32. So after the link flap, the consumer
    index is still wrong, but the post-flap NAPI poll gets a chance to
    refill the ring once (because it passes through cleaned_cnt=15) and
    makes the consumer index be again back in sync with next_to_use.
    
    The solution to this problem is actually simple, we just need to write
    next_to_use into the hardware consumer index at enetc_open time, which
    always brings it back in sync after an initial buffer seeding process.
    
    The simpler thing would be to put the write to the consumer index into
    enetc_refill_rx_ring directly, but there are issues with the MDIO
    locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
    top-level and we use the unlocked enetc_wr_reg_hot, whereas in
    enetc_open, the enetc_lock_mdio() is not taken at the top level, but
    instead by each individual enetc_wr_reg, so we are forced to put an
    additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
    the code is left as a refactoring exercise.
    
    Fixes: d4fd0404 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
    Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    3a5d12c9
enetc.c 44.7 KB