    [PATCH] iseries_veth: Try to avoid pathological reset behaviour · 58c5900b
    Michael Ellerman authored
    The iseries_veth driver contains a state machine which is used to manage
    how connections are set up and negotiated between LPARs.
    
    If one side of a connection resets for some reason, the two LPARs can get
    stuck in a race to re-establish the connection, which can end with one or
    both ends declaring the connection dead. In practice this happens in
    approximately 8 out of 10 resets, although it is rare for connections to
    be reset.
    
    (See an example trace here: http://michael.ellerman.id.au/files/misc/veth-trace.html)
    
    The core of the problem is that the end that resets the connection doesn't
    wait for the other end to become aware of the reset. So the resetting end
    starts setting the connection back up, and then receives a reset from the
    other end (which is the response to the initial reset). And so on.
    
    We're severely limited in what we can do to fix this. The protocol between
    LPARs is essentially fixed, as we have to interoperate with both OS/400
    and old Linux drivers, which also means we need a fix that only changes
    the code on one end.
    
    The only fix I've found, given that, is to just blindly sleep for a bit when
    resetting the connection, in the hope that the other end will get itself
    sorted.  Needless to say I'd love it if someone has a better idea.
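
    The idea can be sketched in userspace C as follows. This is only an
    illustration of "sleep before re-setup", not the code in this patch: the
    state names, struct conn, conn_reset() and the 500 ms delay are all
    hypothetical, and a real driver would use a kernel sleep primitive rather
    than nanosleep().

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical connection states; these names are illustrative only and
 * are not taken from iseries_veth.c. */
enum conn_state { CONN_DOWN, CONN_NEGOTIATING, CONN_UP };

struct conn {
    enum conn_state state;
    unsigned int resets;   /* number of resets handled so far */
};

/* Assumed settle delay; the commit message does not give a value. */
#define RESET_SETTLE_MS 500

/* Sleep for roughly ms milliseconds (may return early on a signal). */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Tear the connection down, then wait before starting setup again so the
 * peer has a chance to notice the reset and finish its own teardown. */
static void conn_reset(struct conn *c, long settle_ms)
{
    assert(c != NULL && settle_ms >= 0);

    c->state = CONN_DOWN;
    c->resets++;
    sleep_ms(settle_ms);          /* blind wait: no handshake with the peer */
    c->state = CONN_NEGOTIATING;  /* only now begin re-setup */
}
```

    A caller would invoke conn_reset(&c, RESET_SETTLE_MS) from its reset
    path. The wait is unconditional because the fixed protocol gives the
    resetting end no way to ask whether the peer has seen the reset.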
    
    This does work: I've so far been unable to get it to break, whereas without
    the fix a reset of one end will lead to a dead connection ~8/10 times.

    Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
    Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
iseries_veth.c 36.8 KB