    habanalabs/gaudi: fix a race condition causing DMAR error · 17ab47d2
    Yuri Nudelman authored
    There is a rare race condition in the CB completion mechanism that can
    occur under very high command submission pressure.
    The preconditions for this to happen are:
    
     1. There must be enough command submissions to exhaust the
        pre-allocated patched CB pool. From that point on, new patched
        CBs are allocated as submissions arrive.
     2. The CB size must be exactly (128*n + 104)B for some n, i.e. the CB
        ends 24B before the end of a cache line.
    
    The flow:
    
     1. Two command buffers are being completed on different streams at
        the same time. Denote them CB1 and CB2.
     2. Each command buffer is injected with two messages, 16B each: one
        for an HBW update of the completion queue, the other to raise an
        interrupt.
     3. Assume CB1 updated the completion queue and raised the interrupt.
     4. Assume CB2 updated the completion queue but did not raise the
        interrupt yet.
     5. The host receives the interrupt. It goes over the completion queue,
        sees two completions (CB1 and CB2), and releases them both.
     6. CB2 executes its last command. That command is split between two
        cache lines, so to read its last 8B the device has to access host
        memory again. But CB2 has already been released, so the access
        causes a DMAR error.
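
    The arithmetic behind the size precondition and step 6 can be sketched
    as follows. This is an illustration only, not driver code: the names
    CACHE_LINE_SIZE, MSG_SIZE and last_msg_straddles() are hypothetical,
    and it assumes the stated CB size is measured before the two 16B
    messages are appended and that the CB starts on a cache line boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128u /* bytes, per the (128*n + 104) precondition */
#define MSG_SIZE        16u  /* bytes, size of each injected message */

/*
 * Return true if the second (last) injected message would be split across
 * two cache lines, given the CB size before the two messages are appended.
 * Hypothetical helper, assuming the CB starts cache-line aligned.
 */
static inline bool last_msg_straddles(uint32_t cb_size)
{
	uint32_t first_byte, last_byte;

	/* Guard against wrap-around when adding the two messages. */
	assert(cb_size <= UINT32_MAX - 2u * MSG_SIZE);

	first_byte = cb_size + MSG_SIZE;        /* start of the last message */
	last_byte = first_byte + MSG_SIZE - 1u; /* its final byte */

	return first_byte / CACHE_LINE_SIZE != last_byte / CACHE_LINE_SIZE;
}
```

    For a 104B CB the last message occupies bytes 120..135, so its final
    8B land in the next cache line, which matches step 6.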
    
    The solution to this problem is simply to make sure the last two
    commands in the CB are always in the same cache line, using NOP padding.
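
    A minimal sketch of how such padding could be computed, under stated
    assumptions: the helper name tail_nop_padding() and the constants are
    hypothetical, the CB is assumed to start on a cache line boundary, and
    padding up to the next boundary is just one simple policy that
    satisfies the requirement. The actual driver change may compute it
    differently.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128u             /* bytes */
#define MSG_SIZE        16u              /* bytes per injected message */
#define TAIL_SIZE       (2u * MSG_SIZE)  /* the two last commands together */

/*
 * Return the number of NOP padding bytes to place before the two last
 * messages so that both of them fit inside a single cache line.
 * Hypothetical helper; cb_size is the CB size before the messages.
 */
static inline uint32_t tail_nop_padding(uint32_t cb_size)
{
	uint32_t offset = cb_size % CACHE_LINE_SIZE;

	/* The tail must be able to fit in one cache line at all. */
	assert(TAIL_SIZE <= CACHE_LINE_SIZE);

	/* Tail already fits in the current cache line: no padding needed. */
	if (offset + TAIL_SIZE <= CACHE_LINE_SIZE)
		return 0;

	/* Otherwise push the tail to the start of the next cache line. */
	return CACHE_LINE_SIZE - offset;
}
```

    With a 104B CB this yields 24B of padding, so the two messages start
    at offset 128 and are read together with a single cache line fetch.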
    Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
    Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
    Signed-off-by: Oded Gabbay <ogabbay@kernel.org>