    bpf: Enable IRQ after irq_work_raise() completes in unit_alloc() · 566f6de3
    Hou Tao authored
    During a stress test of qp-trie, bpf_mem_alloc() returned NULL
    unexpectedly: all qp-trie operations were initiated from bpf syscalls
    and free memory was still available. bpf_obj_new() has the same
    problem, as shown by the following selftest.
    
    The failure is caused by preemption. irq_work_raise() first invokes
    irq_work_claim() to mark the irq work as pending and then invokes
    __irq_work_queue_local() to raise an IPI. If the task invoking
    irq_work_raise() is preempted by another task between these two
    steps, unit_alloc() may return NULL for the preempting task, as shown
    below:
    
    task A         task B
    
    unit_alloc()
      // low_watermark = 32
      // free_cnt = 31 after alloc
      irq_work_raise()
        // mark irq work as IRQ_WORK_PENDING
        irq_work_claim()
    
    	       // task B preempts task A
    	       unit_alloc()
    	         // free_cnt = 30 after alloc
    	         // irq work is already PENDING,
    	         // so just return
    	         irq_work_raise()
    	       // does unit_alloc() 30-times
    	       ......
    	       unit_alloc()
    	         // free_cnt = 0 before alloc
    	         return NULL
    
    Fix it by enabling IRQ only after irq_work_raise() completes. An
    alternative fix is to use a preempt_{disable|enable}_notrace() pair,
    but it may add extra overhead. Another feasible fix is to disable
    preemption or IRQ only around the invocation of irq_work_queue(), but
    that can't handle the case when c->low_watermark is 1.
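The interleaving above can be reproduced with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: struct cache_model, take_one(), task_b_alloc(), run_scenario() and the refill batch of 32 are hypothetical names and values chosen for illustration, and "IRQs enabled" is modelled simply as "the pending IPI handler may run and another task may preempt".

```c
#include <stdbool.h>

/* Hypothetical model of one per-CPU cache; not the kernel's struct. */
struct cache_model {
	int free_cnt;       /* objects currently available */
	int low_watermark;  /* refill is requested when free_cnt drops below this */
	int refill_batch;   /* assumption: objects added by one refill */
	bool work_pending;  /* models IRQ_WORK_PENDING set by irq_work_claim() */
	bool ipi_raised;    /* models the IPI raised by __irq_work_queue_local() */
};

/* Models the irq work handler: runs only if an IPI was actually raised. */
static void run_irq_work(struct cache_model *c)
{
	if (!c->ipi_raised)
		return;
	c->free_cnt += c->refill_batch;
	c->ipi_raised = false;
	c->work_pending = false;
}

/* Take one object; returns false when the cache is empty. */
static bool take_one(struct cache_model *c)
{
	if (c->free_cnt == 0)
		return false;
	c->free_cnt--;
	return true;
}

/* One complete, uninterrupted allocation by task B. */
static bool task_b_alloc(struct cache_model *c)
{
	bool ok = take_one(c);

	/* Claim and queue only if the work is not already pending. */
	if (c->free_cnt < c->low_watermark && !c->work_pending) {
		c->work_pending = true;
		c->ipi_raised = true;
	}
	run_irq_work(c); /* IRQs are on again: a raised IPI is delivered */
	return ok;
}

/*
 * Task A allocates once, then task B allocates 40 times.
 * Returns how many of task B's allocations failed.
 */
static int run_scenario(bool irqs_off_during_raise)
{
	struct cache_model c = {
		.free_cnt = 32, .low_watermark = 32, .refill_batch = 32,
	};
	int failures = 0;

	take_one(&c);          /* task A: free_cnt = 31 after alloc */
	c.work_pending = true; /* task A: irq_work_claim() */

	if (!irqs_off_during_raise) {
		/* Old ordering: task B preempts between claim and queue. */
		for (int i = 0; i < 40; i++)
			if (!task_b_alloc(&c))
				failures++;
	}

	c.ipi_raised = true;   /* task A: __irq_work_queue_local() */
	run_irq_work(&c);      /* IRQs restored: refill runs */

	if (irqs_off_during_raise) {
		/* New ordering: task B can only run after the refill. */
		for (int i = 0; i < 40; i++)
			if (!task_b_alloc(&c))
				failures++;
	}
	return failures;
}
```

In this model run_scenario(false) counts 9 failed allocations for task B (31 succeed, then the cache stays empty because the work is marked pending but no IPI was raised), while run_scenario(true) counts none.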
    
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>