• Nick Piggin's avatar
    radix-tree: fix small lockless radix-tree bug · 643b52b9
    Nick Piggin authored
    We shrink a radix tree when its root node has only one child, in the left
    most slot.  The child becomes the new root node.  To perform this
    operation in a manner compatible with concurrent lockless lookups, we
    atomically switch the root pointer from the parent to its child.
    
    However a concurrent lockless lookup may now have loaded a pointer to the
    parent (and is presently deciding what to do next).  For this reason, we
    also have to keep the parent node in a valid state after shrinking the
    tree, until the next RCU grace period -- otherwise this lookup with the
    parent pointer may not do the right thing.  Notably, we need to keep the
    child in the left most slot there in case that is requested by the lookup.
    
    This is all pretty standard RCU stuff.  It is worth repeating because in
    my eagerness to obey the radix tree node constructor scheme, I had broken
    it by zeroing the radix tree node before the grace period.
    
    What could happen is that a lookup can load the parent pointer, then
    decide it wants to follow the left most child slot, only to find the slot
    contained NULL due to the concurrent shrinker having zeroed the parent
    node before waiting for a grace period.  The lookup would return a false
    negative as a result.
    
    Fix it by doing that clearing in the RCU callback.  I would normally want
    to rip out the constructor entirely, but radix tree nodes are one of those
    places where they make sense (only few cachelines will be touched soon
    after allocation).
    
    This was never actually found in any lockless pagecache testing or by the
    test harness, but by seeing the odd problem with my scalable vmap rewrite.
     I have not tickled the test harness into reproducing it yet, but I'll
    keep working at it.
    
    Fortunately, it is not a problem anywhere lockless pagecache is used in
    mainline kernels (pagecache probe is not a guarantee, and brd does not
    have concurrent lookups and deletes).
    Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
    Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    643b52b9
radix-tree.c 27.4 KB