• Daniel Borkmann's avatar
    bpf: Fix toctou on read-only map's constant scalar tracking · 353050be
    Daniel Borkmann authored
    Commit a23740ec ("bpf: Track contents of read-only maps as scalars") is
    checking whether maps are read-only both from BPF program side and user space
    side, and then, given their content is constant, reading out their data via
    map->ops->map_direct_value_addr() which is then subsequently used as known
    scalar value for the register, that is, it is marked as __mark_reg_known()
    with the read value at verification time. Before a23740ec, the register
    content was marked as an unknown scalar so the verifier could not make any
    assumptions about the map content.
    
    The current implementation however is prone to a TOCTOU race, meaning, the
    value read as known scalar for the register is not guaranteed to be exactly
    the same at a later point when the program is executed, and as such, the
    prior made assumptions of the verifier with regards to the program will be
    invalid which can cause issues such as OOB access, etc.
    
    While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
    specified at map creation time, the map->frozen property is initially set to
    false for the map given the map value needs to be populated, e.g. for global
    data sections. Once complete, the loader "freezes" the map from user space
    such that no subsequent updates/deletes are possible anymore. For the rest
    of the lifetime of the map, this freeze one-time trigger cannot be undone
    anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
    cmd calls which would update/delete map entries will be rejected with -EPERM
    since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
    means that pending update/delete map entries must still complete before this
    guarantee is given. This corner case is not an issue for loaders since they
    create and prepare such program private map in successive steps.
    
    However, a malicious user is able to trigger this TOCTOU race in two different
    ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
    used to expand the competition interval, so that map_update_elem() can modify
    the contents of the map after map_freeze() and bpf_prog_load() were executed.
    This works, because userfaultfd halts the parallel thread which triggered a
    map_update_elem() at the time where we copy key/value from the user buffer and
    this already passed the FMODE_CAN_WRITE capability test given at that time the
    map was not "frozen". Then, the main thread performs the map_freeze() and
    bpf_prog_load(), and once that had completed successfully, the other thread
    is woken up to complete the pending map_update_elem() which then changes the
    map content. For ii) the idea of the batched update is similar, meaning, when
    there are a large number of updates to be processed, it can increase the
    competition interval between the two. It is therefore possible in practice to
    modify the contents of the map after executing map_freeze() and bpf_prog_load().
    
    One way to fix both i) and ii) at the same time is to expand the use of the
    map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap()
    support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf:
    Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
    the rationale to make a writable mmap()'ing of a map mutually exclusive with
    read-only freezing. The counter indicates writable mmap() mappings and then
    prevents/fails the freeze operation. Its semantics can be expanded beyond
    just mmap() by generally indicating ongoing write phases. This would essentially
    span any parallel regular and batched flavor of update/delete operation and
    then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
    the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
    last pending writes have completed via bpf_map_write_active() test. Once the
    map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
    only then we are really guaranteed to use the map's data as known constants.
    For map->frozen being set and pending writes in process of still being completed
    we fall back to marking that register as unknown scalar so we don't end up
    making assumptions about it. With this, both TOCTOU reproducers from i) and
    ii) are fixed.
    
    Note that the map->writecnt has been converted into a atomic64 in the fix in
    order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
    map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
    the freeze_mutex over entire map update/delete operations in syscall side
    would not be possible due to then causing everything to be serialized.
    Similarly, something like synchronize_rcu() after setting map->frozen to wait
    for update/deletes to complete is not possible either since it would also
    have to span the user copy which can sleep. On the libbpf side, this won't
    break d66562fb ("libbpf: Add BPF object skeleton support") as the
    anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
    mmap()-ed memory where for .rodata it's non-writable.
    
    Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars")
    Reported-by: w1tcher.bupt@gmail.com
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    353050be
syscall.c 115 KB