1. 03 Feb, 2018 7 commits
    • Daniel Borkmann's avatar
      bpf: avoid false sharing of map refcount with max_entries · 96d9b233
      Daniel Borkmann authored
      [ upstream commit be95a845 ]
      
      In addition to commit b2157399 ("bpf: prevent out-of-bounds
      speculation") also change the layout of struct bpf_map such that
      false sharing of fast-path members like max_entries is avoided
      when the maps reference counter is altered. Therefore enforce
      them to be placed into separate cachelines.
      
      pahole dump after change:
      
        struct bpf_map {
              const struct bpf_map_ops  * ops;                 /*     0     8 */
              struct bpf_map *           inner_map_meta;       /*     8     8 */
              void *                     security;             /*    16     8 */
              enum bpf_map_type          map_type;             /*    24     4 */
              u32                        key_size;             /*    28     4 */
              u32                        value_size;           /*    32     4 */
              u32                        max_entries;          /*    36     4 */
              u32                        map_flags;            /*    40     4 */
              u32                        pages;                /*    44     4 */
              u32                        id;                   /*    48     4 */
              int                        numa_node;            /*    52     4 */
              bool                       unpriv_array;         /*    56     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct user_struct *       user;                 /*    64     8 */
              atomic_t                   refcnt;               /*    72     4 */
              atomic_t                   usercnt;              /*    76     4 */
              struct work_struct         work;                 /*    80    32 */
              char                       name[16];             /*   112    16 */
              /* --- cacheline 2 boundary (128 bytes) --- */
      
              /* size: 128, cachelines: 2, members: 17 */
              /* sum members: 121, holes: 1, sum holes: 7 */
        };
      
      Now all entries in the first cacheline are read only throughout
      the life time of the map, set up once during map creation. Overall
      struct size and number of cachelines doesn't change from the
      reordering. struct bpf_map is usually first member and embedded
      in map structs in specific map implementations, so also avoid those
      members to sit at the end where it could potentially share the
      cacheline with first map values e.g. in the array since remote
      CPUs could trigger map updates just as well for those (easily
      dirtying members like max_entries intentionally as well) while
      having subsequent values in cache.
      
      Quoting from Google's Project Zero blog [1]:
      
        Additionally, at least on the Intel machine on which this was
        tested, bouncing modified cache lines between cores is slow,
        apparently because the MESI protocol is used for cache coherence
        [8]. Changing the reference counter of an eBPF array on one
        physical CPU core causes the cache line containing the reference
        counter to be bounced over to that CPU core, making reads of the
        reference counter on all other CPU cores slow until the changed
        reference counter has been written back to memory. Because the
        length and the reference counter of an eBPF array are stored in
        the same cache line, this also means that changing the reference
        counter on one physical CPU core causes reads of the eBPF array's
        length to be slow on other physical CPU cores (intentional false
        sharing).
      
      While this doesn't 'control' the out-of-bounds speculation through
      masking the index as in commit b2157399, triggering a manipulation
      of the map's reference counter is really trivial, so lets not allow
      to easily affect max_entries from it.
      
      Splitting to separate cachelines also generally makes sense from
      a performance perspective anyway in that fast-path won't have a
      cache miss if the map gets pinned, reused in other progs, etc out
      of control path, thus also avoids unintentional false sharing.
      
        [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.htmlSigned-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      96d9b233
    • Daniel Borkmann's avatar
      bpf: arsh is not supported in 32 bit alu thus reject it · 7dcda40e
      Daniel Borkmann authored
      [ upstream commit 7891a87e ]
      
      The following snippet was throwing an 'unknown opcode cc' warning
      in BPF interpreter:
      
        0: (18) r0 = 0x0
        2: (7b) *(u64 *)(r10 -16) = r0
        3: (cc) (u32) r0 s>>= (u32) r0
        4: (95) exit
      
      Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
      generation, not all of them do and interpreter does neither. We can
      leave existing ones and implement it later in bpf-next for the
      remaining ones, but reject this properly in verifier for the time
      being.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7dcda40e
    • Alexei Starovoitov's avatar
      bpf: introduce BPF_JIT_ALWAYS_ON config · 28c48674
      Alexei Starovoitov authored
      [ upstream commit 290af866 ]
      
      The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
      
      A quote from goolge project zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
      option that removes interpreter from the kernel in favor of JIT-only mode.
      So far eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of JITed program is randomized and code page is marked as read-only.
      In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      28c48674
    • Alexei Starovoitov's avatar
      bpf: fix bpf_tail_call() x64 JIT · 361fb048
      Alexei Starovoitov authored
      [ upstream commit 90caccdd ]
      
      - bpf prog_array just like all other types of bpf array accepts 32-bit index.
        Clarify that in the comment.
      - fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
      - tighten corresponding check in the interpreter to stay consistent
      
      The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
      in commit 96eabe7a in 4.14. Before that the map_flags would stay zero and
      though JIT code is wrong it will check bounds correctly.
      Hence two fixes tags. All other JITs don't have this problem.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation")
      Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper")
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      361fb048
    • Eric Dumazet's avatar
      x86: bpf_jit: small optimization in emit_bpf_tail_call() · 5a802e67
      Eric Dumazet authored
      [ upstream commit 84ccac6e ]
      
      Saves 4 bytes replacing following instructions :
      
      lea rax, [rsi + rdx * 8 + offsetof(...)]
      mov rax, qword ptr [rax]
      cmp rax, 0
      
      by :
      
      mov rax, [rsi + rdx * 8 + offsetof(...)]
      test rax, rax
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a802e67
    • Alexei Starovoitov's avatar
      bpf: fix branch pruning logic · 1367d854
      Alexei Starovoitov authored
      [ Upstream commit c131187d ]
      
      when the verifier detects that register contains a runtime constant
      and it's compared with another constant it will prune exploration
      of the branch that is guaranteed not to be taken at runtime.
      This is all correct, but malicious program may be constructed
      in such a way that it always has a constant comparison and
      the other branch is never taken under any conditions.
      In this case such path through the program will not be explored
      by the verifier. It won't be taken at run-time either, but since
      all instructions are JITed the malicious program may cause JITs
      to complain about using reserved fields, etc.
      To fix the issue we have to track the instructions explored by
      the verifier and sanitize instructions that are dead at run time
      with NOPs. We cannot reject such dead code, since llvm generates
      it for valid C code, since it doesn't do as much data flow
      analysis as the verifier does.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1367d854
    • Linus Torvalds's avatar
      loop: fix concurrent lo_open/lo_release · b3922254
      Linus Torvalds authored
      commit ae665016 upstream.
      
      范龙飞 reports that KASAN can report a use-after-free in __lock_acquire.
      The reason is due to insufficient serialization in lo_release(), which
      will continue to use the loop device even after it has decremented the
      lo_refcnt to zero.
      
      In the meantime, another process can come in, open the loop device
      again as it is being shut down. Confusion ensues.
      Reported-by: default avatar范龙飞 <long7573@126.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b3922254
  2. 31 Jan, 2018 33 commits