1. 12 Jan, 2018 9 commits
  2. 11 Jan, 2018 1 commit
    • Eran Ben Elisha's avatar
      {net,ib}/mlx5: Don't disable local loopback multicast traffic when needed · 8978cc92
      Eran Ben Elisha authored
      There are systems platform information management interfaces (such as
      HOST2BMC) for which we cannot disable local loopback multicast traffic.
      
      Separate disable_local_lb_mc and disable_local_lb_uc capability bits so
      driver will not disable multicast loopback traffic if not supported.
      (It is expected that Firmware will not set disable_local_lb_mc if
      HOST2BMC is running for example.)
      
      Function mlx5_nic_vport_update_local_lb will do best effort to
      disable/enable UC/MC loopback traffic and return success only in case it
      succeeded to changed all allowed by Firmware.
      
      Adapt mlx5_ib and mlx5e to support the new cap bits.
      
      Fixes: 2c43c5a0 ("net/mlx5e: Enable local loopback in loopback selftest")
      Fixes: c85023e1 ("IB/mlx5: Add raw ethernet local loopback support")
      Fixes: bded747b ("net/mlx5: Add raw ethernet local loopback firmware command")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Cc: kernel-team@fb.com
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      8978cc92
  3. 10 Jan, 2018 17 commits
  4. 09 Jan, 2018 12 commits
    • Alexei Starovoitov's avatar
      bpf: introduce BPF_JIT_ALWAYS_ON config · 290af866
      Alexei Starovoitov authored
      The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
      
      A quote from goolge project zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
      option that removes interpreter from the kernel in favor of JIT-only mode.
      So far eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of JITed program is randomized and code page is marked as read-only.
      In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      290af866
    • Daniel Borkmann's avatar
      bpf: avoid false sharing of map refcount with max_entries · be95a845
      Daniel Borkmann authored
      In addition to commit b2157399 ("bpf: prevent out-of-bounds
      speculation") also change the layout of struct bpf_map such that
      false sharing of fast-path members like max_entries is avoided
      when the maps reference counter is altered. Therefore enforce
      them to be placed into separate cachelines.
      
      pahole dump after change:
      
        struct bpf_map {
              const struct bpf_map_ops  * ops;                 /*     0     8 */
              struct bpf_map *           inner_map_meta;       /*     8     8 */
              void *                     security;             /*    16     8 */
              enum bpf_map_type          map_type;             /*    24     4 */
              u32                        key_size;             /*    28     4 */
              u32                        value_size;           /*    32     4 */
              u32                        max_entries;          /*    36     4 */
              u32                        map_flags;            /*    40     4 */
              u32                        pages;                /*    44     4 */
              u32                        id;                   /*    48     4 */
              int                        numa_node;            /*    52     4 */
              bool                       unpriv_array;         /*    56     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct user_struct *       user;                 /*    64     8 */
              atomic_t                   refcnt;               /*    72     4 */
              atomic_t                   usercnt;              /*    76     4 */
              struct work_struct         work;                 /*    80    32 */
              char                       name[16];             /*   112    16 */
              /* --- cacheline 2 boundary (128 bytes) --- */
      
              /* size: 128, cachelines: 2, members: 17 */
              /* sum members: 121, holes: 1, sum holes: 7 */
        };
      
      Now all entries in the first cacheline are read only throughout
      the life time of the map, set up once during map creation. Overall
      struct size and number of cachelines doesn't change from the
      reordering. struct bpf_map is usually first member and embedded
      in map structs in specific map implementations, so also avoid those
      members to sit at the end where it could potentially share the
      cacheline with first map values e.g. in the array since remote
      CPUs could trigger map updates just as well for those (easily
      dirtying members like max_entries intentionally as well) while
      having subsequent values in cache.
      
      Quoting from Google's Project Zero blog [1]:
      
        Additionally, at least on the Intel machine on which this was
        tested, bouncing modified cache lines between cores is slow,
        apparently because the MESI protocol is used for cache coherence
        [8]. Changing the reference counter of an eBPF array on one
        physical CPU core causes the cache line containing the reference
        counter to be bounced over to that CPU core, making reads of the
        reference counter on all other CPU cores slow until the changed
        reference counter has been written back to memory. Because the
        length and the reference counter of an eBPF array are stored in
        the same cache line, this also means that changing the reference
        counter on one physical CPU core causes reads of the eBPF array's
        length to be slow on other physical CPU cores (intentional false
        sharing).
      
      While this doesn't 'control' the out-of-bounds speculation through
      masking the index as in commit b2157399, triggering a manipulation
      of the map's reference counter is really trivial, so lets not allow
      to easily affect max_entries from it.
      
      Splitting to separate cachelines also generally makes sense from
      a performance perspective anyway in that fast-path won't have a
      cache miss if the map gets pinned, reused in other progs, etc out
      of control path, thus also avoids unintentional false sharing.
      
        [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.htmlSigned-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      be95a845
    • Wei Wang's avatar
      ipv6: remove null_entry before adding default route · 4512c43e
      Wei Wang authored
      In the current code, when creating a new fib6 table, tb6_root.leaf gets
      initialized to net->ipv6.ip6_null_entry.
      If a default route is being added with rt->rt6i_metric = 0xffffffff,
      fib6_add() will add this route after net->ipv6.ip6_null_entry. As
      null_entry is shared, it could cause problem.
      
      In order to fix it, set fn->leaf to NULL before calling
      fib6_add_rt2node() when trying to add the first default route.
      And reset fn->leaf to null_entry when adding fails or when deleting the
      last default route.
      
      syzkaller reported the following issue which is fixed by this commit:
      
      WARNING: suspicious RCU usage
      4.15.0-rc5+ #171 Not tainted
      -----------------------------
      net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      4 locks held by swapper/0/0:
       #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
       #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
       #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
       #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
       #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
       #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
       #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
      
      stack backtrace:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
       fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
       fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
       fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
       fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
       fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
       __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
       fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
       fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
       fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
       call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
       expire_timers kernel/time/timer.c:1357 [inline]
       __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
       run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
       __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:540 [inline]
       smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
       </IRQ>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Fixes: 66f5d6ce ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4512c43e
    • David S. Miller's avatar
      Merge branch 'Ether-fixes-for-the-SolutionEngine771x-boards' · 22dd8e6b
      David S. Miller authored
      Sergei Shtylyov says:
      
      ====================
      Ether fixes for the SolutionEngine771x boards
      
      Here's the series of 2 patches against Linus' repo. This series should
      (hoplefully) fix the Ether support on the SolutionEngine771x boards...
      
      [1/2] SolutionEngine771x: fix Ether platform data
      [2/2] SolutionEngine771x: add Ether TSU resource
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      22dd8e6b
    • Sergei Shtylyov's avatar
      SolutionEngine771x: add Ether TSU resource · f9a531d6
      Sergei Shtylyov authored
      After the  Ether platform data is fixed, the driver probe() method would
      still fail since the 'struct sh_eth_cpu_data' corresponding  to SH771x
      indicates the presence of TSU but the memory resource for it is absent.
      Add the missing TSU resource  to both Ether devices and fix the harmless
      off-by-one error in the main memory resources, while at it...
      
      Fixes: 4986b996 ("net: sh_eth: remove the SH_TSU_ADDR")
      Signed-off-by: default avatarSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f9a531d6
    • Sergei Shtylyov's avatar
      SolutionEngine771x: fix Ether platform data · 195e2add
      Sergei Shtylyov authored
      The 'sh_eth' driver's probe() method would fail  on the SolutionEngine7710
      board and crash on SolutionEngine7712 board  as the platform code is
      hopelessly behind the driver's platform data --  it passes the PHY address
      instead of 'struct sh_eth_plat_data *'; pass the latter to the driver in
      order to fix the bug...
      
      Fixes: 71557a37 ("[netdrvr] sh_eth: Add SH7619 support")
      Signed-off-by: default avatarSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      195e2add
    • Mike Rapoport's avatar
      docs-rst: networking: wire up msg_zerocopy · 2fdd1811
      Mike Rapoport authored
      Fix the following 'make htmldocs' complaint:
      
      Documentation/networking/msg_zerocopy.rst:: WARNING: document isn't included in any toctree.
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2fdd1811
    • Nicolai Stange's avatar
      net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg() · 20b50d79
      Nicolai Stange authored
      Commit 8f659a03 ("net: ipv4: fix for a race condition in
      raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling
      due to concurrent updates by reading this bit-field member into a local
      variable and using the thus stabilized value in subsequent tests.
      
      However, aforementioned commit also adds the (correct) comment that
      
        /* hdrincl should be READ_ONCE(inet->hdrincl)
         * but READ_ONCE() doesn't work with bit fields
         */
      
      because as it stands, the compiler is free to shortcut or even eliminate
      the local variable at its will.
      
      Note that I have not seen anything like this happening in reality and thus,
      the concern is a theoretical one.
      
      However, in order to be on the safe side, emulate a READ_ONCE() on the
      bit-field by doing it on the local 'hdrincl' variable itself:
      
      	int hdrincl = inet->hdrincl;
      	hdrincl = READ_ONCE(hdrincl);
      
      This breaks the chain in the sense that the compiler is not allowed
      to replace subsequent reads from hdrincl with reloads from inet->hdrincl.
      
      Fixes: 8f659a03 ("net: ipv4: fix for a race condition in raw_sendmsg")
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Reviewed-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20b50d79
    • Xiongfeng Wang's avatar
      net: caif: use strlcpy() instead of strncpy() · 3dc2fa47
      Xiongfeng Wang authored
      gcc-8 reports
      
      net/caif/caif_dev.c: In function 'caif_enroll_dev':
      ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
      be truncated copying 15 bytes from a string of length 15
      [-Wstringop-truncation]
      
      net/caif/cfctrl.c: In function 'cfctrl_linkup_request':
      ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
      be truncated copying 15 bytes from a string of length 15
      [-Wstringop-truncation]
      
      net/caif/cfcnfg.c: In function 'caif_connect_client':
      ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
      be truncated copying 15 bytes from a string of length 15
      [-Wstringop-truncation]
      
      The compiler require that the input param 'len' of strncpy() should be
      greater than the length of the src string, so that '\0' is copied as
      well. We can just use strlcpy() to avoid this warning.
      Signed-off-by: default avatarXiongfeng Wang <xiongfeng.wang@linaro.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3dc2fa47
    • Andrii Vladyka's avatar
      net: core: fix module type in sock_diag_bind · b8fd0823
      Andrii Vladyka authored
      Use AF_INET6 instead of AF_INET in IPv6-related code path
      Signed-off-by: default avatarAndrii Vladyka <tulup@mail.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b8fd0823
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ef7f8cec
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Frag and UDP handling fixes in i40e driver, from Amritha Nambiar and
          Alexander Duyck.
      
       2) Undo unintentional UAPI change in netfilter conntrack, from Florian
          Westphal.
      
       3) Revert a change to how error codes are returned from
          dev_get_valid_name(), it broke some apps.
      
       4) Cannot cache routes for ipv6 tunnels in the tunnel is ipv4/ipv6
          dual-stack. From Eli Cooper.
      
       5) Fix missed PMTU updates in geneve, from Xin Long.
      
       6) Cure double free in macvlan, from Gao Feng.
      
       7) Fix heap out-of-bounds write in rds_message_alloc_sgs(), from
          Mohamed Ghannam.
      
       8) FEC bug fixes from FUgang Duan (mis-accounting of dev_id, missed
          deferral of probe when the regulator is not ready yet).
      
       9) Missing DMA mapping error checks in 3c59x, from Neil Horman.
      
      10) Turn off Broadcom tags for some b53 switches, from Florian Fainelli.
      
      11) Fix OOPS when get_target_net() is passed an SKB whose NETLINK_CB()
          isn't initialized. From Andrei Vagin.
      
      12) Fix crashes in fib6_add(), from Wei Wang.
      
      13) PMTU bug fixes in SCTP from Marcelo Ricardo Leitner.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
        sh_eth: fix TXALCR1 offsets
        mdio-sun4i: Fix a memory leak
        phylink: mark expected switch fall-throughs in phylink_mii_ioctl
        sctp: fix the handling of ICMP Frag Needed for too small MTUs
        sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled
        xen-netfront: enable device after manual module load
        bnxt_en: Fix the 'Invalid VF' id check in bnxt_vf_ndo_prep routine.
        bnxt_en: Fix population of flow_type in bnxt_hwrm_cfa_flow_alloc()
        sh_eth: fix SH7757 GEther initialization
        net: fec: free/restore resource in related probe error pathes
        uapi/if_ether.h: prevent redefinition of struct ethhdr
        ipv6: fix general protection fault in fib6_add()
        RDS: null pointer dereference in rds_atomic_free_op
        sh_eth: fix TSU resource handling
        net: stmmac: enable EEE in MII, GMII or RGMII only
        rtnetlink: give a user socket to get_target_net()
        MAINTAINERS: Update my email address.
        can: ems_usb: improve error reporting for error warning and error passive
        can: flex_can: Correct the checking for frame length in flexcan_start_xmit()
        can: gs_usb: fix return value of the "set_bittiming" callback
        ...
      ef7f8cec
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 44596f86
      Linus Torvalds authored
      Pull rdma fixes from Doug Ledford:
      
       - One line fix to mlx4 error flow (same as mlx5 fix in last pull
         request, just in the mlx4 driver)
      
       - Fix a race condition in the IPoIB driver. This patch is larger than
         just a one line fix, but resolves a race condition in a fairly
         straight forward manner
      
       - Fix a locking issue in the RDMA netlink code. This patch is also
         larger than I would like for a late -rc. It has, however, had a week
         to bake in the rdma tree prior to this pull request
      
       - One line fix to fix granting remote machine access to memory that
         they don't need and shouldn't have
      
       - One line fix to correct the fact that our sgid/dgid pair is swapped
         from what you would expect when receiving an incoming connection
         request
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        IB/srpt: Fix ACL lookup during login
        IB/srpt: Disable RDMA access by the initiator
        RDMA/netlink: Fix locking around __ib_get_device_by_index
        IB/ipoib: Fix race condition in neigh creation
        IB/mlx4: Fix mlx4_ib_alloc_mr error flow
      44596f86
  5. 08 Jan, 2018 1 commit
    • Alexei Starovoitov's avatar
      bpf: prevent out-of-bounds speculation · b2157399
      Alexei Starovoitov authored
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
      To avoid leaking kernel data round up array-based maps and mask the index
      after bounds check, so speculated load with out of bounds index will load
      either valid value from the array or zero from the padded area.
      
      Unconditionally mask index for all array types even when max_entries
      are not rounded to power of 2 for root user.
      When map is created by unpriv user generate a sequence of bpf insns
      that includes AND operation to make sure that JITed code includes
      the same 'index & index_mask' operation.
      
      If prog_array map is created by unpriv user replace
        bpf_tail_call(ctx, map, index);
      with
        if (index >= max_entries) {
          index &= map->index_mask;
          bpf_tail_call(ctx, map, index);
        }
      (along with roundup to power 2) to prevent out-of-bounds speculation.
      There is secondary redundant 'if (index >= max_entries)' in the interpreter
      and in all JITs, but they can be optimized later if necessary.
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
      v2->v3:
      Daniel noticed that attack potentially can be crafted via syscall commands
      without loading the program, so add masking to those paths as well.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      b2157399