1. 26 Nov, 2016 1 commit
    • net: properly flush delay-freed skbs · f52dffe0
      Eric Dumazet authored
      Typical NAPI drivers use napi_consume_skb(skb) at TX completion time.
      This puts the skb in a special per-CPU queue, napi_alloc_cache, so
      that frees can be done in bulk.
      
      It turns out the queue is not flushed and hits the NAPI_SKB_CACHE_SIZE
      limit quite often, with skbs that were queued hundreds of usec earlier.
      I measured that one flush can take ~6000 nsec.
      
      __kfree_skb_flush() can be called from two points right now:
      
      1) From net_tx_action(), but only for skbs that were queued to
      sd->completion_queue.
      
       -> Irrelevant for NAPI drivers in normal operation.
      
      2) From net_rx_action(), but only under high stress or if RPS/RFS has a
      pending action.
      
      This patch changes net_rx_action() to perform the flush in all cases,
      after more urgent operations have happened (like kicking remote CPUs
      for RPS/RFS).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
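
The flush behaviour described in the commit above can be modelled in plain userspace C. This is a minimal sketch, not the kernel code: `struct free_cache`, `cache_defer_free()`, `cache_flush()`, `poll_round()` and the `CACHE_SIZE` value are all hypothetical names chosen for illustration. It shows a deferred-free queue that used to be emptied only when full, plus an unconditional flush at the end of each poll round.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for NAPI_SKB_CACHE_SIZE; the real value is not given above. */
#define CACHE_SIZE 64

struct free_cache {
	void *slots[CACHE_SIZE];
	size_t count;
	size_t flushes;		/* number of bulk frees performed so far */
};

/* Release everything queued so far in one batch. */
static void cache_flush(struct free_cache *c)
{
	if (c->count == 0)
		return;
	for (size_t i = 0; i < c->count; i++)
		free(c->slots[i]);
	c->count = 0;
	c->flushes++;
}

/* Queue a buffer for a later bulk free; flush only when the cache is full. */
static void cache_defer_free(struct free_cache *c, void *buf)
{
	assert(c->count < CACHE_SIZE);
	c->slots[c->count++] = buf;
	if (c->count == CACHE_SIZE)
		cache_flush(c);
}

/* One poll round: complete n buffers, then flush unconditionally at the end. */
static void poll_round(struct free_cache *c, size_t n)
{
	for (size_t i = 0; i < n; i++)
		cache_defer_free(c, malloc(32));
	cache_flush(c);
}
```

Without the final `cache_flush()` in `poll_round()`, a handful of buffers could sit in the cache until enough later completions fill it, which is the stale-skb situation the commit message describes.
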
  2. 25 Nov, 2016 8 commits
    • Merge branch 'cgroup-bpf' · ca89fa77
      David S. Miller authored
      Daniel Mack says:
      
      ====================
      Add eBPF hooks for cgroups
      
      This is v9 of the patch set to allow eBPF programs for network
      filtering and accounting to be attached to cgroups, so that they apply
      to all sockets of all tasks placed in that cgroup. The logic can also
      be extended for other cgroup-based eBPF use cases.
      
      Again, only minor details are updated in this version.
      
      Changes from v8:
      
      * Move the egress hooks into ip_finish_output() and ip6_finish_output()
        so they run after the netfilter hooks. For IPv4 multicast, add a new
        ip_mc_finish_output() callback that is invoked on success by
        netfilter, and call the eBPF program from there.
      
      Changes from v7:
      
      * Replace the static inline function cgroup_bpf_run_filter() with
        two specific macros for ingress and egress.  This addresses David
        Miller's concern regarding skb->sk vs. sk in the egress path.
        Thanks a lot to Daniel Borkmann and Alexei Starovoitov for the
        suggestions.
      
      Changes from v6:
      
      * Rebased to 4.9-rc2
      
      * Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot
        now succeeds in building this version of the patch set.
      
      * Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
        tamper with the contents of skb->cb[]. Pointed out by Daniel
        Borkmann.
      
      * Use sk_to_full_sk() in the egress path, as suggested by Daniel
        Borkmann.
      
      * Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
        requested by David Ahern.
      
      * Added Alexei's Acked-by tags.
      
      Changes from v5:
      
      * The eBPF programs now operate on L3 rather than on L2 of the packets,
        and the egress hooks were moved from __dev_queue_xmit() to
        ip*_output().
      
      * For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
        through BPF_LD_[ABS|IND] instructions, but hook up the
        bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
        for the help.
      
      Changes from v4:
      
      * Plug an skb leak when dropping packets due to eBPF verdicts in
        __dev_queue_xmit(). Spotted by Daniel Borkmann.
      
      * Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
        operate on timewait or request sockets. Suggested by Daniel Borkmann.
      
      * Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
        Spotted by Rami Rosen.
      
      * Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.
      
      Changes from v3:
      
      * Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
        renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
        BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
        __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.
      
      * Dropped the attach_flags member from the anonymous struct for BPF
        attach operations in union bpf_attr. They can be added later on via
        CHECK_ATTR. Requested by Daniel Borkmann and Alexei.
      
      * Release old_prog at the end of __cgroup_bpf_update rather than at
        the beginning to fix a race gap between program updates and their
        users. Spotted by Daniel Borkmann.
      
      * Plugged an skb leak when dropping packets on the egress path.
        Spotted by Daniel Borkmann.
      
      * Add cgroups@vger.kernel.org to the loop, as suggested by Rami Rosen.
      
      * Some minor coding style adaptations not worth mentioning in particular.
      
      Changes from v2:
      
      * Fixed the RCU locking details Tejun pointed out.
      
      * Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.
      
      Changes from v1:
      
      * Moved all bpf specific cgroup code into its own file, and stub
        out related functions for !CONFIG_CGROUP_BPF as static inline nops.
        This way, the call sites are not cluttered with #ifdef guards while
        the feature remains compile-time configurable.
      
      * Implemented the new scheme proposed by Tejun. Per cgroup, store one
        set of pointers that are pinned to the cgroup, and one for the
        programs that are effective. When a program is attached or detached,
        the change is propagated to all the cgroup's descendants. If a
        subcgroup has its own pinned program, skip the whole subbranch in
        order to allow delegation models.
      
      * The hookup for egress packets is now done from __dev_queue_xmit().
      
      * A static key is now used in both the ingress and egress fast paths
        to keep performance penalties close to zero if the feature is
        not in use.
      
      * Overall cleanup to make the accessors use the program arrays.
        This should make it much easier to add new program types, which
        will then automatically follow the pinned vs. effective logic.
      
      * Fixed locking issues, as pointed out by Eric Dumazet and Alexei
        Starovoitov. Changes to the program array are now done with
        xchg() and are protected by cgroup_mutex.
      
      * eBPF programs are now expected to return 1 to let the packet pass,
        not >= 0. Pointed out by Alexei.
      
      * Operation is now limited to INET sockets, so local AF_UNIX sockets
        are not affected. The enum members are renamed accordingly. In case
        other socket families should be supported, this can be extended in
        the future.
      
      * The sample program learned to support both ingress and egress, and
        can now optionally make the eBPF program drop packets by making it
        return 0.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples: bpf: add userspace example for attaching eBPF programs to cgroups · d8c5b17f
      Daniel Mack authored
      Add a simple userspace program to demonstrate the new API to attach eBPF
      programs to cgroups. This is what it does:
      
       * Create arraymap in kernel with 4 byte keys and 8 byte values
      
       * Load eBPF program
      
         The eBPF program accesses the map passed in to store two pieces of
         information. The number of invocations of the program, which maps
         to the number of packets received, is stored to key 0. Key 1 is
         incremented on each iteration by the number of bytes stored in
         the skb.
      
       * Detach any eBPF program previously attached to the cgroup
      
       * Attach the new program to the cgroup using BPF_PROG_ATTACH
      
       * Once a second, read map[0] and map[1] to see how many packets and
         bytes were seen on any socket of tasks in the given cgroup.
      
      The program takes a cgroup path as 1st argument, and either "ingress"
      or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
      which will make the generated eBPF program return 0 instead of 1, so
      the kernel will drop the packet.
      
      libbpf gained two new wrappers for the new syscall commands.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
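
The per-packet accounting that the sample's eBPF program performs can be modelled in ordinary C. This is only a sketch of the logic described above, not the generated eBPF instructions: `struct counters`, `account_packet()` and the `KEY_*` names are hypothetical, and the two-slot array stands in for the kernel array map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel array map: keys 0 and 1, 8-byte values. */
struct counters {
	uint64_t value[2];
};

enum { KEY_PACKETS = 0, KEY_BYTES = 1 };

/*
 * Plain-C model of what the program does for each packet.
 * Returns the verdict: 1 lets the packet pass, 0 makes the kernel drop it.
 */
static int account_packet(struct counters *map, uint32_t skb_len, int drop)
{
	assert(map != NULL);
	map->value[KEY_PACKETS] += 1;		/* one more invocation */
	map->value[KEY_BYTES] += skb_len;	/* bytes carried by this skb */
	return drop ? 0 : 1;
}
```

The counters are updated even in "drop" mode, so the userspace side would still see traffic that the program rejected.
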
    • net: ipv4, ipv6: run cgroup eBPF egress programs · 33b48679
      Daniel Mack authored
      If the cgroup associated with the receiving socket has eBPF
      programs installed, run them from ip_output(), ip6_output() and
      ip_mc_output(). From the mentioned functions we have two socket
      contexts as per 7026b1dd ("netfilter: Pass socket pointer down
      through okfn()."). We explicitly need to use sk instead of skb->sk
      here, since otherwise the same program would run multiple times on
      egress when encap devices are involved, which is not desired in our
      case.
      
      eBPF programs used in this context are expected to either return 1 to
      let the packet pass, or != 1 to drop it. The programs have access to
      the skb through bpf_skb_load_bytes(), and the payload starts at the
      network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: run cgroup eBPF ingress programs · c11cd3a6
      Daniel Mack authored
      If the cgroup associated with the receiving socket has eBPF
      programs installed, run them from sk_filter_trim_cap().
      
      eBPF programs used in this context are expected to either return 1 to
      let the packet pass, or != 1 to drop it. The programs have access to
      the skb through bpf_skb_load_bytes(), and the payload starts at the
      network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
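
The verdict convention used by the ingress and egress hooks (exactly 1 passes, anything else drops) can be sketched as follows. The names `filter_prog` and `run_cgroup_filter()` are hypothetical, the NULL check merely stands in for the static-key fast path, and returning `-EPERM` for a drop is an assumption of this sketch rather than something stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical program handle: a callback returning the program's verdict. */
typedef int (*filter_prog)(const unsigned char *l3_data, size_t len);

/*
 * Run the effective program, if any. Returns 0 to let the packet continue
 * and -EPERM (an assumption of this sketch) to drop it. Only a verdict of
 * exactly 1 counts as "pass".
 */
static int run_cgroup_filter(filter_prog prog, const unsigned char *l3_data,
			     size_t len)
{
	if (prog == NULL)	/* nothing attached: cheap fast path */
		return 0;
	assert(l3_data != NULL);
	return prog(l3_data, len) == 1 ? 0 : -EPERM;
}

/* Three toy programs used to exercise the convention. */
static int pass_all(const unsigned char *d, size_t n) { (void)d; (void)n; return 1; }
static int drop_all(const unsigned char *d, size_t n) { (void)d; (void)n; return 0; }
static int returns_two(const unsigned char *d, size_t n) { (void)d; (void)n; return 2; }
```

Note that `returns_two` also results in a drop: the check is "equal to 1", not "non-zero".
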
    • bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands · f4324551
      Daniel Mack authored
      Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
      BPF_PROG_DETACH which allow attaching and detaching eBPF programs
      to a target.
      
      On the API level, the target could be anything that has an fd in
      userspace, hence the field in union bpf_attr is called
      'target_fd'.
      
      When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
      expected to be a valid file descriptor of a cgroup v2 directory which
      has the bpf controller enabled. These are the only use-cases
      implemented by this patch at this point, but more can be added.
      
      If a program of the given type already exists in the given cgroup,
      the program is swapped atomically, so userspace does not have to drop
      an existing program first before installing a new one, which would
      otherwise leave a gap in which no program is attached.
      
      For more information on the propagation logic to subcgroups, please
      refer to the bpf cgroup controller implementation.
      
      The API is guarded by CAP_NET_ADMIN.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
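
A userspace call to the new command might look roughly like the sketch below. The helper names `fill_attach_attr()` and `attach_prog_to_cgroup()` are hypothetical. The sketch assumes a `<linux/bpf.h>` that already defines `BPF_PROG_ATTACH`, the `target_fd`, `attach_bpf_fd` and `attach_type` members, and the attach types under the names `BPF_CGROUP_INET_INGRESS` and `BPF_CGROUP_INET_EGRESS` (the cover letter mentions the rename to those names).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill the attach fields of union bpf_attr; everything else stays zero. */
void fill_attach_attr(union bpf_attr *attr, int cgroup_fd, int prog_fd,
		      enum bpf_attach_type type)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->target_fd = cgroup_fd;	/* fd of the cgroup v2 directory */
	attr->attach_bpf_fd = prog_fd;	/* fd of the loaded eBPF program */
	attr->attach_type = type;
}

/*
 * Attach prog_fd to the cgroup behind cgroup_fd.
 * Returns 0 on success, -1 with errno set on failure.
 * The caller needs CAP_NET_ADMIN, as noted in the commit message.
 */
int attach_prog_to_cgroup(int cgroup_fd, int prog_fd,
			  enum bpf_attach_type type)
{
	union bpf_attr attr;

	fill_attach_attr(&attr, cgroup_fd, prog_fd, type);
	return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

A caller would typically obtain `cgroup_fd` by opening the cgroup directory with `open(path, O_DIRECTORY | O_RDONLY)` and `prog_fd` from a prior `BPF_PROG_LOAD`.
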
    • cgroup: add support for eBPF programs · 30070984
      Daniel Mack authored
      This patch adds two sets of eBPF program pointers to struct cgroup.
      One set holds the programs that are directly pinned to a cgroup, the
      other holds the programs that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
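
The pinned-versus-effective rule from the A/B/C/D/E example above can be sketched with a small tree walk. This is a simplified model, not the kernel implementation: `struct cgroup_node`, `add_child()` and `propagate()` are hypothetical, programs are represented by name strings, and the whole tree is recomputed on every change instead of updating only the affected descendants.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct cgroup_node {
	const char *name;
	const char *pinned;	/* program attached directly here, or NULL */
	const char *effective;	/* program that actually runs for this cgroup */
	struct cgroup_node *children[MAX_CHILDREN];
	size_t nr_children;
};

static void add_child(struct cgroup_node *parent, struct cgroup_node *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	parent->children[parent->nr_children++] = child;
}

/*
 * Recompute the effective program for a subtree. A cgroup uses its own
 * pinned program if it has one, otherwise whatever it inherits from above.
 */
static void propagate(struct cgroup_node *cg, const char *inherited)
{
	cg->effective = cg->pinned ? cg->pinned : inherited;
	for (size_t i = 0; i < cg->nr_children; i++)
		propagate(cg->children[i], cg->effective);
}
```

With a program pinned only on B, the walk makes it effective for B, C, D and E. Pinning a second program on D changes the result for D and E only.
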
    • bpf: add new prog type for cgroup socket filtering · 0e33661d
      Daniel Mack authored
      This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
      it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
      bpf_skb_load_bytes() helper.
      
      Programs of this type will be attached to cgroups for network filtering
      and accounting.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: fix memory leak on txq_info · 619228d8
      Colin Ian King authored
      Currently if txq_info->uldtxq cannot be allocated then
      txq_info->txq is being kfree'd (which is redundant because it
      is NULL) instead of txq_info. Fix this by instead kfree'ing
      txq_info.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
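
The error-path pattern behind this kind of fix can be shown in a userspace sketch. The structures below are trimmed-down, hypothetical stand-ins for the driver's types, and `setup_txq_info()`, `alloc_fn` and `failing_alloc()` are invented for illustration. When the inner allocation fails, the container has to be freed, because freeing the (NULL) member does nothing and leaks the container.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-ins for the driver structures. */
struct uld_txq {
	int id;
};

struct txq_info {
	struct uld_txq *uldtxq;
	unsigned int ntxq;
};

/* Allocator hook so that a failure of the second allocation can be simulated. */
typedef void *(*alloc_fn)(size_t nmemb, size_t size);

static int setup_txq_info(struct txq_info **out, unsigned int ntxq,
			  alloc_fn alloc_array)
{
	struct txq_info *info;

	assert(out != NULL);
	info = calloc(1, sizeof(*info));
	if (!info)
		return -ENOMEM;

	info->uldtxq = alloc_array(ntxq, sizeof(*info->uldtxq));
	if (!info->uldtxq) {
		free(info);	/* free the container, not the NULL member */
		return -ENOMEM;
	}
	info->ntxq = ntxq;
	*out = info;
	return 0;
}

/* Always fails, to exercise the error path. */
static void *failing_alloc(size_t nmemb, size_t size)
{
	(void)nmemb;
	(void)size;
	return NULL;
}
```
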
  3. 24 Nov, 2016 23 commits
  4. 22 Nov, 2016 8 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · f9aa9dc7
      David S. Miller authored
      All conflicts were simple overlapping changes except perhaps
      for the Thunder driver.
      
      That driver has a change_mtu method explicitly for sending
      a message to the hardware.  If that fails it returns an
      error.
      
      Normally a driver doesn't need an ndo_change_mtu method because those
      are usually just range changes, which are now handled generically.
      But since this extra operation is needed in the Thunder driver, it has
      to stay.
      
      However, if the message send fails we have to restore the original
      MTU before the change because the entire call chain expects that if
      an error is thrown by ndo_change_mtu then the MTU did not change.
      Therefore code is added to nicvf_change_mtu to remember the original
      MTU, and to restore it upon nicvf_update_hw_max_frs() failure.
      Signed-off-by: David S. Miller <davem@davemloft.net>
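
The save-and-restore logic described in the merge note above can be sketched in userspace C. Everything here is a hypothetical model: `struct fake_netdev`, `update_hw_max_frs()` and `change_mtu()` only imitate the shape of the driver code, and the `-EINVAL` return value is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_netdev {
	int mtu;
	bool hw_accepts_update;	/* hypothetical knob standing in for the hardware */
};

/* Stand-in for the hardware message; returns 0 on success, nonzero on failure. */
static int update_hw_max_frs(const struct fake_netdev *dev, int mtu)
{
	(void)mtu;
	return dev->hw_accepts_update ? 0 : -1;
}

/* Callers expect that the MTU is unchanged whenever an error is returned. */
static int change_mtu(struct fake_netdev *dev, int new_mtu)
{
	int orig_mtu;

	assert(dev != NULL);
	orig_mtu = dev->mtu;	/* remember the value to roll back to */
	dev->mtu = new_mtu;

	if (update_hw_max_frs(dev, new_mtu)) {
		dev->mtu = orig_mtu;	/* restore on failure */
		return -EINVAL;
	}
	return 0;
}
```
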
    • marvell: mark mvneta and mvpp2 32-bit only · 06b37b65
      Arnd Bergmann authored
      Both of these drivers won't work on 64-bit architectures unless they
      are redesigned, since they store a virtual address pointer in a 32-bit
      field of the descriptors:
      
      drivers/net/ethernet/marvell/mvneta_bm.c: In function 'mvneta_bm_construct':
      drivers/net/ethernet/marvell/mvneta_bm.c:103:16: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
      drivers/net/ethernet/marvell/mvpp2.c: In function 'mvpp2_prs_vlan_init':
      drivers/net/ethernet/marvell/mvpp2.c:2563:32: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
      
      This limits the COMPILE_TEST option for the two drivers again to
      only build them on 32-bit. This seems nicer than shutting up the
      warnings, in case we ever actually want to use them on 64-bit,
      as the warnings indicate which parts of the driver are currently
      broken there.
      
      Fixes: a0627f77 ("net: marvell: Allow drivers to be built with COMPILE_TEST")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
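
Why a virtual address in a 32-bit descriptor field breaks on 64-bit can be illustrated with a short sketch. `struct rx_desc`, `desc_store_ptr()` and `desc_roundtrip_ok()` are hypothetical and do not mirror the real descriptor layout. The explicit casts silence the compiler warning quoted above but do not prevent the upper address bits from being lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor with a 32-bit cookie field. */
struct rx_desc {
	uint32_t buf_cookie;
};

/* Store a virtual address in the 32-bit field; upper bits are discarded. */
static void desc_store_ptr(struct rx_desc *desc, void *buf)
{
	assert(desc != NULL);
	desc->buf_cookie = (uint32_t)(uintptr_t)buf;
}

/* True only if the pointer read back from the descriptor equals the original. */
static bool desc_roundtrip_ok(void *buf)
{
	struct rx_desc desc;

	desc_store_ptr(&desc, buf);
	return (void *)(uintptr_t)desc.buf_cookie == buf;
}
```

On a 32-bit build the round trip always succeeds. On a 64-bit build it succeeds only for addresses below 4 GiB, which is why restricting the drivers to 32-bit is safer than hiding the warning.
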
    • Merge branch 'mlxsw-thermal-zone' · 41f698b0
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: core: Implement thermal zone
      
      Implement thermal zone for mlxsw based HW.
      The first patch is just a register dependency for the second patch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Implement thermal zone · a50c1e35
      Ivan Vecera authored
      Implement thermal zone for mlxsw based HW. It uses the temperature
      sensor provided by the ASIC (the same one as the mlxsw hwmon
      interface) to report the current temperature to the thermal core. The
      ASIC's PWM is then used to control the speed of the system fans
      registered as cooling devices.
      Signed-off-by: Ivan Vecera <cera@cera.cz>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Management Fan Speed Limit register · 55c63aaa
      Jiri Pirko authored
      The MFSL register is used to configure the fan speed event / interrupt
      notification mechanism. Fan speed thresholds are defined for both
      under-speed and over-speed.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mv88e6390-initial-support' · 29033726
      David S. Miller authored
      Andrew Lunn says:
      
      ====================
      Start adding support for mv88e6390
      
      This is the first patchset implementing support for the mv88e6390
      family.  This is a new generation of switch devices and has numerous
      incompatible changes to the registers. These patches allow the switch
      to be detected during probe, and make the statistics unit work.
      
      These patches are insufficient to make the mv88e6390 functional. More
      patches will follow.
      
      v2:
        Move stats code into global1
        Change DT compatible string to mv88e6190
        Fixed mv88e6351 stats which v1 had broken
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Move g1 stats code in global1.[ch] · 7f9ef3af
      Andrew Lunn authored
      Move the stats functions which access global 1 registers into
      global1.c.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Implement mv88e6390 get_stats · e0d8b615
      Andrew Lunn authored
      The mv88e6390 uses a different bit to select between bank0 and bank1
      of the statistics. So implement an ops function for this, and pass the
      selector bit to the generic stats read function. Also, the histogram
      selection has moved for the mv88e6390, so abstract its selection as
      well.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
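
The "per-family ops function passes a selector bit to a generic reader" pattern from the last commit can be sketched as below. All names (`stats_op_generic()`, `stats_op_bit9()`, `stats_op_bit10()`, `struct chip_ops`) and the two bit positions are illustrative assumptions, not taken from the driver source. The sketch only builds the operation word and does not touch any hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bank-select bit positions; treat the exact values as assumptions. */
#define STATS_BANK_1_BIT_9	0x0200
#define STATS_BANK_1_BIT_10	0x0400

/* Generic helper: build the stats operation word for a counter and bank. */
static uint16_t stats_op_generic(uint8_t stat, int bank, uint16_t bank1_select)
{
	uint16_t op = stat;

	assert(bank == 0 || bank == 1);
	if (bank == 1)
		op |= bank1_select;	/* family-specific bit chooses bank 1 */
	return op;
}

/* Per-family wrappers that only differ in the selector bit they pass down. */
static uint16_t stats_op_bit9(uint8_t stat, int bank)
{
	return stats_op_generic(stat, bank, STATS_BANK_1_BIT_9);
}

static uint16_t stats_op_bit10(uint8_t stat, int bank)
{
	return stats_op_generic(stat, bank, STATS_BANK_1_BIT_10);
}

/* Ops table entry through which callers reach the right variant. */
struct chip_ops {
	uint16_t (*stats_op)(uint8_t stat, int bank);
};
```

Callers go through `ops->stats_op()` and never need to know which bit a given family uses, which keeps the shared read path free of per-chip conditionals.
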