1. 08 Mar, 2016 34 commits
    • David S. Miller's avatar
      Merge branch 'bpf-map-prealloc' · f14b488d
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: map pre-alloc
      
      v1->v2:
      . fix a few issues spotted by Daniel
      . converted stackmap into pre-allocation as well
      . added a workaround for lockdep false positive
      . added pcpu_freelist_populate to be used by hashmap and stackmap
      
      This patch set switches the bpf hash map to use pre-allocation by default
      and introduces the BPF_F_NO_PREALLOC flag to keep the old behavior for cases
      where full map pre-allocation is too memory expensive.
      
      Some time back Daniel Wagner reported crashes when the bpf hash map is
      used to compute time intervals between preempt_disable->preempt_enable,
      and recently Tom Zanussi reported a deadlock in the iovisor/bcc/funccount
      tool when it's used to count the number of invocations of kernel
      '*spin*' functions. Both problems are due to the recursive use of
      slub and can only be solved by pre-allocating all map elements.
      
      A lot of different solutions were considered. Many were implemented,
      but in the end pre-allocation seems to be the only feasible answer.
      As far as pre-allocation goes, it was also implemented in 4 different ways:
      - simple free-list with single lock
      - percpu_ida with optimizations
      - blk-mq-tag variant customized for bpf use case
      - percpu_freelist
      For bpf-style alloc/free patterns percpu_freelist is the best
      and is the one implemented in this patch set.
      Patch 1 fixes simple deadlocks due to missing recursion checks.
      Patch 2 introduces percpu_freelist.
      Detailed performance numbers are in patch 3.
      Patch 5 converts stackmap to pre-allocation.
      Patches 6-9 prepare the test infrastructure.
      Patch 10 is a stress test for the hash map infrastructure: it attaches to spin_lock
      functions and bpf_map_update/delete are called from different contexts.
      Patch 11 is a stress test for bpf_get_stackid.
      Patch 12 is a map performance test.
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Reported-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f14b488d
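      A minimal user-space sketch of how the new map_flags bit can be used
      (a hedged illustration, not part of the patch set: it assumes a uapi
      header that already defines BPF_F_NO_PREALLOC and goes through the raw
      bpf(2) syscall; real programs would normally use a loader library):
      
      #include <linux/bpf.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      /* Illustrative wrapper around the bpf(2) syscall. */
      static int create_hash_map(__u32 map_flags)
      {
              union bpf_attr attr;
      
              memset(&attr, 0, sizeof(attr));
              attr.map_type    = BPF_MAP_TYPE_HASH;
              attr.key_size    = 8;
              attr.value_size  = 16;
              attr.max_entries = 1024;
              attr.map_flags   = map_flags;   /* new attribute in this series */
      
              return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      }
      
      int main(void)
      {
              int prealloc  = create_hash_map(0);                  /* new default: fully pre-allocated */
              int on_demand = create_hash_map(BPF_F_NO_PREALLOC);  /* opt back into kmalloc-on-update */
      
              return (prealloc < 0 || on_demand < 0);
      }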
    • Alexei Starovoitov's avatar
      samples/bpf: test both pre-alloc and normal maps · c3f85cff
      Alexei Starovoitov authored
      extend test coverage to include pre-allocated and run-time allocated maps
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3f85cff
    • Alexei Starovoitov's avatar
      samples/bpf: add map_flags to bpf loader · 89b97607
      Alexei Starovoitov authored
      Note that the old loader is compatible with the new kernel;
      map_flags are optional.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89b97607
    • Alexei Starovoitov's avatar
      samples/bpf: move ksym_search() into library · 3622e7e4
      Alexei Starovoitov authored
      Move the ksym search from offwaketime into the library so it can be
      reused in other tests.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3622e7e4
    • Alexei Starovoitov's avatar
      samples/bpf: make map creation more verbose · 618ec9a7
      Alexei Starovoitov authored
      Map creation is typically the first operation to fail when rlimits are
      too low, there is not enough memory, etc.
      Make this failure scenario more verbose.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      618ec9a7
    • Alexei Starovoitov's avatar
      bpf: convert stackmap to pre-allocation · 557c0c6e
      Alexei Starovoitov authored
      It was observed that calling bpf_get_stackid() from a kprobe inside
      slub or from spin_unlock causes a similar deadlock as with the hashmap,
      therefore convert the stackmap to use pre-allocated memory.
      
      call_rcu is no longer a feasible mechanism, since delayed freeing
      causes bpf_get_stackid() to fail unpredictably when the number of actual
      stacks is significantly less than the user-requested max_entries.
      Since elements are no longer freed into slub, we can push elements into
      the freelist immediately and let them be recycled.
      However, a very unlikely race between user-space map_lookup() and
      program-side recycling is possible:
           cpu0                          cpu1
           ----                          ----
      user does lookup(stackidX)
      starts copying ips into buffer
                                         delete(stackidX)
                                         calls bpf_get_stackid()
                                          which recycles the element and
                                         overwrites with new stack trace
      
      To avoid user space seeing a partial stack trace consisting of two
      merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
      to preserve consistent stack trace delivery to user space.
      Now we can move the memset(,0) of the left-over element value from the
      critical path of bpf_get_stackid() into the slow path of the user-space lookup.
      Also disallow lookup() from a bpf program, since it's useless and the
      program shouldn't be messing with the collected stack trace.
      
      Note that a similar race between user-space lookup and kernel-side updates
      is also present in the hashmap, but it's not a new race: bpf programs were
      always allowed to modify hash and array map elements while user space
      is copying them.
      
      Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      557c0c6e
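      A simplified user-space illustration of the xchg()-based protection
      described above; the struct and function names are hypothetical
      stand-ins, not the kernel's stack map internals:
      
      #include <stdatomic.h>
      #include <stdint.h>
      #include <string.h>
      
      /* Hypothetical, simplified stand-in for a stack map bucket. */
      struct stack_bucket {
              uint32_t nr_ips;
              uint64_t ips[127];
      };
      
      /* slot: the table entry shared with the "program side" recycler. */
      static int lookup_copy_out(_Atomic(struct stack_bucket *) *slot,
                                 uint64_t *dst, uint32_t max_ips)
      {
              /* Take the bucket out of the table so a concurrent recycler
               * (bpf_get_stackid() in the real code) cannot overwrite it
               * while we copy. */
              struct stack_bucket *b = atomic_exchange(slot, NULL);
      
              if (!b)
                      return -1;      /* entry not present */
      
              uint32_t n = b->nr_ips < max_ips ? b->nr_ips : max_ips;
              memcpy(dst, b->ips, n * sizeof(uint64_t));
      
              /* Put the bucket back so the id stays valid and recyclable. */
              atomic_exchange(slot, b);
              return (int)n;
      }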
    • Alexei Starovoitov's avatar
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov authored
      If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following deadlock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock),
      which deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded:
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      In the end, pre-allocation of all map elements turned out to be the simplest
      solution, and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user-space-visible behavior.
      
      Since it's impossible to tell whether a kprobe is triggered in a safe
      location from kmalloc's point of view, use pre-allocation by default
      and introduce the new BPF_F_NO_PREALLOC flag.
      
      While testing per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      It turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is a very common pattern used
      in many of the iovisor/bcc tools, so there is an additional benefit of
      pre-allocation, since such use cases are much faster.
      
      Since all hash map elements are now pre-allocated we can remove the
      atomic increment of htab->count and save a few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simplified and customized
      for the bpf use case. bt w/smp_align uses a cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks contiguously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist.
      
      hlist+spinlock is the simplest free list with a single spinlock.
      As expected it has very bad scaling in SMP.
      
      kmalloc is the existing implementation, which is still available via the
      BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
      in the 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but it saves memory, so in cases where map->max_entries can be large
      and the number of map update/delete operations per second is low, it may make
      sense to use it.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c905981
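      Because the whole map is now charged against RLIMIT_MEMLOCK at creation
      time, user space may need to raise that limit before creating large
      pre-allocated maps. A rough user-space sketch; the per-element overhead
      used here is only a guess, the exact accounting is kernel-internal:
      
      #include <sys/resource.h>
      
      /* Rough estimate only: key + value + guessed per-element overhead,
       * with all of max_entries charged up front. */
      static int bump_memlock(unsigned key_size, unsigned value_size,
                              unsigned max_entries)
      {
              struct rlimit r;
              rlim_t want = (rlim_t)(key_size + value_size + 64) * max_entries;
      
              if (getrlimit(RLIMIT_MEMLOCK, &r))
                      return -1;
              if (r.rlim_cur == RLIM_INFINITY || r.rlim_cur >= want)
                      return 0;       /* already enough headroom */
              r.rlim_cur = r.rlim_max = want;
              return setrlimit(RLIMIT_MEMLOCK, &r);
      }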
    • Alexei Starovoitov's avatar
      bpf: introduce percpu_freelist · e19494ed
      Alexei Starovoitov authored
      Introduce a simple percpu_freelist to keep a single list of elements
      spread across per-cpu singly linked lists.
      
      /* push element into the list */
      void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
      
      /* pop element from the list */
      struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
      
      The object is pushed onto the current cpu's list.
      Pop first tries to get the object from the current cpu's list;
      if it's empty, it goes to a neighbour cpu's list.
      
      For the bpf program usage pattern the collision rate is very low,
      since programs typically push and pop the objects on the same cpu.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e19494ed
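      A simplified sketch of the pop policy described above (local cpu list
      first, then neighbours). It is only an approximation of the code added
      in kernel/bpf/percpu_freelist.{h,c}; the real version additionally
      deals with irq-safety and has separate init/populate paths:
      
      struct pcpu_freelist_node {
              struct pcpu_freelist_node *next;
      };
      
      struct pcpu_freelist_head {
              struct pcpu_freelist_node *first;
              raw_spinlock_t lock;
      };
      
      struct pcpu_freelist {
              struct pcpu_freelist_head __percpu *freelist;
      };
      
      struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
      {
              struct pcpu_freelist_node *node;
              int orig_cpu, cpu;
      
              orig_cpu = cpu = raw_smp_processor_id();
              while (1) {
                      struct pcpu_freelist_head *head = per_cpu_ptr(s->freelist, cpu);
      
                      raw_spin_lock(&head->lock);
                      node = head->first;
                      if (node)
                              head->first = node->next;
                      raw_spin_unlock(&head->lock);
                      if (node)
                              return node;
      
                      /* Local list empty: steal from the next possible cpu. */
                      cpu = cpumask_next(cpu, cpu_possible_mask);
                      if (cpu >= nr_cpu_ids)
                              cpu = cpumask_first(cpu_possible_mask);
                      if (cpu == orig_cpu)
                              return NULL;    /* all lists empty */
              }
      }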
    • Alexei Starovoitov's avatar
      bpf: prevent kprobe+bpf deadlocks · b121d1e7
      Alexei Starovoitov authored
      If a kprobe is placed within the update or delete hash map helpers
      that hold the bucket spin lock and the triggered bpf program is trying to
      grab the spinlock for the same bucket on the same cpu, it will
      deadlock.
      Fix it by extending the existing recursion prevention mechanism.
      
      Note, map_lookup and other tracing helpers don't have this problem,
      since they don't hold any locks and don't modify global data.
      bpf_trace_printk has its own recursion check and is ok as well.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b121d1e7
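      The recursion prevention boils down to a per-cpu "bpf is already
      running here" counter checked before invoking the tracing program and
      around the syscall-side map update/delete. A minimal sketch of that
      pattern; the identifier names below are illustrative, not the kernel's:
      
      /* Illustrative per-cpu recursion guard. */
      static DEFINE_PER_CPU(int, bpf_prog_running);
      
      static unsigned int run_bpf_prog_guarded(const struct bpf_prog *prog,
                                               void *ctx)
      {
              unsigned int ret = 0;
      
              preempt_disable();
              if (unlikely(__this_cpu_inc_return(bpf_prog_running) != 1))
                      /* A program or a map op holding a bucket lock is already
                       * active on this cpu: bail out instead of deadlocking. */
                      goto out;
      
              ret = BPF_PROG_RUN(prog, ctx);
      out:
              __this_cpu_dec(bpf_prog_running);
              preempt_enable();
              return ret;
      }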
    • David S. Miller's avatar
      Merge branch 'ipv6-per-netns-gc' · 8aba8b83
      David S. Miller authored
      Michal Kubecek says:
      
      ====================
      ipv6: per netns FIB6 walkers and garbage collector
      
      Commit 2ac3ac8f ("ipv6: prevent fib6_run_gc() contention") reduced
      the risk of contention on FIB6 garbage collector lock on systems with
      many CPUs. However, one of our customers can still observe heavy
      contention on fib6_gc_lock which can even trigger the soft lockup
      detector.
      
      This is caused by garbage collector running in forced mode from a timer.
      While there is one timer per network namespace, the instances of
      fib6_run_gc() running from them are protected by one global spinlock so
      that only one garbage collector can run at any moment and other
      namespaces have to wait. As most relevant data structures are separated
      per netns, there is little reason for garbage collectors blocking each
      other.
      
      Similar problem exists for walkers: changes in one tree do not need to
      adjust (and block) walkers traversing FIB trees in other namespaces.
      
      This series separates both the walkers infrastructure and garbage
      collector so that they work independently in network namespaces.
      
      v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from
      callers instead
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8aba8b83
    • Michal Kubeček's avatar
      ipv6: per netns FIB garbage collection · 3dc94f93
      Michal Kubeček authored
      One of our customers observed issues with FIB6 garbage collectors
      running in different network namespaces blocking each other, resulting
      in soft lockups (fib6_run_gc() initiated from a timer always runs in
      forced mode).
      
      Now that FIB6 walkers are separated per namespace, there is no more need
      for instances of fib6_run_gc() in different namespaces blocking each
      other. There is still a call to icmp6_dst_gc() which operates on shared
      data but this function is protected by its own shared lock.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3dc94f93
    • Michal Kubeček's avatar
      ipv6: per netns fib6 walkers · 9a03cd8f
      Michal Kubeček authored
      The IPv6 FIB data structures are separated per network namespace but
      there is still only one global walkers list and one global walker list
      lock. This means changes in one namespace unnecessarily interfere with
      walkers in other namespaces.
      
      Replace the global list with per-netns lists (and give each its own
      lock).
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a03cd8f
    • Michal Kubeček's avatar
      ipv6: replace global gc_args with local variable · 3570df91
      Michal Kubeček authored
      Global variable gc_args is only used in fib6_run_gc() and functions
      called from it. As fib6_run_gc() makes sure there is at most one
      instance of fib6_clean_all() running at any moment, we can replace
      gc_args with a local variable which will be needed once multiple
      instances (per netns) of garbage collector are allowed.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3570df91
    • David S. Miller's avatar
      Merge branch 'bnxt_en-next' · 02daec7c
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Updates for net-next.
      
      Updates to support autoneg for all supported speeds, add PF port statistics,
      and Advanced Error Reporting.
      
      v2: Fixed patch 3 to not use parentheses on function return.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02daec7c
    • Satish Baddipadige's avatar
      bnxt_en: Enable AER support. · 6316ea6d
      Satish Baddipadige authored
      Add pci_error_handler callbacks to support PCIe advanced error
      recovery.
      Signed-off-by: Satish Baddipadige <sbaddipa@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6316ea6d
    • Michael Chan's avatar
      bnxt_en: Include hardware port statistics in ethtool -S. · 8ddc9aaa
      Michael Chan authored
      Include the more useful port statistics in ethtool -S for the PF device.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ddc9aaa
    • Michael Chan's avatar
      bnxt_en: Include some hardware port statistics in ndo_get_stats64(). · 9947f83f
      Michael Chan authored
      Include some of the port error counters (e.g. crc) in ->ndo_get_stats64()
      for the PF device.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9947f83f
    • Michael Chan's avatar
      bnxt_en: Add port statistics support. · 3bdf56c4
      Michael Chan authored
      Gather periodic port statistics if the device is a PF and the link is up.
      This is triggered in bnxt_timer() every second to request firmware to DMA
      the counters.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3bdf56c4
    • Michael Chan's avatar
      bnxt_en: Extend autoneg to all speeds. · f1a082a6
      Michael Chan authored
      Allow all autoneg speeds supported by firmware to be advertised.  If
      the advertising parameter is 0, then all supported speeds will be
      advertised.
      
      Remove BNXT_ALL_COPPER_ETHTOOL_SPEED which is no longer used as all
      supported speeds can be advertised.
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1a082a6
    • Michael Chan's avatar
      bnxt_en: Use common function to get ethtool supported flags. · 4b32cacc
      Michael Chan authored
      The supported bits and advertising bits in ethtool have the same
      definitions.  The same is true for the firmware bits.  So use the
      common function to handle the conversion for both supported and
      advertising bits.
      
      v2: Don't use parentheses on function return.
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b32cacc
    • Michael Chan's avatar
      bnxt_en: Add reporting of link partner advertisement. · 3277360e
      Michael Chan authored
      Also report actual pause settings via ETHTOOL_GPAUSEPARAM to let ethtool
      resolve the actual pause settings.
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3277360e
    • Michael Chan's avatar
      bnxt_en: Refactor bnxt_fw_to_ethtool_advertised_spds(). · 27c4d578
      Michael Chan authored
      Include the conversion of pause bits and add one extra call layer so
      that the same refactored function can be reused to get the link partner
      advertisement bits.
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27c4d578
    • Kyeong Yoo's avatar
      net_sched: dsmark: use qdisc_dequeue_peeked() · f8b33d8e
      Kyeong Yoo authored
      This fix for dsmark is similar to commit 3557619f
      ("net_sched: prio: use qdisc_dequeue_peeked")
      and makes use of qdisc_dequeue_peeked() instead of a direct dequeue() call.
      
      The first time, wrr peeks dsmark, which will then peek into sfq.
      sfq dequeues an skb and it's stored in sch->gso_skb.
      The next time, wrr tries to dequeue from dsmark, which calls the sfq dequeue
      directly. This results in skipping the previously peeked skb.
      
      So change the dsmark dequeue to call qdisc_dequeue_peeked() instead, to use
      the peeked skb if it exists.
      Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8b33d8e
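      For context, qdisc_dequeue_peeked() roughly behaves as below: it hands
      back the skb a previous ->peek() stashed in sch->gso_skb, and only
      otherwise calls the child's real dequeue. This is an approximate shape;
      see include/net/sch_generic.h for the authoritative version:
      
      static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
      {
              struct sk_buff *skb = sch->gso_skb;
      
              if (skb) {
                      /* Consume the skb that a previous peek stashed away. */
                      sch->gso_skb = NULL;
                      sch->q.qlen--;
              } else {
                      skb = sch->dequeue(sch);
              }
              return skb;
      }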
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 4c38cd61
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      The following patchset contains Netfilter updates for your net-next tree,
      they are:
      
      1) Remove useless debug message when deleting IPVS service, from
         Yannick Brosseau.
      
      2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
         several spots of the IPVS code, from Arnd Bergmann.
      
      3) Add prandom_u32 support to nft_meta, from Florian Westphal.
      
      4) Remove unused variable in xt_osf, from Sudip Mukherjee.
      
      5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
         since fixing af_packet defragmentation issues, from Joe Stringer.
      
      6) On-demand hook registration for iptables from netns. Instead of
         registering the hooks for every available netns whenever we need
         one of the support tables, we register this on the specific netns
         that needs it, patchset from Florian Westphal.
      
      7) Add missing port range selection to nf_tables masquerading support.
      
      BTW, just for the record, there is a typo in the description of
      5f6c253e ("netfilter: bridge: register hooks only when bridge
      interface is added") that refers to the cluster match as deprecated, but
      it is actually the CLUSTERIP target (which registers hooks
      unconditionally) that is scheduled for removal.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c38cd61
    • David S. Miller's avatar
      Merge branch 'bpf-next' · d24ad3fc
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      A couple of misc updates to BPF. Among others, this series adds
      bpf_csum_diff() to be used with L3 csums, allows for managing
      tunnel options in collect metadata mode, and enables the ipv6
      traffic class for collect metadata in vxlan specifically (geneve
      already supports it). For more details, please see the individual
      patches.
      
      The series requires net to be merged into net-next first to
      avoid any further pending merge conflicts.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d24ad3fc
    • Daniel Borkmann's avatar
      vxlan: allow setting ipv6 traffic class · 1400615d
      Daniel Borkmann authored
      We can already do that for IPv4, but IPv6 support was missing. Add
      it for vxlan, so it can be used with collect metadata frontends.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1400615d
    • Daniel Borkmann's avatar
      bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit · db3c6139
      Daniel Borkmann authored
      The assumptions from commit 0c1d70af ("net: use dst_cache for vxlan
      device"), 468dfffc ("geneve: add dst caching support") and 3c1cb4d2
      ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
      when ip_tunnel_info is used are unfortunately not always valid as assumed.
      
      While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
      however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
      with different remote dsts, tos, etc., so it cannot be assumed to be
      packet independent.
      
      Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
      would reuse the same entry that was first created on the initial route
      lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
      a different one.
      
      Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
      in vxlan needs to be handled differently in this context, as it is currently
      inferred from ip_tunnel_info as well if present. An ip_tunnel_dst_cache_usable()
      helper is added for the three tunnel cases, which checks if we can use the dst
      cache.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Fixes: 3c1cb4d2 ("net/ipv4: add dst cache support for gre lwtunnels")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db3c6139
    • Daniel Borkmann's avatar
      bpf: support for access to tunnel options · 14ca0751
      Daniel Borkmann authored
      After eBPF being able to programmatically access/manage tunnel key meta
      data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
      and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
      for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
      helpers to generically access their auxiliary tunnel options.
      
      Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
      and for the vxlan case its GBP extension. I.e. setting a tunnel key for the
      geneve case only makes sense if we can also read/write TLVs into it. In the GBP
      case, it provides the flexibility to easily map the group policy ID in
      combination with other helpers or maps.
      
      I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
      for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
      complex by itself, and there may be cases for tunnel key backends where
      tunnel options are not always needed. If we would have integrated this
      into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
      remaining helper arguments, so keeping compatibility on structs in case of
      passing in a flat buffer gets more cumbersome. Separating both also allows
      for more flexibility and future extensibility, f.e. options could be fed
      directly from a map, etc.
      
      Moreover, change geneve's xmit path to test only for info->options_len
      instead of the TUNNEL_GENEVE_OPT flag. This makes it more consistent with
      vxlan's xmit path and avoids having to specify a protocol flag in the API on
      xmit, so it can be protocol agnostic. Having info->options_len is all the
      information that is needed. Tested with vxlan and geneve.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14ca0751
    • Daniel Borkmann's avatar
      bpf: allow to propagate df in bpf_skb_set_tunnel_key · 22080870
      Daniel Borkmann authored
      Added by 9a628224 ("ip_tunnel: Add dont fragment flag."), allow feeding
      the df flag into tunneling facilities (currently supported on TX by
      vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key()
      helper.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22080870
    • Daniel Borkmann's avatar
      bpf: make helper function protos static · 577c50aa
      Daniel Borkmann authored
      They are only used here, so there's no reason they should not be static.
      Only the vlan push/pop protos are used in the test_bpf suite.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      577c50aa
    • Daniel Borkmann's avatar
      bpf: add flags to bpf_skb_store_bytes for clearing hash · 8afd54c8
      Daniel Borkmann authored
      When overwriting parts of the packet with bpf_skb_store_bytes() that
      were fed previously into skb->hash calculation, we should clear the
      current hash with skb_clear_hash(), so that a next skb_get_hash() call
      can determine the correct hash related to this skb.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8afd54c8
    • Daniel Borkmann's avatar
      bpf: allow bpf_csum_diff to feed bpf_l3_csum_replace as well · 8050c0f0
      Daniel Borkmann authored
      Commit 7d672345 ("bpf: add generic bpf_csum_diff helper") added a
      generic checksum diff helper that can feed bpf_l4_csum_replace() with
      a target __wsum diff that is to be applied to the L4 checksum. This
      facility is very flexible, can be cascaded, allows for adding, removing,
      or diffing data, or for calculating the pseudo header checksum from
      scratch, but it can also be reused for working with the IPv4 header
      checksum.
      
      Thus, analogous to bpf_l4_csum_replace(), add a case for header field
      value of 0 to change the checksum at a given offset through a new helper
      csum_replace_by_diff(). Also, in addition to that, this provides an
      easy to use interface for feeding precalculated diffs f.e. coming from
      a map. It nicely complements bpf_l3_csum_replace() that currently allows
      only for csum updates of 2 and 4 byte diffs.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8050c0f0
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 810813c4
      David S. Miller authored
      Several cases of overlapping changes, as well as one instance
      (vxlan) of a bug fix in 'net' overlapping with code movement
      in 'net-next'.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      810813c4
  2. 07 Mar, 2016 6 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · e2857b8f
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix ordering of WEXT netlink messages so we don't see a newlink
          after a dellink, from Johannes Berg.
      
       2) Out of bounds access in minstrel_ht_set_best_prob_rate, from
          Konstantin Khlebnikov.
      
       3) Paging buffer memory leak in iwlwifi, from Matti Gottlieb.
      
       4) Wrong units used to set initial TCP rto from cached metrics, also
          from Konstantin Khlebnikov.
      
       5) Fix stale IP options data in the SKB control block from leaking
          through layers of encapsulation, from Bernie Harris.
      
       6) Zero padding len miscalculated in bnxt_en, from Michael Chan.
      
       7) Only CHECKSUM_PARTIAL packets should be passed down through GSO, fix
          from Hannes Frederic Sowa.
      
       8) Fix suspend/resume with JME networking devices, from Diego Violat
          and Guo-Fu Tseng.
      
       9) Checksums not validated properly in bridge multicast support due to
          the placement of the SKB header pointers at the time of the check,
          fix from Álvaro Fernández Rojas.
      
      10) Fix hang/timeout with r8169 if a stats fetch is done while the
          device is runtime suspended.  From Chun-Hao Lin.
      
      11) The forwarding database netlink dump facilities don't track the
          state of the dump properly, resulting in skipped/missed entries.
          From Minoura Makoto.
      
      12) Fix regression from a recent 3c59x bug fix, from Neil Horman.
      
      13) Fix list corruption in bna driver, from Ivan Vecera.
      
      14) Big endian machines crash on vlan add in bnx2x, fix from Michal
          Schmidt.
      
      15) Ethtool RSS configuration not propagated properly in mlx5 driver,
          from Tariq Toukan.
      
      16) Fix regression in PHY probing in stmmac driver, from Gabriel
          Fernandez.
      
      17) Fix SKB tailroom calculation in igmp/mld code, from Benjamin
          Poirier.
      
      18) A past change to skip empty routing headers in ipv6 extension header
          parsing accidentally caused fragment headers to not be matched any
          longer.  Fix from Florian Westphal.
      
      19) eTSEC-106 erratum needs to be applied to more gianfar chips, from
          Atsushi Nemoto.
      
      20) Fix netdev reference after free via workqueues in usb networking
          drivers, from Oliver Neukum and Bjørn Mork.
      
      21) mdio->irq is now an array rather than a pointer to dynamic memory,
          but several drivers were still trying to free it :-/ Fixes from
          Colin Ian King.
      
      22) act_ipt iptables action forgets to set the family field, thus LOG
          netfilter targets don't work with it.  Fix from Phil Sutter.
      
      23) SKB leak in ibmveth when skb_linearize() fails, from Thomas Falcon.
      
      24) pskb_may_pull() cannot be called with interrupts disabled, fix code
          that tries to do this in vmxnet3 driver, from Neil Horman.
      
      25) be2net driver leaks iomap'd memory on removal, fix from Douglas
          Miller.
      
      26) Forgotten RTNL mutex unlock in ppp_create_interface() error paths,
          from Guillaume Nault.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (97 commits)
        ppp: release rtnl mutex when interface creation fails
        cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind
        tcp: fix tcpi_segs_in after connection establishment
        net: hns: fix the bug about loopback
        jme: Fix device PM wakeup API usage
        jme: Do not enable NIC WoL functions on S0
        udp6: fix UDP/IPv6 encap resubmit path
        be2net: Don't leak iomapped memory on removal.
        vmxnet3: avoid calling pskb_may_pull with interrupts disabled
        net: ethernet: Add missing MFD_SYSCON dependency on HAS_IOMEM
        ibmveth: check return of skb_linearize in ibmveth_start_xmit
        cdc_ncm: toggle altsetting to force reset before setup
        usbnet: cleanup after bind() in probe()
        mlxsw: pci: Correctly determine if descriptor queue is full
        mlxsw: spectrum: Always decrement bridge's ref count
        tipc: fix nullptr crash during subscription cancel
        net: eth: altera: do not free array priv->mdio->irq
        net/ethoc: do not free array priv->mdio->irq
        net: sched: fix act_ipt for LOG target
        asix: do not free array priv->mdio->irq
        ...
      e2857b8f
    • Linus Torvalds's avatar
      Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs · 01ffa3df
      Linus Torvalds authored
      Pull overlayfs fixes from Miklos Szeredi:
       "Overlayfs bug fixes.  All marked as -stable material"
      
      * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
        ovl: copy new uid/gid into overlayfs runtime inode
        ovl: ignore lower entries when checking purity of non-directory entries
        ovl: fix getcwd() failure after unsuccessful rmdir
        ovl: fix working on distributed fs as lower layer
      01ffa3df
    • Linus Torvalds's avatar
      Revert "drm/radeon: call hpd_irq_event on resume" · 256faedc
      Linus Torvalds authored
      This reverts commit dbb17a21.
      
      It turns out that commit can cause problems for systems with multiple
      GPUs, and causes X to hang on at least a HP Pavilion dv7 with hybrid
      graphics.
      
      This got noticed originally in 4.4.4, where this patch had already
      gotten back-ported, but 4.5-rc7 was verified to have the same problem.
      
      Alexander Deucher says:
       "It looks like you have a muxed system so I suspect what's happening is
        that one of the display is being reported as connected for both the
        IGP and the dGPU and then the desktop environment gets confused or
        there is some sort of problem in the detect functions since the mux is not
        switched to the dGPU.  I don't see an easy fix unless Dave has any
        ideas.  I'd say just revert for now"
      Reported-by: Jörg-Volker Peetz <jvpeetz@web.de>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: stable@kernel.org  # wherever dbb17a21 got back-ported
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      256faedc
    • Guillaume Nault's avatar
      ppp: release rtnl mutex when interface creation fails · 6faac63a
      Guillaume Nault authored
      Add missing rtnl_unlock() in the error path of ppp_create_interface().
      
      Fixes: 58a89eca ("ppp: fix lockdep splat in ppp_dev_uninit()")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6faac63a
    • Bjørn Mork's avatar
      cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind · 4d06dd53
      Bjørn Mork authored
      usbnet_link_change will call schedule_work and should be
      avoided if bind is failing. Otherwise we will end up with
      scheduled work referring to a netdev which has gone away.
      
      Instead of making the call conditional, we can just defer
      it to usbnet_probe, using the driver_info flag made for
      this purpose.
      
      Fixes: 8a34b0ae ("usbnet: cdc_ncm: apply usbnet_link_change")
      Reported-by: Andrey Konovalov <andreyknvl@gmail.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d06dd53
    • Eric Dumazet's avatar
      tcp: fix tcpi_segs_in after connection establishment · a9d99ce2
      Eric Dumazet authored
      If the final packet (ACK) of the 3WHS is lost, it appears we do not properly
      account the following incoming segment into tcpi_segs_in.
      
      While we are at it, start segs_in at one, to count the SYN packet.
      
      We do not yet count the number of SYNs we received for a request sock;
      we might add this someday.
      
      packetdrill script showing proper behavior after the fix:
      
      // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
      +.020 < P. 1:1001(1000) ack 1 win 32792
      
         +0 accept(3, ..., ...) = 4
      
      +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%
      
      Fixes: 2efd055c ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9d99ce2