1. 02 Mar, 2015 5 commits
  2. 01 Mar, 2015 13 commits
    • Merge branch 'ebpf_support_for_cls_bpf' · 68932f71
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      eBPF support for cls_bpf
      
      This is the non-RFC version of my patchset posted before netdev01 [1]
      conference. It contains a couple of eBPF cleanups and preparation
      patches to get eBPF support into cls_bpf. The last patch adds the
      actual support. I'll post the iproute2 parts after the kernel bits
      are merged, an initial preview link to the code is mentioned in the
      last patch.
      
      Patch 4 and 5 were originally one patch, but I've split them into
      two parts upon request as patch 4 only is also needed for Alexei's
      tracing patches that go via tip tree.
      
      Tested with tc and all in-kernel available BPF test suites.
      
      I have configured and built LLVM with --enable-experimental-targets=BPF
      but as Alexei put it, the plan is to get rid of the experimental
      status in future [2].
      
      Thanks a lot!
      
      v1 -> v2:
       - Removed arch patches from this series
        - x86 is already queued in tip tree, under x86/mm
        - arm64 just reposted directly to arm folks
       - Rest is unchanged
      
        [1] http://thread.gmane.org/gmane.linux.network/350191
        [2] http://article.gmane.org/gmane.linux.kernel/1874969
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cls_bpf: add initial eBPF support for programmable classifiers · e2e9b654
      Daniel Borkmann authored
      This work extends the "classic" BPF programmable tc classifier so
      that it also covers native eBPF code.
      
      This allows user space to implement its own custom, 'safe' C-like
      classifiers (or whatever other frontend language LLVM et al. may
      provide in future), which can then be compiled with the LLVM eBPF
      backend to an eBPF ELF file. The result can be loaded into the
      kernel via iproute2's tc. In the kernel, the programs can be JITed
      on major archs and thus run at native performance.
      
      Simple, minimal toy example to demonstrate the workflow:
      
        #include <linux/ip.h>
        #include <linux/if_ether.h>
        #include <linux/bpf.h>
      
        #include "tc_bpf_api.h"
      
        __section("classify")
        int cls_main(struct sk_buff *skb)
        {
          return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
        }
      
        char __license[] __section("license") = "GPL";
      
      The classifier can then be compiled into eBPF opcodes and loaded
      via tc, for example:
      
        clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
        tc filter add dev em1 parent 1: bpf cls.o [...]
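
To make the toy classifier's return value concrete: a tc class ID is a 32-bit handle with a 16-bit major and a 16-bit minor part, so the expression in cls_main() yields class 800:<tos> (both in hex). The sketch below reproduces that packing in plain user-space C. The helper names (demo_make_classid, demo_classid_major, demo_classid_minor, demo_toy_classify) are invented for illustration and are not kernel or iproute2 API; the TOS byte is passed in as an argument instead of being loaded from a packet.

```c
#include <stdint.h>

/* Hypothetical helper: pack a major and a minor number into a classid.
 * The major part sits in the upper 16 bits, the minor in the lower 16. */
static inline uint32_t demo_make_classid(uint16_t major, uint16_t minor)
{
	return ((uint32_t)major << 16) | minor;
}

/* Hypothetical helpers: split a classid back into its two halves. */
static inline uint16_t demo_classid_major(uint32_t classid)
{
	return (uint16_t)(classid >> 16);
}

static inline uint16_t demo_classid_minor(uint32_t classid)
{
	return (uint16_t)(classid & 0xffffu);
}

/* Same expression as in cls_main(), with the TOS byte supplied by the
 * caller rather than read via load_byte(). */
static inline uint32_t demo_toy_classify(uint8_t tos)
{
	return demo_make_classid(0x800, tos);
}
```

With a TOS byte of 0x10, demo_toy_classify() returns 0x08000010, i.e. major 0x800 and minor 0x10.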
      
      As it has been demonstrated, the scope can even reach up to a fully
      fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).
      
      For tc, maps are allowed to be used, but from kernel context only,
      in other words, eBPF code can keep state across filter invocations.
      In future, we perhaps may reattach from a different application to
      those maps e.g., to read out collected statistics/state.
      
      Similarly as in socket filters, we may extend functionality for eBPF
      classifiers over time depending on the use cases. For that purpose,
      cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
      we can allow additional functions/accessors (e.g. an ABI compatible
      offset translation to skb fields/metadata). For an initial cls_bpf
      support, we allow the same set of helper functions as eBPF socket
      filters, but we could diverge at some point in time w/o problem.
      
      I was wondering whether cls_bpf and act_bpf could share C programs,
      I can imagine that at some point, we introduce i) further common
      handlers for both (or even beyond their scope), and/or if truly needed
      ii) some restricted function space for each of them. Both can be
      abstracted easily through struct bpf_verifier_ops in future.
      
      The context of cls_bpf versus act_bpf is slightly different though:
      a cls_bpf program will return a specific classid whereas act_bpf a
      drop/non-drop return code, latter may also in future mangle skbs.
      That said, we can surely have a "classify" and an "action" section in
      a single object file or, given the constraint mentioned above, add
      the possibility of a shared section.
      
      The workflow for getting native eBPF running from tc [1] is as
      follows: for f_bpf, I've added a slightly modified ELF parser code
      from Alexei's kernel sample, which reads out the LLVM compiled
      object, sets up maps (and dynamically fixes up map fds) if any, and
      loads the eBPF instructions all centrally through the bpf syscall.
      
      The resulting fd from the loaded program itself is being passed down
      to cls_bpf, which looks up struct bpf_prog from the fd store, and
      holds reference, so that it stays available also after tc program
      lifetime. On tc filter destruction, it will then drop its reference.
      
      Moreover, I've also added the optional possibility to annotate an
      eBPF filter with a name (e.g. path to object file, or something
      else if preferred) so that when tc dumps currently installed filters,
      some more context can be given to an admin for a given instance (as
      opposed to just the file descriptor number).
      
      Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
      exported, so that eBPF can be used from cls_bpf built as a module.
      Thanks to 60a3b225 ("net: bpf: make eBPF interpreter images
      read-only") I think this is of no concern since anything wanting to
      alter eBPF opcode after verification stage would crash the kernel.
      
        [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux · 24701ece
      Daniel Borkmann authored
      is_gpl_compatible and prog_type should be moved directly into bpf_prog
      as they stay immutable during bpf_prog's lifetime, are core attributes
      and they can be locked as read-only later on via bpf_prog_select_runtime().
      
      With a bit of rearranging, this also allows us to shrink bpf_prog_aux
      to exactly 1 cacheline.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: add sched_cls_type and map it to sk_filter's verifier ops · 96be4325
      Daniel Borkmann authored
      As discussed recently and at netconf/netdev01, we want to prevent making
      bpf_verifier_ops registration available for modules, but have them at a
      controlled place inside the kernel instead.
      
      The reason for this is, that out-of-tree modules can go crazy and define
      and register any verifier ops they want, doing all sorts of crap, even
      bypassing available GPLed eBPF helper functions. We don't want to offer
      such a shiny playground, of course, but keep strict control to ourselves
      inside the core kernel.
      
      This also encourages us to design eBPF user helpers carefully and
      generically, so they can be shared among various subsystems using eBPF.
      
      For the eBPF traffic classifier (cls_bpf), it's a good start to share
      the same helper facilities as we currently do in eBPF for socket filters.
      
      That way, we have BPF_PROG_TYPE_SCHED_CLS look like its own type, thus
      one day if there's a good reason to diverge the set of helper functions
      from the set available to socket filters, we keep ABI compatibility.
      
      In future, we could place all bpf_prog_type_list at a central place,
      perhaps.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code · d4052c4a
      Daniel Borkmann authored
      This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code,
      now that the BPF internal header can deal with it.
      
      While going over it, I also changed eBPF related functions to a sk_filter
      prefix to be more consistent with the rest of the file.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs · 0fc174de
      Daniel Borkmann authored
      Socket filter code and other subsystems with upcoming eBPF support should
      not need to deal with the fact that we have CONFIG_BPF_SYSCALL defined or
      not.
      
      Having the bpf syscall as a config option is a nice thing and I'd expect
      it to stay that way for expert users (I presume one day the default setting
      of it might change, though), but code making use of it should not care if
      it's actually enabled or not.
      
      Instead, hide this via header files and let the rest deal with it.
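
The header-side pattern described here can be sketched roughly as follows. This is a user-space mock, not the kernel header: DEMO_BPF_SYSCALL, struct demo_prog, demo_prog_get() and demo_attach() are hypothetical stand-ins for the config option, the program object, the internal lookup API and a calling subsystem. The point is only that the caller contains no ifdef of its own.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a loaded program object. */
struct demo_prog {
	int id;
};

#ifdef DEMO_BPF_SYSCALL
/* Option enabled: the real implementation is provided elsewhere. */
struct demo_prog *demo_prog_get(int ufd, int *err);
#else
/* Option disabled: an inline stub fails cleanly, so callers need no
 * conditional compilation. */
static inline struct demo_prog *demo_prog_get(int ufd, int *err)
{
	(void)ufd;
	*err = -EOPNOTSUPP;
	return NULL;
}
#endif

/* Caller code is identical in both configurations. */
static inline int demo_attach(int ufd)
{
	int err = 0;
	struct demo_prog *prog = demo_prog_get(ufd, &err);

	if (!prog)
		return err;
	return 0;
}
```

When DEMO_BPF_SYSCALL is not defined, demo_attach() simply reports -EOPNOTSUPP for any descriptor.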
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: export BPF_PSEUDO_MAP_FD to uapi · f1a66f85
      Daniel Borkmann authored
      We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
      ELF BPF loader where instructions are being loaded that need map fixups.
      
      An initial stage loads all maps into the kernel, and later on replaces
      related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
      register and the actual fd as immediate value.
      
      The kernel verifier recognizes this keyword and replaces the map fd with
      a real pointer internally.
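
As a rough illustration of the loader-side fixup described above, the sketch below patches a single 64-bit immediate load so that it refers to a map fd. It is not the iproute2 or samples/bpf loader code: struct demo_insn is a local mirror of the uapi instruction layout so that the example builds without kernel headers, and demo_fixup_map_fd() is a hypothetical helper. The opcode value 0x18 (BPF_LD | BPF_IMM | BPF_DW) and the pseudo source register value 1 are the values used in the uapi header.

```c
#include <stdint.h>

/* Local mirror of the eBPF instruction layout (8 bytes). */
struct demo_insn {
	uint8_t code;       /* opcode */
	uint8_t dst_reg:4;  /* destination register */
	uint8_t src_reg:4;  /* source register */
	int16_t off;        /* signed offset */
	int32_t imm;        /* signed immediate */
};

#define DEMO_LD_IMM64      0x18 /* BPF_LD | BPF_IMM | BPF_DW */
#define DEMO_PSEUDO_MAP_FD 1    /* value of BPF_PSEUDO_MAP_FD */

/* Hypothetical helper: mark a 64-bit immediate load as a map reference
 * by setting the pseudo source register and storing the fd in imm.
 * Returns 0 on success, -1 if the instruction is not such a load. */
static inline int demo_fixup_map_fd(struct demo_insn *insn, int map_fd)
{
	if (insn->code != DEMO_LD_IMM64)
		return -1;
	insn->src_reg = DEMO_PSEUDO_MAP_FD;
	insn->imm = map_fd;
	return 0;
}
```

A loader would apply this to every instruction that a relocation entry associates with a map, after the maps themselves have been created.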
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: constify various function pointer structs · a2c83fff
      Daniel Borkmann authored
      We can move bpf_map_ops and bpf_verifier_ops and other structs into ro
      section, bpf_map_type_list and bpf_prog_type_list into read mostly.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ebpf: remove kernel test stubs · f91fe17e
      Daniel Borkmann authored
      Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can
      remove the test stubs which were added to get the verifier suite up.
      
      We can just let the test cases probe under socket filter type instead.
      In the fill/spill test case, we cannot (yet) access fields from the
      context (skb), but we may adapt that test case in future.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 's390-next' · b656cc64
      David S. Miller authored
      Ursula Braun says:
      
      ====================
      s390: network patches for net-next
      
      here are some s390 related patches for net-next
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MAINTAINERS: update S390 NETWORK DRIVERS maintainer · 8b7ac017
      Ursula Braun authored
      remove Frank Blaschka as S390 NETWORK DRIVERS maintainer
      Acked-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
      Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qeth: Fix command sizes · ca5b20ac
      Stefan Raspl authored
      This patch adjusts two instances where we were using the (too big)
      struct qeth_ipacmd_setadpparms size instead of the commands' actual
      size. This didn't do any harm, but wasted a few bytes.
      Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
      Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390: remove claw driver · 83650a2e
      Ursula Braun authored
      claw devices are outdated and no longer supported.
      This patch removes the claw driver.
      Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 28 Feb, 2015 7 commits
    • tcp: cleanup static functions · 74abc20c
      Eric Dumazet authored
      tcp_fastopen_create_child() is static and should not be exported.
      
      tcp4_gso_segment() and tcp6_gso_segment() should be static.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hyperv: Implement netvsc_get_channels() ethtool op · 59995370
      Andrew Schwartzmeyer authored
      This adds support for reporting the actual and maximum combined channels
      count of the hv_netvsc driver via 'ethtool --show-channels'.
      
      This required adding 'max_chn' to 'struct netvsc_device', and assigning
      it 'rsscap.num_recv_que' in 'rndis_filter_device_add'. Now we can access
      the combined maximum channel count via 'struct netvsc_device' in the
      ethtool callback.
      Signed-off-by: Andrew Schwartzmeyer <andrew@schwartzmeyer.com>
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-tso' · f9c7ce18
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: tso improvements
      
      This patch series reworks tcp_tso_should_defer() a bit
      to get fewer bursts and better ECN behavior.
      
      We also removed tso_deferred field in tcp socket.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tso: allow CA_CWR state in tcp_tso_should_defer() · a0ea700e
      Eric Dumazet authored
      Another TCP issue is triggered by ECN.
      
      Under pressure, the receiver gets ECN marks and sends back ACK packets
      with the ECE TCP flag set. Senders then enter the CA_CWR state.
      
      In this state, tcp_tso_should_defer() is short-circuited:
      
      if (icsk->icsk_ca_state != TCP_CA_Open)
          goto send_now;
      
      This means that almost all ACK packets we receive trigger
      a partial send, and because cwnd is kept small, we can only send
      a small amount of data for each incoming ACK,
      which in turn generates more ACK packets.
      
      Allowing CA_Open and CA_CWR states to enable TSO defer in
      tcp_tso_should_defer() brings performance back :
      TSO autodefer has more chance to defer under pressure.
      
      This patch increases TSO and LRO/GRO efficiency back to normal levels,
      and does not impact overall ECN behavior.
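
The change to the state test can be sketched in isolation as below. This covers only the state check, not the rest of tcp_tso_should_defer(), and it is a user-space illustration: the enum is a local copy of the congestion-avoidance states in kernel order, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Local copies of the congestion-avoidance states, in kernel order. */
enum demo_ca_state {
	DEMO_CA_OPEN = 0,
	DEMO_CA_DISORDER,
	DEMO_CA_CWR,
	DEMO_CA_RECOVERY,
	DEMO_CA_LOSS,
};

/* Behavior quoted above: anything other than CA_Open sends at once. */
static inline bool demo_may_defer_old(enum demo_ca_state state)
{
	return state == DEMO_CA_OPEN;
}

/* Behavior after the change: CA_Open and CA_CWR may both defer.
 * A bitmask keeps the test to a single comparison. */
static inline bool demo_may_defer_new(enum demo_ca_state state)
{
	const unsigned int allowed = (1u << DEMO_CA_OPEN) | (1u << DEMO_CA_CWR);

	return ((1u << state) & allowed) != 0;
}
```

Only the CA_CWR case differs between the two functions; all other states behave as before.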
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tso: restore IW10 after TSO autosizing · 50c8339e
      Eric Dumazet authored
      With sysctl_tcp_min_tso_segs being 4, it is quite possible
      that tcp_tso_should_defer() decides not to send the last 2 MSS
      of an initial window of 10 packets. This also applies if
      autosizing decides to send X MSS per GSO packet and cwnd
      is not a multiple of X.
      
      This patch implements a heuristic based on the age of the first
      skb in the write queue: if it was sent very recently (less than half
      an srtt ago), we can predict that no ACK packet will arrive in less
      than half an RTT, so deferring might leave our window underutilized.
      
      This is visible on initial send (IW10) on web servers,
      but more generally on some RPC, as the last part of the message
      might need an extra RTT to get delivered.
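
The arithmetic and the heuristic above can be sketched as two small functions. This is a simplified user-space illustration, not the kernel code: the names are hypothetical, and both times are assumed to be plain microsecond values (the kernel keeps its smoothed RTT in a scaled form, which is ignored here).

```c
#include <stdbool.h>
#include <stdint.h>

/* Segments left over once cwnd is cut into GSO packets of
 * segs_per_gso MSS each. segs_per_gso must be non-zero. */
static inline uint32_t demo_tail_segs(uint32_t cwnd, uint32_t segs_per_gso)
{
	return cwnd % segs_per_gso;
}

/* Heuristic from the text: if the head of the write queue was sent
 * less than half an srtt ago, no ACK is expected soon, so the tail
 * should go out now rather than being deferred. */
static inline bool demo_send_tail_now(uint32_t head_age_us, uint32_t srtt_us)
{
	return head_age_us < srtt_us / 2;
}
```

For an initial window of 10 segments sent 4 at a time, demo_tail_segs(10, 4) gives the 2 MSS tail that the packetdrill test below expects to see sent immediately.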
      
      Tested:
      
      Ran following packetdrill test
      // A simple server-side test that sends exactly an initial window (IW10)
      // worth of packets.
      
      `sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +.1   < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.1   < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      +0    write(4, ..., 14600) = 14600
      +0    > . 1:5841(5840) ack 1 win 457
      +0    > . 5841:11681(5840) ack 1 win 457
      // Following packet should be sent right now.
      +0    > P. 11681:14601(2920) ack 1 win 457
      
      +.1   < . 1:1(0) ack 14601 win 257
      
      +0    close(4) = 0
      +0    > F. 14601:14601(0) ack 1
      +.1   < F. 1:1(0) ack 14602 win 257
      +0    > . 14602:14602(0) ack 2
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tso: remove tp->tso_deferred · 5f852eb5
      Eric Dumazet authored
      TSO relies on ability to defer sending a small amount of packets.
      Heuristic is to wait for future ACKS in hope to send more packets at once.
      Current algorithm uses a per socket tso_deferred field as a pseudo timer.
      
      This pseudo timer relies on future ACK, but there is no guarantee
      we receive them in time.
      
      A fix would be to use a real timer, but such a timer is probably too
      expensive for typical cases.
      
      This patch changes the logic to test the time of last transmit,
      because we should not add bursts of more than 1ms for any given flow.
      
      We've used this patch for about two years at Google, before FQ/pacing
      as it would reduce a fair amount of bursts.
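
A minimal sketch of a "time of last transmit" test, under stated assumptions: timestamps are 32-bit counters that tick once per millisecond, and deferral is allowed only while no full tick has passed since the last transmit. The function name and the millisecond tick are assumptions made for illustration; the kernel's actual clock source and comparison may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Allow deferral only if the last transmit happened within the current
 * millisecond tick. The signed difference keeps the comparison correct
 * when the 32-bit counter wraps around. */
static inline bool demo_may_defer(uint32_t now_ms, uint32_t last_tx_ms)
{
	return (int32_t)(now_ms - last_tx_ms) <= 0;
}
```

Unlike a per-socket pseudo timer, this test needs no extra state beyond the last-transmit timestamp and does not depend on a future ACK arriving.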
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usbnet: Fix tx_packets stat for FLAG_MULTI_PACKET drivers · 6588af61
      Ben Hutchings authored
      Currently the usbnet core does not update the tx_packets statistic for
      drivers with FLAG_MULTI_PACKET and there is no hook in the TX
      completion path where they could do this.
      
      cdc_ncm and dependent drivers are bumping tx_packets stat on the
      transmit path while asix and sr9800 aren't updating it at all.
      
      Add a packet count in struct skb_data so these drivers can fill it
      in, initialise it to 1 for other drivers, and add the packet count
      to the tx_packets statistic on completion.
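
The accounting scheme described above can be sketched as follows. The structures and function names are hypothetical user-space stand-ins (they are not the usbnet definitions); only the flow is taken from the text: a per-transfer packet count that defaults to 1, may be overwritten by a multi-packet driver, and is added to tx_packets on completion.

```c
/* Hypothetical per-transfer bookkeeping. */
struct demo_skb_data {
	unsigned long packets;
};

/* Hypothetical device statistics. */
struct demo_stats {
	unsigned long tx_packets;
};

/* Default for drivers that send exactly one packet per transfer. */
static inline void demo_entry_init(struct demo_skb_data *entry)
{
	entry->packets = 1;
}

/* A multi-packet driver records how many packets it aggregated. */
static inline void demo_set_packets(struct demo_skb_data *entry,
				    unsigned long packets)
{
	entry->packets = packets;
}

/* TX completion: account for every packet carried by the transfer. */
static inline void demo_tx_complete(struct demo_stats *stats,
				    const struct demo_skb_data *entry)
{
	stats->tx_packets += entry->packets;
}
```

Because the count is added in one place on completion, drivers no longer need to bump the statistic themselves on the transmit path.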
      Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Tested-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 27 Feb, 2015 15 commits
    • Merge branch 'tipc-next' · 721a57a0
      David S. Miller authored
      Erik Hugne says:
      
      ====================
      tipc: bug fix and some improvements
      
      Most important is a fix for a nullptr exception that would occur when
      name table subscriptions fail. The remaining patches are performance
      improvements and cosmetic changes.
      
      v2: remove unnecessary whitespace in patch #2
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: make media address offset a common define · d76a436d
      Erik Hugne authored
      With the exception of infiniband media which does not use media
      offsets, the media address is always located at offset 4 in the
      media info field as defined by the protocol, so we move the
      definition to the generic bearer.h
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: rename media/msg related definitions · 91e2eb56
      Erik Hugne authored
      The TIPC_MEDIA_ADDR_SIZE and TIPC_MEDIA_ADDR_OFFSET names
      are misleading, as they actually define the size and offset of
      the whole media info field and not the address part. This patch
      does not have any functional changes.
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: purge links when bearer is disabled · afaa3f65
      Erik Hugne authored
      If a bearer is disabled by manual intervention, all links over that
      bearer should be purged, indicated with the 'shutting_down' flag.
      Otherwise tipc will get confused if a new bearer is enabled using
      a different media type.
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix nullpointer bug when subscribing to events · 7fe8097c
      Erik Hugne authored
      If a subscription request is sent to a topology server
      connection, and any error occurs (malformed request, oom
      or limit reached) while processing this request, TIPC should
      terminate the subscriber connection. While doing so, it tries
      to access fields in an already freed (or never allocated)
      subscription element leading to a nullpointer exception.
      We fix this by removing the subscr_terminate function and
      terminate the connection immediately upon any subscription
      failure.
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: only create header copy for name distr messages · 3622c36f
      Erik Hugne authored
      The TIPC name distributor pushes topology updates to the cluster
      neighbors. Currently this is done in a unicast manner, and the
      skb holding the update is cloned for each cluster member. This
      is unnecessary, as we only modify the destnode field in the header
      so we change it to do pskb_copy instead.
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: allow TSO being set on master · 247f6d0f
      Jiri Pirko authored
      This patch allows TSO being set/unset on the master, so that GSO
      segmentation is done after team layer.
      
      Similar patch is present for bonding:
      	b0ce3508 ("bonding: allow TSO being set on bonding master")
      and bridge:
      	f902e881 ("bridge: Add ability to enable TSO")
      Suggested-by: Jiri Prochazka <jprochaz@redhat.com>
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'fib_trie_remove_leaf_info' · 7eb60345
      David S. Miller authored
      Alexander Duyck says:
      
      ====================
      fib_trie: Remove leaf_info structure
      
      This patch set removes the leaf_info structure from the IPv4 fib_trie.  The
      general idea is that the leaf_info structure itself only held about 6
      actual bits of data, beyond that it was mostly just waste.  As such we can
      drop the structure, move the 1 byte representing the prefix/suffix length
      into the fib_alias and just link it all into one list.
      
      My testing shows that this saves somewhere between 4 to 10ns depending on
      the type of test performed.  I'm suspecting that this represents 1 to 2 L1
      cache misses saved per look-up.
      
      One side effect of this change is that semantic_match_miss will now only
      increment once per leaf instead of once per leaf_info miss.  However the
      stat is already skewed now that we perform a preliminary check on the leaf
      as a part of the look-up.
      
      I also have gone through and addressed a number of ordering issues in the
      first patch since I had misread the behavior of list_add_tail.
      
      I have since run some additional testing and verified the resulting lists
      are in the same order when combining multiple prefix length and tos values
      in a single leaf.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Remove leaf_info · 79e5ad2c
      Alexander Duyck authored
      At this point the leaf_info hash is redundant.  By adding the suffix length
      to the fib_alias hash list we no longer have need of leaf_info as we can
      determine the prefix length from fa_slen.  So we can compress things by
      dropping the leaf_info structure from fib_trie and instead directly connect
      the leaves to the fib_alias hash list.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Add slen to fib alias · 9b6ebad5
      Alexander Duyck authored
      Make use of an empty spot in the alias to store the suffix length so that
      we don't need to pull that information from the leaf_info structure.
      
      This patch also makes a slight change to the user statistics.  Instead of
      incrementing semantic_match_miss once per leaf_info miss we now just
      increment it once per leaf if a match was not found.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Replace plen with slen in leaf_info · 5786ec60
      Alexander Duyck authored
      This replaces the prefix length variable in the leaf_info structure with a
      suffix length value, or host identifier length in bits.  By doing this it
      makes it easier to sort out since the tnodes and leaf are carrying this
      value as well since it is compatible with the ->pos field in tnodes.
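
The relationship between the two lengths is simple enough to state as code: for a 32-bit IPv4 key, the suffix (host identifier) length is the key length minus the prefix length. The helper names and the DEMO_KEYLENGTH constant below are illustrative, not taken from fib_trie.c.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_KEYLENGTH 32 /* bits in an IPv4 key */

/* Prefix length (network bits) to suffix length (host bits). */
static inline uint8_t demo_plen_to_slen(uint8_t plen)
{
	assert(plen <= DEMO_KEYLENGTH);
	return (uint8_t)(DEMO_KEYLENGTH - plen);
}

/* Suffix length back to prefix length. */
static inline uint8_t demo_slen_to_plen(uint8_t slen)
{
	assert(slen <= DEMO_KEYLENGTH);
	return (uint8_t)(DEMO_KEYLENGTH - slen);
}
```

A /24 route therefore has a suffix length of 8, and a /32 host route has a suffix length of 0.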
      
      I also cleaned up one spot that had some list manipulation that could be
      simplified.  I basically updated it so that we just use hlist_add_head_rcu
      instead of calling hlist_add_before_rcu on the first node in the list.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Convert fib_alias to hlist from list · 56315f9e
      Alexander Duyck authored
      There isn't any advantage to having it as a list, and by making it an hlist
      we make the fib_alias more compatible with the leaf_info in terms of the
      type of list used.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ip_level_multicast_join_leave' · 7705f730
      David S. Miller authored
      Madhu Challa says:
      
      ====================
      Multicast group join/leave at ip level
      
      This series enables configuring multicast group join/leave at ip level
      by extending the "ip address" command.
      
      It adds a new control socket mc_autojoin_sock and ifa_flag IFA_F_MCAUTOJOIN
      to invoke the corresponding igmp group join/leave api.
      
      Since the igmp group join/leave api takes the rtnl_lock the code had to
      be refactored by adding a shim layer prefixed by __ that can be invoked
      by code that already has the rtnl_lock. This way we avoid proliferation of
      work queues.
      
      The first patch in this series does the refactoring for igmp v6.
      It's based on igmp v4 changes that were added by Eric Dumazet.
      
      The second patch in this series does the group join/leave based on the
      setting of the IFA_F_MCAUTOJOIN flag.
      
      v5:
      - addressed comments from Daniel Borkmann.
       - removed blank line in patch 1/2
       - removed unused variable, const arg in patch 2/2
      v4:
      - addressed comments from Yoshifuji Hideaki.
       - Remove WARN_ON not needed because we return a value from v2.
      - addressed comments from Daniel Borkmann.
       - rename sock to mc_autojoin_sk
       - ip_mc_config() pass ifa so it needs one less argument.
       - igmp_net_{init|destroy}() use inet_ctl_sock_{create|destroy}
       - inet_rtm_newaddr() change scope of ret.
       - igmp_net_init() no need to initialize sock to NULL.
      v3:
      - addressed comments from David Miller.
       - fixed indentation and local variable order.
      v2:
      - addressed comments from Eric Dumazet.
       - removed workqueue and call __ip_mc_{join|leave}_group or
         __ipv6_sock_mc_{join|drop}
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • multicast: Extend ip address command to enable multicast group join/leave on · 93a714d6
      Madhu Challa authored
      Joining multicast group on ethernet level via "ip maddr" command would
      not work if we have an Ethernet switch that does igmp snooping since
      the switch would not replicate multicast packets on ports that did not
      have IGMP reports for the multicast addresses.
      
      Linux vxlan interfaces created via "ip link add vxlan" have the group option
      that enables them to do the required join.
      
      By extending ip address command with option "autojoin" we can get similar
      functionality for openvswitch vxlan interfaces as well as other tunneling
      mechanisms that need to receive multicast traffic. The kernel code is
      structured similar to how the vxlan driver does a group join / leave.
      
      example:
      ip address add 224.1.1.10/24 dev eth5 autojoin
      ip address del 224.1.1.10/24 dev eth5
      Signed-off-by: Madhu Challa <challa@noironetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • igmp v6: add __ipv6_sock_mc_join and __ipv6_sock_mc_drop · 46a4dee0
      Madhu Challa authored
      Based on the igmp v4 changes from Eric Dumazet:
      959d10f6 ("igmp: add __ip_mc_{join|leave}_group()")
      
      These changes are needed to perform igmp v6 join/leave while
      RTNL is held.
      
      Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
      __ipv6_sock_mc_join and  __ipv6_sock_mc_drop to avoid
      proliferation of work queues.
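
The wrapper arrangement can be illustrated with a toy user-space model. Nothing here is kernel code: a boolean flag stands in for the RTNL lock, a counter stands in for group membership state, and all names are hypothetical. The kernel marks the lock-held variant with a leading double underscore; the sketch uses a _locked suffix instead because such identifiers are reserved in user-space C.

```c
#include <assert.h>
#include <stdbool.h>

static bool demo_lock_held; /* stands in for the RTNL lock */
static int demo_members;    /* stands in for group membership state */

static inline void demo_lock(void)
{
	assert(!demo_lock_held);
	demo_lock_held = true;
}

static inline void demo_unlock(void)
{
	assert(demo_lock_held);
	demo_lock_held = false;
}

/* Core helper: the caller must already hold the lock. */
static inline int demo_mc_join_locked(void)
{
	assert(demo_lock_held);
	return ++demo_members;
}

/* Public wrapper: take the lock, call the core helper, release it. */
static inline int demo_mc_join(void)
{
	int ret;

	demo_lock();
	ret = demo_mc_join_locked();
	demo_unlock();
	return ret;
}
```

Code that already holds the lock calls the _locked variant directly, which is what removes the need to hand the work off to a work queue.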
      Signed-off-by: Madhu Challa <challa@noironetworks.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>