1. 01 Jul, 2016 28 commits
    • cgroup: bpf: Add an example to do cgroup checking in BPF · a3f74617
      Martin KaFai Lau authored
      test_cgrp2_array_pin.c:
      A userland program that creates a bpf_map (BPF_MAP_TYPE_CGROUP_ARRAY),
      populates/updates it with a cgroup2-backed fd and pins it to a
      bpf-fs file.  The pinned file can be loaded by tc and then used
      by the bpf prog later.  This program can also update an existing pinned
      array, which could be useful for debugging/testing purposes.
      
      test_cgrp2_tc_kern.c:
      A bpf prog which should be loaded by tc.  It is to demonstrate
      the usage of bpf_skb_in_cgroup.
      
      test_cgrp2_tc.sh:
      A script that glues the test_cgrp2_array_pin.c and
      test_cgrp2_tc_kern.c together.  The idea is like:
      1. Load the test_cgrp2_tc_kern.o by tc
      2. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
         with a cgroup fd
      3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
         dropped because of a match on the cgroup
      
      Most of the lines in test_cgrp2_tc.sh are boilerplate
      to set up the cgroup/bpf-fs/net-devices/netns etc.  It is
      not bulletproof on errors but should work well enough and
      give enough debug info if things do not go well.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3f74617
    • cgroup: bpf: Add bpf_skb_in_cgroup_proto · 4a482f34
      Martin KaFai Lau authored
      Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
      belongs to a descendant of a cgroup2.  It is similar to the
      feature added in netfilter:
      commit c38c4597 ("netfilter: implement xt_cgroup cgroup2 path match")
      
      The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
      which will be used by the bpf_skb_in_cgroup.
      
      Modifications to the bpf verifier ensure that BPF_MAP_TYPE_CGROUP_ARRAY
      and bpf_skb_in_cgroup() are always used together.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a482f34
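The descendant test described for bpf_skb_in_cgroup can be pictured with a small userspace sketch. Everything here is hypothetical and for illustration only: struct cg_node, cg_is_descendant() and skb_in_cgroup_sketch() are invented names, the array is a plain C array rather than a BPF map, and the return convention (1 on a match, 0 on no match, negative on a bad index or empty slot) is an assumption of the sketch, not something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup2 node; only the parent link matters. */
struct cg_node {
    const struct cg_node *parent;   /* NULL for the hierarchy root */
};

/* True if 'cg' is 'ancestor' itself or sits somewhere below it. */
static bool cg_is_descendant(const struct cg_node *cg,
                             const struct cg_node *ancestor)
{
    for (; cg != NULL; cg = cg->parent)
        if (cg == ancestor)
            return true;
    return false;
}

/*
 * Check the socket's cgroup against the cgroup stored at array[index].
 * Assumed convention: 1 = match, 0 = no match, negative = bad index or
 * empty slot.
 */
static int skb_in_cgroup_sketch(const struct cg_node *sk_cgroup,
                                const struct cg_node *const *array,
                                size_t nr_entries, size_t index)
{
    assert(array != NULL);
    if (index >= nr_entries)
        return -E2BIG;
    if (array[index] == NULL)
        return -EAGAIN;
    return cg_is_descendant(sk_cgroup, array[index]) ? 1 : 0;
}
```

A program loaded by tc would then drop or pass the packet based on that result, which is the kind of decision the example script checks with ping.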
    • cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY · 4ed8ec52
      Martin KaFai Lau authored
      Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops implementations.
      To update an element, the caller is expected to obtain a cgroup2-backed
      fd by open(cgroup2_dir) and then update the array with that fd.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ed8ec52
    • cgroup: Add cgroup_get_from_fd · 1f3fe7eb
      Martin KaFai Lau authored
      Add a helper function to get a cgroup2 from a fd.  It will be
      stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
      be introduced in a later patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f3fe7eb
    • Merge branch 'bpf-robustify' · 6bd3847b
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      Further robustify putting BPF progs
      
      This series addresses a potential issue reported to us by Jann Horn
      with regard to putting progs. The first patch moves progs generally under
      RCU destruction and the second patch refactors getting of progs to simplify
      the code a bit. For details, please see the individual patches. Note, we think
      that addressing this one in net-next should be sufficient.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bd3847b
    • bpf: refactor bpf_prog_get and type check into helper · 113214be
      Daniel Borkmann authored
      Since bpf_prog_get() and the program type check are used in a couple of places,
      refactor this into a small helper function that we can make use of. Since
      the non RO prog->aux part is not used in performance critical paths and a
      program destruction via RCU is very unlikely when doing the put, we
      shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
      check, but actually not taking the ref at all (due to being in fdget() /
      fdput() section of the bpf fd) is even cleaner and makes the diff smaller
      as well, so just go for that. Callsites are changed to make use of the new
      helper where possible.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      113214be
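The idea of folding the lookup and the type check into one helper, and of not taking a reference at all on a type mismatch, can be sketched in userspace as follows. This is not the kernel code: struct prog, fd_table and prog_get_type() are hypothetical names, and the fixed-size table merely stands in for the fdget() / fdput() section mentioned in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum prog_type { PROG_SOCKET_FILTER, PROG_KPROBE };

struct prog {
    enum prog_type type;
    int refcnt;
};

/* Hypothetical fd table standing in for the fdget()/fdput() lookup. */
#define MAX_FDS 4
static struct prog *fd_table[MAX_FDS];

/*
 * Look up the program behind 'fd' and take a reference only if its type
 * matches. On a mismatch no reference is taken, so there is nothing to
 * put afterwards. Returns 0 and stores the program in *out, or a
 * negative errno value.
 */
static int prog_get_type(int fd, enum prog_type type, struct prog **out)
{
    struct prog *p;

    assert(out != NULL);
    if (fd < 0 || fd >= MAX_FDS || fd_table[fd] == NULL)
        return -EBADF;
    p = fd_table[fd];
    if (p->type != type)
        return -EINVAL;
    p->refcnt++;
    *out = p;
    return 0;
}
```

With the check done before the reference is taken, the mismatch path never performs a put, which is the path the reported race relied on.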
    • bpf: generally move prog destruction to RCU deferral · 1aacde3d
      Daniel Borkmann authored
      Jann Horn reported the following analysis that could potentially result
      in a very hard to trigger (if not impossible) UAF race. To quote his
      event timeline:
      
       - Set up a process with threads T1, T2 and T3
       - Let T1 set up a socket filter F1 that invokes another filter F2
         through a BPF map [tail call]
       - Let T1 trigger the socket filter via a unix domain socket write,
         don't wait for completion
       - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
       - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
       - Let T3 close the file descriptor for F2, dropping the reference
         count of F2 to 2
       - At this point, T1 should have looked up F2 from the map, but not
         finished executing it
       - Let T3 remove F2 from the BPF map, dropping the reference count of
         F2 to 1
       - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
         the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
         via schedule_work()
       - At this point, the BPF program could be freed
       - BPF execution is still running in a freed BPF program
      
      While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
      event fd we're doing the syscall on doesn't disappear from underneath us
      for the whole syscall time, it may not be the case for the bpf fd used as
      an argument only after we did the put. It needs to be a valid fd pointing
      to a BPF program at the time of the call to make the bpf_prog_get() and
      while T2 gets preempted, F2 must have dropped reference to 1 on the other
      CPU. The fput() from the close() in T3 should also add additional delay
      to the reference drop via exit_task_work() when bpf_prog_release() gets
      called as well as scheduling bpf_prog_free_deferred().

      That said, it nevertheless makes sense to move the BPF prog destruction
      generally after an RCU grace period to guarantee that a scenario such as
      the one above, but also others as recently fixed in ceb56070 ("bpf, perf:
      delay release of BPF prog after grace period") with regard to tail calls,
      won't happen. Integrating bpf_prog_free_deferred() directly into the RCU
      callback is not allowed since the invocation might happen from either
      softirq or process context, so we're not permitted to block. All
      bpf_prog_put() invocations from the eBPF side (note, cBPF -> eBPF progs
      don't use this for their destruction) look good to me with call_rcu().

      Since we don't know whether at the time of attaching the program, we're
      already part of a tail call map, we need to use the RCU variant. However,
      even with this, there won't be significantly more stress on the RCU
      callback queue: situations with the above bpf_prog_get() and
      bpf_prog_put() combo in practice normally won't lead to releases, but
      even if they would, enough effort/cycles have to be put into loading a
      BPF program into the kernel already.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1aacde3d
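The two-stage deferral described in this commit (the last put hands the program to RCU, and the RCU callback in turn hands the blocking free to a work item) can be modelled with a single-threaded userspace sketch. It is only a model under stated assumptions: struct prog_sketch, active_readers, grace_period_end() and run_work() are hypothetical stand-ins, with comments marking which kernel primitive each step is meant to represent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct prog_sketch {
    int refcnt;
    bool rcu_pending;   /* zero refs, waiting for readers to finish */
    bool work_pending;  /* grace period over, free queued for process context */
    bool freed;
};

/* Stand-in for CPUs that are still inside an RCU read-side section. */
static int active_readers;

/* Drop one reference; the last put only queues the program. */
static void prog_put(struct prog_sketch *p)
{
    assert(p != NULL && p->refcnt > 0);
    if (--p->refcnt == 0)
        p->rcu_pending = true;      /* models call_rcu() */
}

/* Models the end of a grace period: only reachable with no readers left. */
static void grace_period_end(struct prog_sketch *p)
{
    if (active_readers == 0 && p->rcu_pending) {
        p->rcu_pending = false;
        p->work_pending = true;     /* models schedule_work() in the callback */
    }
}

/* Models the work item: process context, so a blocking free is allowed. */
static void run_work(struct prog_sketch *p)
{
    if (p->work_pending) {
        p->work_pending = false;
        p->freed = true;
    }
}
```

In the reported timeline, T1 still executing F2 corresponds to active_readers being nonzero, so the final put by T2 cannot lead to the program being freed underneath it.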
    • atm: horizon: Use setup_timer · 466fc793
      Amitoj Kaur Chawla authored
      Convert a call to init_timer and accompanying initializations of
      the timer's data and function fields to a call to setup_timer.
      
      The Coccinelle semantic patch that fixes this problem is
      as follows:
      @@
      expression t,d,f,e1;
      identifier x1;
      statement S1;
      @@
      
      (
      -t.data = d;
      |
      -t.function = f;
      |
      -init_timer(&t);
      +setup_timer(&t,f,d);
      |
      -init_timer_on_stack(&t);
      +setup_timer_on_stack(&t,f,d);
      )
      <... when != S1
      t.x1 = e1;
      ...>
      Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      466fc793
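For readers unfamiliar with the conversion, here is a before/after illustration using a userspace mock. The struct timer_list below is a simplified stand-in with only the fields the semantic patch touches, and housekeeping_cb(), open_coded() and converted() are hypothetical names; the real driver code is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct timer_list. */
struct timer_list {
    void (*function)(unsigned long);
    unsigned long data;
    int initialized;
};

static void init_timer(struct timer_list *t)
{
    t->initialized = 1;
}

/* One call replaces init_timer() plus the two field assignments. */
static void setup_timer(struct timer_list *t,
                        void (*fn)(unsigned long), unsigned long data)
{
    assert(t != NULL);
    init_timer(t);
    t->function = fn;
    t->data = data;
}

static unsigned long last_seen;

/* Hypothetical timer callback; records the argument it was given. */
static void housekeeping_cb(unsigned long data)
{
    last_seen = data;
}

/* Before: the open-coded form matched by the semantic patch. */
static void open_coded(struct timer_list *t, unsigned long dev)
{
    init_timer(t);
    t->data = dev;
    t->function = housekeeping_cb;
}

/* After: the form the semantic patch produces. */
static void converted(struct timer_list *t, unsigned long dev)
{
    setup_timer(t, housekeeping_cb, dev);
}
```

Both functions leave the timer in the same state; the converted form simply keeps the three pieces of initialization in one place.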
    • Merge branch 'qed-next' · e3cc6e37
      David S. Miller authored
      Manish Chopra says:

      ====================
      qede: Enhancements

      This patch series adds support for a few small fastpath features
      and does some code refactoring.

      Note regarding get/set tunable configuration via ethtool:
      surprisingly, there is NO ethtool application support for
      such configuration even though we have kernel support.
      Do let us know if we need to add support for that in user ethtool.
      
      Please consider applying this series to "net-next".
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3cc6e37
    • Manish Chopra's avatar
    • Manish Chopra's avatar
    • qede: Utilize xmit_more · 312e0676
      Manish Chopra authored
      This patch uses the xmit_more optimization to reduce
      the number of TX doorbell writes per packet.
      Signed-off-by: Manish <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      312e0676
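The doorbell batching that xmit_more enables can be sketched as below. This is a generic illustration, not the qede code: struct txq and tx_one() are hypothetical, the xmit_more flag is passed in as a plain bool, and forcing a doorbell when the queue is stopped is an assumption of the sketch about how drivers commonly avoid leaving posted descriptors unannounced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct txq {
    unsigned int prod;        /* descriptors posted by the driver */
    unsigned int doorbells;   /* doorbell writes issued to the device */
};

/*
 * Post one packet. 'xmit_more' is the stack's hint that another packet
 * follows immediately; 'queue_stopped' forces a flush so that posted
 * work is never left unannounced.
 */
static void tx_one(struct txq *q, bool xmit_more, bool queue_stopped)
{
    assert(q != NULL);
    q->prod++;
    if (!xmit_more || queue_stopped)
        q->doorbells++;       /* models the device doorbell register write */
}
```

For a burst of four packets where only the last one arrives with xmit_more cleared, this issues one doorbell write instead of four.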
    • qede: qede_poll refactoring · c774169d
      Manish Chopra authored
      This patch cleans up the qede_poll() routine a bit
      and allows qede_poll() to do a single iteration to handle
      TX completion [under heavy TX load, qede_poll() might
      run indefinitely in the while(1) loop for TX
      completion processing and cause a stuck CPU].
      Signed-off-by: Manish <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c774169d
    • qede: Add support for handling IP fragmented packets. · c72a6125
      Manish Chopra authored
      When handling IP fragmented packets with csum in their
      transport header, the csum isn't changed as part of the
      fragmentation. As a result, the packet containing the
      transport headers would have the correct csum of the original
      packet, but one that mismatches the actual packet that
      passes on the wire. Consequently, on the receive path HW would
      give an indication that the packet has an incorrect csum,
      which would cause qede to discard the incoming packet.

      Since HW also delivers a notification of IP fragments,
      change driver behavior to pass such incoming packets
      to the stack and let it make the decision whether they need
      to be dropped.
      Signed-off-by: Manish <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c72a6125
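The receive-side decision described above can be summarised in a short sketch. The flag names and the enum are hypothetical (they only loosely mirror the kernel's CHECKSUM_NONE / CHECKSUM_UNNECESSARY idea), and the actual qede flag parsing is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rx_verdict { RX_CSUM_NONE, RX_CSUM_UNNECESSARY, RX_DROP };

/* Hypothetical per-packet flags as a NIC might report them. */
struct rx_flags {
    bool l4_csum_error;   /* hardware flagged the transport checksum */
    bool l4_csum_valid;   /* hardware validated the transport checksum */
    bool ip_frag;         /* hardware identified an IP fragment */
};

static enum rx_verdict classify_rx(const struct rx_flags *f)
{
    assert(f != NULL);
    /*
     * For a fragment the hardware's checksum verdict is not meaningful,
     * so hand the packet up unverified and let the stack decide.
     */
    if (f->ip_frag)
        return RX_CSUM_NONE;
    if (f->l4_csum_error)
        return RX_DROP;
    return f->l4_csum_valid ? RX_CSUM_UNNECESSARY : RX_CSUM_NONE;
}
```

The only behavioural change relative to dropping on every checksum error is the first branch: fragments are no longer discarded by the driver.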
    • Merge branch 'tun-skb_array' · beb528d0
      David S. Miller authored
      Jason Wang says:
      
      ====================
      switch to use tx skb array in tun
      
      This series tries to switch to use skb array in tun. This is used to
      eliminate the spinlock contention between producer and consumer. The
      conversion was straightforward: just introduce a tx skb array and use
      it instead of sk_receive_queue.

      A minor issue is to keep the tx_queue_len behaviour, since tun used to
      use it for the length of sk_receive_queue. This is done through:

      - add the ability to resize multiple rings at once to avoid handling
        partial resize failure for multiple rings.
      - add the support for zero length ring.
      - introduce a notifier which is triggered when tx_queue_len is
        changed for a netdev.
      - resize all queues when tx_queue_len changes.

      Tests show about 15% improvement on guest rx pps:
      
      Before: ~1300000pps
      After : ~1500000pps
      
      Changes from V3:
      - fix kbuild warnings
      - call NETDEV_CHANGE_TX_QUEUE_LEN on IFLA_TXQLEN
      
      Changes from V2:
      - add multiple rings resizing support for ptr_ring/skb_array
      - add zero length ring support
      - introduce a NETDEV_CHANGE_TX_QUEUE_LEN
      - drop new flags

      Changes from V1:
      - switch to use skb array instead of a customized circular buffer
      - add non-blocking support
      - rename .peek to .peek_len
      - drop lockless peeking since tests show very minor improvement
      ====================
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-from-altitude: 34697 feet.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      beb528d0
    • tun: switch to use skb array for tx · 1576d986
      Jason Wang authored
      We used to queue tx packets in sk_receive_queue. This is less
      efficient since it requires spinlocks to synchronize between producer
      and consumer.

      This patch tries to address this by:

      - switch from sk_receive_queue to a skb_array, and resize it when
        tx_queue_len is changed.
      - introduce a new proto_ops peek_len which is used for peeking the
        skb length.
      - implement a tun version of peek_len for vhost_net to use and convert
        vhost_net to use peek_len if possible.
      
      Pktgen test shows about 15.3% improvement on guest receiving pps for small
      buffers:
      
      Before: ~1300000pps
      After : ~1500000pps
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1576d986
    • net: introduce NETDEV_CHANGE_TX_QUEUE_LEN · 08294a26
      Jason Wang authored
      This patch introduces a new event, NETDEV_CHANGE_TX_QUEUE_LEN, which
      is triggered when tx_queue_len changes. It can be used by net devices
      that want to do some processing at that time. An example is tun, which may
      want to resize its tx array when tx_queue_len is changed.
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08294a26
    • skb_array: add wrappers for resizing · bf900b3d
      Jason Wang authored
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf900b3d
    • ptr_ring: support resizing multiple queues · 59e6ae53
      Michael S. Tsirkin authored
      Sometimes, we need to support resizing multiple queues at once. This is
      because it is not easy to recover from a partial failure
      of multiple queues resizing.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59e6ae53
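One way to avoid partial failure when resizing several rings is to allocate every new array first and only then switch the rings over. The following userspace sketch shows that all-or-nothing pattern under simplifying assumptions: struct resizable_ring and ring_resize_multiple() are hypothetical names, entries already queued are not migrated (the rings are assumed empty), there is no locking, and alloc_budget is a fault-injection hook added purely so the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct resizable_ring {
    void **queue;
    int size;
};

static int alloc_budget = -1;   /* fault-injection hook: -1 = never fail */

static void **alloc_slots(int size)
{
    if (alloc_budget == 0)
        return NULL;
    if (alloc_budget > 0)
        alloc_budget--;
    /* Allocate at least one slot so a zero size still yields a pointer. */
    return calloc(size > 0 ? (size_t)size : 1, sizeof(void *));
}

/*
 * Resize all rings to 'size' slots, all or nothing: every new array is
 * allocated before any ring is modified, so a failure leaves all rings
 * exactly as they were.
 */
static int ring_resize_multiple(struct resizable_ring **rings, int nrings,
                                int size)
{
    void ***queues;
    int i;

    assert(rings != NULL && nrings > 0 && size >= 0);
    queues = malloc((size_t)nrings * sizeof(*queues));
    if (!queues)
        return -ENOMEM;

    for (i = 0; i < nrings; i++) {
        queues[i] = alloc_slots(size);
        if (!queues[i])
            goto nomem;
    }

    /* Point of no return: nothing below can fail. */
    for (i = 0; i < nrings; i++) {
        free(rings[i]->queue);
        rings[i]->queue = queues[i];
        rings[i]->size = size;
    }
    free(queues);
    return 0;

nomem:
    while (--i >= 0)
        free(queues[i]);
    free(queues);
    return -ENOMEM;
}
```

Resizing the rings one at a time would instead leave some rings at the new size and some at the old size after a failed allocation, which is the recovery problem the commit message refers to.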
    • skb_array: minor tweak · fd68adec
      Jason Wang authored
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd68adec
    • ptr_ring: support zero length ring · 982fb490
      Jason Wang authored
      Sometimes, we need a zero length ring. But the current code will crash since
      we don't do any check before accessing the ring. This patch fixes this.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      982fb490
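The kind of guard a zero length ring needs can be shown with a minimal single-threaded pointer ring. This is a sketch, not the ptr_ring implementation: struct ring, ring_produce() and ring_consume() are hypothetical, a NULL slot is taken to mean "empty", and no locking or memory barriers are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ring {
    void **queue;    /* slot array; may be NULL when size is 0 */
    int size;        /* number of slots, 0 is allowed */
    int producer;
    int consumer;
};

/* Returns 0 on success, -ENOSPC if the ring is full or has no slots. */
static int ring_produce(struct ring *r, void *ptr)
{
    assert(r != NULL && ptr != NULL);   /* NULL marks an empty slot */
    /* Check the size first so a zero length ring is never dereferenced. */
    if (r->size == 0 || r->queue[r->producer] != NULL)
        return -ENOSPC;
    r->queue[r->producer++] = ptr;
    if (r->producer >= r->size)
        r->producer = 0;
    return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty or has no slots. */
static void *ring_consume(struct ring *r)
{
    void *ptr;

    assert(r != NULL);
    if (r->size == 0)
        return NULL;
    ptr = r->queue[r->consumer];
    if (ptr) {
        r->queue[r->consumer++] = NULL;
        if (r->consumer >= r->size)
            r->consumer = 0;
    }
    return ptr;
}
```

Without the size checks, both functions would index into an array that has no slots at all, which is the crash the commit message describes.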
    • Merge branch 'sch_hfsc-fixes-cleanups' · 8dc7243a
      David S. Miller authored
      Michal Soltys says:

      ====================
      HFSC patches, part 1

      This is a revised version of part of the patches I submitted a really, really long
      time ago (back then I asked Patrick to ignore them as I found some issues
      shortly after submitting).
      
      Anyway this is the first set with very simple fixes/changes though some of them
      relatively subtle (I tried to do very exhaustive commit messages explaining what
      and why with those).
      
      The patches are against net-next tree.
      
      The second set will be heavier - or rather with more complex explanations, among those I have:
      
      - a fix to subtle issue introduced in
        http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281
        along with simplifying related stuff
      - update times to 96 bits (which allows "just" using 32 bit shifts and
        improves curve definition accuracy at more extreme low/high speeds)
      - add curve "merging" instead of just selecting in convex case (computations
        mirror those from concave intersection)
      
      But these are eventually for later.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8dc7243a
    • net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() · 33ef84a7
      Michal Soltys authored
      cl->cl_vt alone is relative only to the current backlog period, while
      the curve operates on cumulative virtual time. This patch adds the missing
      cl->cl_vtoff.
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33ef84a7
    • net/sched/sch_hfsc.c: go passive after vt update · ab12cb47
      Michal Soltys authored
      When a class is going passive, it should update its cl_vt first
      to be consistent with the last dequeue operation.
      
      Otherwise its cl_vt will be one packet behind and parent's cvtmax might
      not be updated as well.
      
      One possible side effect is if some class goes passive and subsequently
      goes active /without/ its parent going passive - with cl_vt lagging one
      packet behind - comparison made in init_vf() will be affected (same
      period).
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab12cb47
    • net/sched/sch_hfsc.c: remove leftover dlist and droplist · 2354f056
      Michal Soltys authored
      This is an update to:
      commit a09ceb0e ("sched: remove qdisc->drop")
      
      That commit removed qdisc->drop, but left alone dlist and droplist
      that no longer serve any meaningful purpose.
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2354f056
    • net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() · d1d0fc5e
      Michal Soltys authored
      The condition can only succeed on wrong configurations.
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1d0fc5e
    • net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline · 12d0ad3b
      Michal Soltys authored
      Realtime scheduling implemented in HFSC uses head of the queue to make
      the decision about which packet to schedule next. But in case of any
      head drop, the deadline calculated for the previous head is not
      necessarily correct for the next head (unless both packets have the same
      length).
      
      Thanks to the peek() function used during dequeue - which internally is a
      dequeue operation - hfsc is almost safe from this issue, as peek()
      dequeues and isolates the head, storing it temporarily until the real
      dequeue happens.

      But there is one exception: if after the class activation a drop happens
      before the first dequeue operation, there's never a chance to do the
      peek().

      Adding a peek() call in enqueue - if this is the first packet in a new
      backlog period AND the scheduler has a realtime curve defined - fixes that
      one corner case. The 1st hfsc_dequeue() will use that peeked packet,
      just as every subsequent hfsc_dequeue() call uses the packet peeked by
      the previous call.
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12d0ad3b
    • tcp: md5: use kmalloc() backed scratch areas · 19689e38
      Eric Dumazet authored
      Some arches have virtually mapped kernel stacks, or will soon have.

      tcp_md5_hash_header() uses an automatic variable to copy the tcp header
      before mangling th->check and calling the crypto function, which might
      be problematic on such arches.

      David says that using percpu storage is also problematic on non SMP
      builds.

      Just use kmalloc() to allocate scratch areas.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19689e38
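The pattern of copying the header into a heap-backed scratch area, rather than into a stack variable, before zeroing the checksum and hashing can be sketched in userspace like this. Several things are assumptions made for illustration: struct tcp_hdr_sketch is a simplified layout, toy_digest() is a plain byte sum and NOT MD5, and the scratch buffer is allocated with malloc() on every call, whereas the commit only says that kmalloc()-backed scratch areas are used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified header: only the field this sketch mangles is named. */
struct tcp_hdr_sketch {
    uint8_t  other[16];
    uint16_t check;
    uint16_t urg_ptr;
};

/* Stand-in for the crypto step; NOT MD5, just a byte sum. */
static uint32_t toy_digest(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len--)
        sum += *p++;
    return sum;
}

/*
 * Hash a copy of the header with the checksum field zeroed. The copy
 * lives in heap memory instead of an automatic (stack) variable.
 * Returns 0 on success, -1 if the scratch area cannot be allocated.
 */
static int hash_header(const struct tcp_hdr_sketch *th, uint32_t *digest)
{
    struct tcp_hdr_sketch *scratch;

    assert(th != NULL && digest != NULL);
    scratch = malloc(sizeof(*scratch));
    if (!scratch)
        return -1;
    memcpy(scratch, th, sizeof(*scratch));
    scratch->check = 0;             /* the caller's header stays untouched */
    *digest = toy_digest(scratch, sizeof(*scratch));
    free(scratch);
    return 0;
}
```

Two headers that differ only in their checksum field produce the same digest here, and the input header is never modified.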
  2. 30 Jun, 2016 12 commits