1. 23 Jan, 2016 1 commit
  2. 15 Dec, 2015 1 commit
    • Eric Dumazet's avatar
      net_sched: fix qdisc_tree_decrease_qlen() races · 36204ef3
      Eric Dumazet authored
      [ Upstream commit 4eaf3b84
      
       ]
      
      qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
      devices.
      
      One problem is that it updates sch->q.qlen and sch->qstats.drops
      on the mq/mqprio root qdisc, while it should not : Daniele
      reported underflows errors :
      [  681.774821] PAX: sch->q.qlen: 0 n: 1
      [  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
      [  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
      [  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
      [  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
      [  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
      [  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
      [  681.774962] Call Trace:
      [  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
      [  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
      [  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
      [  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
      [  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
      [  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
      [  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
      [  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
      [  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
      [  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
      [  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
      [  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
      [  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
      [  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
      
      mq/mqprio have their own ways to report qlen/drops by folding stats on
      all their queues, with appropriate locking.
      
      A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
      without proper locking : concurrent qdisc updates could corrupt the list
      that qdisc_match_from_root() parses to find a qdisc given its handle.
      
      Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
      qdisc_tree_decrease_qlen() can use to abort its tree traversal,
      as soon as it meets a mq/mqprio qdisc children.
      
      Second problem can be fixed by RCU protection.
      Qdisc are already freed after RCU grace period, so qdisc_list_add() and
      qdisc_list_del() simply have to use appropriate rcu list variants.
      
      A future patch will add a per struct netdev_queue list anchor, so that
      qdisc_tree_decrease_qlen() can have more efficient lookups.
      Reported-by: default avatarDaniele Fucini <dfucini@gmail.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      36204ef3
  3. 28 Aug, 2015 2 commits
  4. 27 Aug, 2015 1 commit
    • Daniel Borkmann's avatar
      net: sched: consolidate tc_classify{,_compat} · 3b3ae880
      Daniel Borkmann authored
      
      For classifiers getting invoked via tc_classify(), we always need an
      extra function call into tc_classify_compat(), as both are being
      exported as symbols and tc_classify() itself doesn't do much except
      handling of reclassifications when tp->classify() returned with
      TC_ACT_RECLASSIFY.
      
      CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
      all others use tc_classify(). When tc actions are being configured
      out in the kernel, tc_classify() effectively does nothing besides
      delegating.
      
      We could spare this layer and consolidate both functions. pktgen on
      single CPU constantly pushing skbs directly into the netif_receive_skb()
      path with a dummy classifier on ingress qdisc attached, improves
      slightly from 22.3Mpps to 23.1Mpps.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b3ae880
  5. 27 May, 2015 1 commit
    • WANG Cong's avatar
      net_sched: invoke ->attach() after setting dev->qdisc · 86e363dc
      WANG Cong authored
      For mq qdisc, we add per tx queue qdisc to root qdisc
      for display purpose, however, that happens too early,
      before the new dev->qdisc is finally set, this causes
      q->list points to an old root qdisc which is going to be
      freed right before assigning with a new one.
      
      Fix this by moving ->attach() after setting dev->qdisc.
      
      For the record, this fixes the following crash:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
       list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
       CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
        ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
        ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
       Call Trace:
        [<ffffffff81a44e7f>] dump_stack+0x4c/0x65
        [<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6
        [<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98
        [<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48
        [<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a
        [<ffffffff814e725b>] __list_del_entry+0x5a/0x98
        [<ffffffff814e72a7>] list_del+0xe/0x2d
        [<ffffffff81822f05>] qdisc_list_del+0x1e/0x20
        [<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6
        [<ffffffff81822676>] qdisc_graft+0x11d/0x243
        [<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4
        [<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226
        [<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194
        [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
        [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
        [<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17
        [<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93
        [<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d
        [<ffffffff818544b2>] netlink_unicast+0xcb/0x150
        [<ffffffff81161db9>] ? might_fault+0x59/0xa9
        [<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c
        [<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d
        [<ffffffff817d8967>] sock_sendmsg+0x29/0x2e
        [<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a
        [<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37
        [<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72
        [<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810def2a>] ? current_kernel_time+0xe/0x32
        [<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f
        [<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76
        [<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199
        [<ffffffff811b14d5>] ? __fget_light+0x50/0x78
        [<ffffffff817d9808>] __sys_sendmsg+0x42/0x60
        [<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c
        [<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f
       ---[ end trace ef29d3fb28e97ae7 ]---
      
      For long term, we probably need to clean up the qdisc_graft() code
      in case it hides other bugs like this.
      
      Fixes: 95dc1929
      
       ("pkt_sched: give visibility to mq slave qdiscs")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      86e363dc
  6. 13 May, 2015 1 commit
  7. 22 Apr, 2015 1 commit
  8. 09 Mar, 2015 1 commit
    • Cong Wang's avatar
      net_sched: destroy proto tp when all filters are gone · 1e052be6
      Cong Wang authored
      
      Kernel automatically creates a tp for each
      (kind, protocol, priority) tuple, which has handle 0,
      when we add a new filter, but it still is left there
      after we remove our own, unless we don't specify the
      handle (literally means all the filters under
      the tuple). For example this one is left:
      
        # tc filter show dev eth0
        filter parent 8001: protocol arp pref 49152 basic
      
      The user-space is hard to clean up these for kernel
      because filters like u32 are organized in a complex way.
      So kernel is responsible to remove it after all filters
      are gone.  Each type of filter has its own way to
      store the filters, so each type has to provide its
      way to check if all filters are gone.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e052be6
  9. 13 Jan, 2015 1 commit
  10. 22 Oct, 2014 1 commit
  11. 06 Oct, 2014 1 commit
  12. 05 Oct, 2014 1 commit
    • John Fastabend's avatar
      net: sched: suspicious RCU usage in qdisc_watchdog · 1e203c1a
      John Fastabend authored
      Suspicious RCU usage in qdisc_watchdog call needs to be done inside
      rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
      need to ensure timer is cancelled before removing qdisc structure.
      
      [ 3992.191339] ===============================
      [ 3992.191340] [ INFO: suspicious RCU usage. ]
      [ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
      [ 3992.191345] -------------------------------
      [ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
      [ 3992.191348]
      [ 3992.191348] other info that might help us debug this:
      [ 3992.191348]
      [ 3992.191351]
      [ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
      [ 3992.191353] no locks held by swapper/1/0.
      [ 3992.191355]
      [ 3992.191355] stack backtrace:
      [ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
      [ 3992.191360] Hardware name:                  /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
      [ 3992.191362]  0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
      [ 3992.191366]  ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
      [ 3992.191370]  ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
      [ 3992.191374] Call Trace:
      [ 3992.191376]  <IRQ>  [<ffffffff8178f92c>] dump_stack+0x4e/0x68
      [ 3992.191387]  [<ffffffff810c9966>] lockdep_rcu_suspicious+0xe6/0x130
      [ 3992.191392]  [<ffffffff8167213a>] qdisc_watchdog+0x8a/0xb0
      [ 3992.191396]  [<ffffffff810f93f2>] __run_hrtimer+0x72/0x420
      [ 3992.191399]  [<ffffffff810f9bcd>] ? hrtimer_interrupt+0x7d/0x240
      [ 3992.191403]  [<ffffffff816720b0>] ? tc_classify+0xc0/0xc0
      [ 3992.191406]  [<ffffffff810f9c4f>] hrtimer_interrupt+0xff/0x240
      [ 3992.191410]  [<ffffffff8109e4a5>] ? __atomic_notifier_call_chain+0x5/0x140
      [ 3992.191415]  [<ffffffff8103577b>] local_apic_timer_interrupt+0x3b/0x60
      [ 3992.191419]  [<ffffffff8179c2b5>] smp_apic_timer_interrupt+0x45/0x60
      [ 3992.191422]  [<ffffffff8179a6bf>] apic_timer_interrupt+0x6f/0x80
      [ 3992.191424]  <EOI>  [<ffffffff815ed233>] ? cpuidle_enter_state+0x73/0x2e0
      [ 3992.191432]  [<ffffffff815ed22e>] ? cpuidle_enter_state+0x6e/0x2e0
      [ 3992.191437]  [<ffffffff815ed567>] cpuidle_enter+0x17/0x20
      [ 3992.191441]  [<ffffffff810c0741>] cpu_startup_entry+0x3d1/0x4a0
      [ 3992.191445]  [<ffffffff81106fc6>] ? clockevents_config_and_register+0x26/0x30
      [ 3992.191448]  [<ffffffff81033c16>] start_secondary+0x1b6/0x260
      
      Fixes: b26b0d1e
      
       ("net: qdisc: use rcu prefix and silence sparse warnings")
      Signed-off-by: default avatarJohn Fastabend <john.r.fastabend@intel.com>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e203c1a
  13. 30 Sep, 2014 4 commits
  14. 26 Sep, 2014 1 commit
    • Eric Dumazet's avatar
      net: sched: use pinned timers · 4a8e320c
      Eric Dumazet authored
      
      While using a MQ + NETEM setup, I had confirmation that the default
      timer migration ( /proc/sys/kernel/timer_migration ) is killing us.
      
      Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
      queues) :
      
      EST="est 1sec 4sec"
      for ETH in eth1
      do
       tc qd del dev $ETH root 2>/dev/null
       tc qd add dev $ETH root handle 1: mq
       tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
       tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
       tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
       tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
       tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
       tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
       tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
       tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
      done
      
      We can see that timers get migrated into a single cpu, presumably idle
      at the time timers are set up.
      Then all qdisc dequeues run from this cpu and huge lock contention
      happens. This single cpu is stuck in softirq mode and cannot dequeue
      fast enough.
      
          39.24%  [kernel]          [k] _raw_spin_lock
           2.65%  [kernel]          [k] netem_enqueue
           1.80%  [kernel]          [k] netem_dequeue
           1.63%  [kernel]          [k] copy_user_enhanced_fast_string
           1.45%  [kernel]          [k] _raw_spin_lock_bh
      
      By pinning qdisc timers on the cpu running the qdisc, we respect proper
      XPS setting and remove this lock contention.
      
           5.84%  [kernel]          [k] netem_enqueue
           4.83%  [kernel]          [k] _raw_spin_lock
           2.92%  [kernel]          [k] copy_user_enhanced_fast_string
      
      Current Qdiscs that benefit from this change are :
      
      	netem, cbq, fq, hfsc, tbf, htb.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a8e320c
  15. 13 Sep, 2014 2 commits
  16. 11 Jun, 2014 1 commit
    • Florian Westphal's avatar
      net_sched: drr: warn when qdisc is not work conserving · 6e765a00
      Florian Westphal authored
      The DRR scheduler requires that items on the active list are work
      conserving, i.e. do not hold on to skbs for throttling purposes, etc.
      Attaching e.g. tbf renders DRR useless because all other classes on the
      active list are delayed as well.
      
      So, warn users that this configuration won't work as expected; we
      already do this in couple of other qdiscs, see e.g.
      
      commit b00355db
      
      
      ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler')
      
      The 'const' change is needed to avoid compiler warning ("discards 'const'
      qualifier from pointer target type").
      
      tested with:
      drr_hier() {
              parent=$1
              classes=$2
              for i in  $(seq 1 $classes); do
                      classid=$parent$(printf %x $i)
                      tc class add dev eth0 parent $parent classid $classid drr
      		tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit
              done
      }
      tc qdisc add dev eth0 root handle 1: drr
      drr_hier 1: 32
      tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6e765a00
  17. 02 May, 2014 1 commit
  18. 24 Apr, 2014 1 commit
  19. 12 Mar, 2014 2 commits
  20. 10 Mar, 2014 1 commit
  21. 31 Dec, 2013 1 commit
  22. 14 Dec, 2013 1 commit
  23. 10 Dec, 2013 1 commit
  24. 08 Oct, 2013 1 commit
  25. 31 Aug, 2013 2 commits
  26. 15 Aug, 2013 1 commit
    • Jesper Dangaard Brouer's avatar
      net_sched: restore "linklayer atm" handling · 8a8e3d84
      Jesper Dangaard Brouer authored
      commit 56b765b7 ("htb: improved accuracy at high rates")
      broke the "linklayer atm" handling.
      
       tc class add ... htb rate X ceil Y linklayer atm
      
      The linklayer setting is implemented by modifying the rate table
      which is send to the kernel.  No direct parameter were
      transferred to the kernel indicating the linklayer setting.
      
      The commit 56b765b7
      
       ("htb: improved accuracy at high rates")
      removed the use of the rate table system.
      
      To keep compatible with older iproute2 utils, this patch detects
      the linklayer by parsing the rate table.  It also supports future
      versions of iproute2 to send this linklayer parameter to the
      kernel directly. This is done by using the __reserved field in
      struct tc_ratespec, to convey the choosen linklayer option, but
      only using the lower 4 bits of this field.
      
      Linklayer detection is limited to speeds below 100Mbit/s, because
      at high rates the rtab is gets too inaccurate, so bad that
      several fields contain the same values, this resembling the ATM
      detect.  Fields even start to contain "0" time to send, e.g. at
      1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
      been more broken than we first realized.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a8e3d84
  27. 07 Jun, 2013 1 commit
  28. 28 Mar, 2013 1 commit
  29. 26 Mar, 2013 1 commit
  30. 22 Mar, 2013 1 commit
  31. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist...
      b67bfe0d
  32. 18 Feb, 2013 2 commits