1. 29 Oct, 2013 10 commits
    • Joe Perches's avatar
      netconsole: Convert to pr_<level> · 22ded577
      Joe Perches authored
      Use a more current logging style.
      
      Convert printks to pr_<level>.
      
      Consolidate multiple printks into a single printk to avoid
      any possible dmesg interleaving.  Add a default "event" msg
      in case the listed types are ever expanded.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      22ded577
    • Daniel Borkmann's avatar
      net: sched: cls_bpf: add BPF-based classifier · 7d1d65cb
      Daniel Borkmann authored
      This work contains a lightweight BPF-based traffic classifier that can
      serve as a flexible alternative to ematch-based tree classification, i.e.
      now that BPF filter engine can also be JITed in the kernel. Naturally, tc
      actions and policies are supported as well with cls_bpf. Multiple BPF
      programs/filter can be attached for a class, or they can just as well be
      written within a single BPF program, that's really up to the user how he
      wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
      The notion of a BPF program's return/exit codes is being kept as follows:
      
           0: No match
          -1: Select classid given in "tc filter ..." command
        else: flowid, overwrite the default one
      
      As a minimal usage example with iproute2, we use a 3 band prio root qdisc
      on a router with sfq each as leave, and assign ssh and icmp bpf-based
      filters to band 1, http traffic to band 2 and the rest to band 3. For the
      first two bands we load the bytecode from a file, in the 2nd we load it
      inline as an example:
      
      echo 1 > /proc/sys/net/core/bpf_jit_enable
      
      tc qdisc del dev em1 root
      tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      
      tc qdisc add dev em1 parent 1:1 sfq perturb 16
      tc qdisc add dev em1 parent 1:2 sfq perturb 16
      tc qdisc add dev em1 parent 1:3 sfq perturb 16
      
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
      tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
      
      BPF programs can be easily created and passed to tc, either as inline
      'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
      compile opcodes, for example:
      
      1) People familiar with tcpdump-like filters:
      
         tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
      
      2) People that want to low-level program their filters or use BPF
         extensions that lack support by libpcap's compiler:
      
         bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
      
         ssh.ops example code:
         ldh [12]
         jne #0x800, drop
         ldb [23]
         jneq #6, drop
         ldh [20]
         jset #0x1fff, drop
         ldxb 4 * ([14] & 0xf)
         ldh [%x + 14]
         jeq #0x16, pass
         ldh [%x + 16]
         jne #0x16, drop
         pass: ret #-1
         drop: ret #0
      
      It was chosen to load bytecode into tc, since the reverse operation,
      tc filter list dev em1, is then able to show the exact commands again.
      Possible follow-up work could also include a small expression compiler
      for iproute2. Tested with the help of bmon. This idea came up during
      the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
      Eric Dumazet!
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7d1d65cb
    • Rafał Miłecki's avatar
      bgmac: separate RX descriptor setup code into a new function · d549c76b
      Rafał Miłecki authored
      This cleans code a bit and will be useful when allocating buffers in
      other places (like RX path, to avoid skb_copy_from_linear_data_offset).
      Signed-off-by: default avatarRafał Miłecki <zajec5@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d549c76b
    • Zhi Yong Wu's avatar
    • Zhi Yong Wu's avatar
    • Zhi Yong Wu's avatar
    • Zhi Yong Wu's avatar
      vxlan: silence one build warning · 39deb2c7
      Zhi Yong Wu authored
      drivers/net/vxlan.c: In function ‘vxlan_sock_add’:
      drivers/net/vxlan.c:2298:11: warning: ‘sock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      drivers/net/vxlan.c:2275:17: note: ‘sock’ was declared here
        LD      drivers/net/built-in.o
      Signed-off-by: default avatarZhi Yong Wu <wuzhy@linux.vnet.ibm.com>
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      39deb2c7
    • Hannes Frederic Sowa's avatar
      ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK · daba287b
      Hannes Frederic Sowa authored
      UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and
      IP_PMTUDISC_PROBE well enough.
      
      UFO enabled packet delivery just appends all frags to the cork and hands
      it over to the network card. So we just deliver non-DF udp fragments
      (DF-flag may get overwritten by hardware or virtual UFO enabled
      interface).
      
      UDP_CORK does enqueue the data until the cork is disengaged. At this
      point it sets the correct IP_DF and local_df flags and hands it over to
      ip_fragment which in this case will generate an icmp error which gets
      appended to the error socket queue. This is not reflected in the syscall
      error (of course, if UFO is enabled this also won't happen).
      
      Improve this by checking the pmtudisc flags before appending data to the
      socket and if we still can fit all data in one packet when IP_PMTUDISC_DO
      or IP_PMTUDISC_PROBE is set, only then proceed.
      
      We use (mtu-fragheaderlen) to check for the maximum length because we
      ensure not to generate a fragment and non-fragmented data does not need
      to have its length aligned on 64 bit boundaries. Also the passed in
      ip_options are already aligned correctly.
      
      Maybe, we can relax some other checks around ip_fragment. This needs
      more research.
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      daba287b
    • Michael Dalton's avatar
      virtio_net: migrate mergeable rx buffers to page frag allocators · 2613af0e
      Michael Dalton authored
      The virtio_net driver's mergeable receive buffer allocator
      uses 4KB packet buffers. For MTU-sized traffic, SKB truesize
      is > 4KB but only ~1500 bytes of the buffer is used to store
      packet data, reducing the effective TCP window size
      substantially. This patch addresses the performance concerns
      with mergeable receive buffers by allocating MTU-sized packet
      buffers using page frag allocators. If more than MAX_SKB_FRAGS
      buffers are needed, the SKB frag_list is used.
      Signed-off-by: default avatarMichael Dalton <mwdalton@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2613af0e
    • David S. Miller's avatar
      ipv6: Remove privacy config option. · 5d9efa7e
      David S. Miller authored
      The code for privacy extentions is very mature, and making it
      configurable only gives marginal memory/code savings in exchange
      for obfuscation and hard to read code via CPP ifdef'ery.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5d9efa7e
  2. 28 Oct, 2013 12 commits
  3. 27 Oct, 2013 7 commits
    • Sathya Perla's avatar
      be2net: add support for ndo_busy_poll · 6384a4d0
      Sathya Perla authored
      Includes:
      - ndo_busy_poll implementation
      - Locking between napi and busy_poll
      - Fix rx_post_starvation (replenish rx-queues in out-of-mememory scenario)
        logic to accomodate busy_poll.
      
      v2 changes:
      [Eric D.'s comment] call alloc_pages() with GFP_ATOMIC even in ndo_busy_poll
      context as it is not allowed to sleep.
      Signed-off-by: default avatarSathya Perla <sathya.perla@emulex.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6384a4d0
    • David S. Miller's avatar
      Merge branch 'bonding_monitor_locking' · 4d961a10
      David S. Miller authored
      Ding Tianhong says:
      
      ====================
      bonding: patchset for rcu use in bonding
      
      The slave list will add and del by bond_master_upper_dev_link() and
      bond_upper_dev_unlink(), which will call call_netdevice_notifiers(),
      even it is safe to call it in write bond lock now, but we can't sure
      that whether it is safe later, because other drivers may deal
      NETDEV_CHANGEUPPER in sleep way, so I didn't admit move the
      bond_upper_dev_unlink() in write bond lock.
      
      now the bond_for_each_slave only protect by rtnl_lock(), maybe use
      bond_for_each_slave_rcu is a good way to protect slave list for bond,
      but as a system slow path, it is no need to transform
      bond_for_each_slave() to bond_for_each_slave_rcu() in slow path, so in
      the patchset, I will remove the unused read bond lock for monitor
      function, maybe it is a better way, I will wait to accept any relay
      for it.
      
      Thanks for the Veaceslav Falico opinion.
      
      v2: add and modify commit for patchset and patch, it will be the first
      step for the whole patchset.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4d961a10
    • dingtianhong's avatar
      bonding: remove bond read lock for bond_3ad_state_machine_handler() · 5cc172c6
      dingtianhong authored
      The bond slave list may change when the monitor is running, the slave list is no longer
      protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
      1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
      to call call_netdevice_notifiers() in write lock.
      2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
      3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
      bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
      so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5cc172c6
    • dingtianhong's avatar
      bonding: remove bond read lock for bond_activebackup_arp_mon() · 80b9d236
      dingtianhong authored
      The bond slave list may change when the monitor is running, the slave list is no longer
      protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
      1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
      to call call_netdevice_notifiers() in write lock.
      2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
      3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
      bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
      so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      80b9d236
    • dingtianhong's avatar
      bonding: remove bond read lock for bond_loadbalance_arp_mon() · 7f1bb571
      dingtianhong authored
      The bond slave list may change when the monitor is running, the slave list is no longer
      protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
      1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
      to call call_netdevice_notifiers() in write lock.
      2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
      3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
      bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
      so I remove the bond->lock and add the rtnl lock to protect the whole monitor function.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f1bb571
    • dingtianhong's avatar
      bonding: remove bond read lock for bond_alb_monitor() · 2d0dafb0
      dingtianhong authored
      The bond slave list may change when the monitor is running, the slave list is no longer
      protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
      1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
      to call call_netdevice_notifiers() in write lock.
      2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
      3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
      bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
      so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d0dafb0
    • dingtianhong's avatar
      bonding: remove bond read lock for bond_mii_monitor() · 6b6c5261
      dingtianhong authored
      The bond slave list may change when the monitor is running, the slave list is no longer
      protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
      1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
      to call call_netdevice_notifiers() in write lock.
      2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
      3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
      bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
      so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6b6c5261
  4. 26 Oct, 2013 1 commit
    • David S. Miller's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · a00f6fcc
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates
      
      This series contains updates to igb, igbvf, i40e, ixgbe and ixgbevf.
      
      Dan Carpenter provides a patch for igbvf to fix a bug found by a static
      checker.  If the new MTU is very large, then "new_mtu + ETH_HLEN +
      ETH_FCS_LEN" can wrap and the check on the next line can underflow.
      
      Wei Yongjun provides 2 patches, the first against igbvf adds a missing
      iounmap() before the return from igbvf_probe().  The second against
      i40e, removes the include <linux/version.h> because it is not needed.
      
      Carolyn provides a patch for igb to fix a call to set the master/slave
      mode for all m88 generation 2 PHY's and removes the call for I210
      devices which do not need it.
      
      Stefan Assmann provides a patch for igb to fix an issue which was broke
      by:
         commit fa44f2f1
         Author: Greg Rose <gregory.v.rose@intel.com>
         Date:   Thu Jan 17 01:03:06 2013 -0800
         igb: Enable SR-IOV configuration via PCI sysfs interface
      which breaks the reloading of igb when VFs are assigned to a guest, in
      several ways.
      
      Jacob provides a patch for ixgbe and ixgbevf.  First, against ixgbe,
      cleans up ixgbe_enumerate_functions to reduce code complexity.  The
      second, against ixgbevf, adds support for ethtool's get_coalesce and
      set_coalesce command for the ixgbevf driver.
      
      Yijing Wang provides a patch for ixgbe to use pcie_capability_read_word()
      to simplify the code.
      
      Emil provides a ixgbe patch to fix an issue where the logic used to
      detect changes in rx-usecs was incorrect and was masked by the call to
      ixgbe_update_rsc().
      
      Don provides 2 patches for ixgbevf.  First creates a new function to set
      PSRTYPE.  The second bumps the ixgbevf driver version.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a00f6fcc
  5. 25 Oct, 2013 5 commits
    • Alexei Starovoitov's avatar
      net: fix rtnl notification in atomic context · 7f294054
      Alexei Starovoitov authored
      commit 991fb3f7 "dev: always advertise rx_flags changes via netlink"
      introduced rtnl notification from __dev_set_promiscuity(),
      which can be called in atomic context.
      
      Steps to reproduce:
      ip tuntap add dev tap1 mode tap
      ifconfig tap1 up
      tcpdump -nei tap1 &
      ip tuntap del dev tap1 mode tap
      
      [  271.627994] device tap1 left promiscuous mode
      [  271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940
      [  271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip
      [  271.677525] INFO: lockdep is turned off.
      [  271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G        W    3.12.0-rc3+ #73
      [  271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [  271.731254]  ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428
      [  271.760261]  ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8
      [  271.790683]  0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8
      [  271.822332] Call Trace:
      [  271.838234]  [<ffffffff817544e5>] dump_stack+0x55/0x76
      [  271.854446]  [<ffffffff8108bad1>] __might_sleep+0x181/0x240
      [  271.870836]  [<ffffffff81110ff8>] ? rcu_irq_exit+0x68/0xb0
      [  271.887076]  [<ffffffff811a80be>] kmem_cache_alloc_node+0x4e/0x2a0
      [  271.903368]  [<ffffffff810b4ddc>] ? vprintk_emit+0x1dc/0x5a0
      [  271.919716]  [<ffffffff81614d67>] ? __alloc_skb+0x57/0x2a0
      [  271.936088]  [<ffffffff810b4de0>] ? vprintk_emit+0x1e0/0x5a0
      [  271.952504]  [<ffffffff81614d67>] __alloc_skb+0x57/0x2a0
      [  271.968902]  [<ffffffff8163a0b2>] rtmsg_ifinfo+0x52/0x100
      [  271.985302]  [<ffffffff8162ac6d>] __dev_notify_flags+0xad/0xc0
      [  272.001642]  [<ffffffff8162ad0c>] __dev_set_promiscuity+0x8c/0x1c0
      [  272.017917]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
      [  272.033961]  [<ffffffff8162b109>] dev_set_promiscuity+0x29/0x50
      [  272.049855]  [<ffffffff8172e937>] packet_dev_mc+0x87/0xc0
      [  272.065494]  [<ffffffff81732052>] packet_notifier+0x1b2/0x380
      [  272.080915]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
      [  272.096009]  [<ffffffff81761c66>] notifier_call_chain+0x66/0x150
      [  272.110803]  [<ffffffff8108503e>] __raw_notifier_call_chain+0xe/0x10
      [  272.125468]  [<ffffffff81085056>] raw_notifier_call_chain+0x16/0x20
      [  272.139984]  [<ffffffff81620190>] call_netdevice_notifiers_info+0x40/0x70
      [  272.154523]  [<ffffffff816201d6>] call_netdevice_notifiers+0x16/0x20
      [  272.168552]  [<ffffffff816224c5>] rollback_registered_many+0x145/0x240
      [  272.182263]  [<ffffffff81622641>] rollback_registered+0x31/0x40
      [  272.195369]  [<ffffffff816229c8>] unregister_netdevice_queue+0x58/0x90
      [  272.208230]  [<ffffffff81547ca0>] __tun_detach+0x140/0x340
      [  272.220686]  [<ffffffff81547ed6>] tun_chr_close+0x36/0x60
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f294054
    • Hannes Frederic Sowa's avatar
      net: initialize hashrnd in flow_dissector with net_get_random_once · 66415cf8
      Hannes Frederic Sowa authored
      We also can defer the initialization of hashrnd in flow_dissector
      to its first use. Since net_get_random_once is irq safe now we don't
      have to audit the call paths if one of this functions get called by an
      interrupt handler.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66415cf8
    • Hannes Frederic Sowa's avatar
      net: make net_get_random_once irq safe · f84be2bd
      Hannes Frederic Sowa authored
      I initial build non irq safe version of net_get_random_once because I
      would liked to have the freedom to defer even the extraction process of
      get_random_bytes until the nonblocking pool is fully seeded.
      
      I don't think this is a good idea anymore and thus this patch makes
      net_get_random_once irq safe. Now someone using net_get_random_once does
      not need to care from where it is called.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f84be2bd
    • Nikolay Aleksandrov's avatar
      net: add missing dev_put() in __netdev_adjacent_dev_insert · 974daef7
      Nikolay Aleksandrov authored
      I think that a dev_put() is needed in the error path to preserve the
      proper dev refcount.
      
      CC: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@redhat.com>
      Acked-by: default avatarVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      974daef7
    • Hagen Paul Pfeifer's avatar
      netem: markov loss model transition fix · 4a3ad7b3
      Hagen Paul Pfeifer authored
      The transition from markov state "3 => lost packets within a burst
      period" to "1 => successfully transmitted packets within a gap period"
      has no *additional* loss event. The loss already happen for transition
      from 1 -> 3, this additional loss will make things go wild.
      
      E.g. transition probabilities:
      
      p13:   10%
      p31:  100%
      
      Expected:
      
      Ploss = p13 / (p13 + p31)
      Ploss = ~9.09%
      
      ... but it isn't. Even worse: we get a double loss - each time.
      So simple don't return true to indicate loss, rather break and return
      false.
      Signed-off-by: default avatarHagen Paul Pfeifer <hagen@jauu.net>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Stefano Salsano <stefano.salsano@uniroma2.it>
      Cc: Fabio Ludovici <fabio.ludovici@yahoo.it>
      Signed-off-by: default avatarHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a3ad7b3
  6. 24 Oct, 2013 5 commits