1. 18 Feb, 2014 5 commits
    • NET: fec: only enable napi if we are successful · ce5eaf02
      Russell King authored
      If napi is left enabled after a failed attempt to bring the interface
      up, we BUG:
      
      fec 2188000.ethernet eth0: no PHY, assuming direct connection to switch
      libphy: PHY fixed-0:00 not found
      fec 2188000.ethernet eth0: could not attach to PHY
      ------------[ cut here ]------------
      kernel BUG at include/linux/netdevice.h:502!
      Internal error: Oops - BUG: 0 [#1] SMP ARM
      ...
      PC is at fec_enet_open+0x4d0/0x500
      LR is at __dev_open+0xa4/0xfc
      
      Only enable napi after we are past all the failure paths.
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_packet: remove a stray tab in packet_set_ring() · d7cf0c34
      Dan Carpenter authored
      At first glance it looks like there is a missing curly brace but
      actually the code works the same either way.  I have adjusted the
      indenting but left the code the same.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless · d3ec67c0
      David S. Miller authored
      John W. Linville says:
      
      ====================
      Please pull this batch of fixes intended for the 3.14 stream...
      
      For the iwlwifi one, Emmanuel says:
      
      "As explicitly written in the commit message, we prefer to disable Tx
      AMPDU on NICs supported by iwldvm. This feature gives a big boost in
      Tx performance, but the firmware is buggy and we can't rely on it.
      Our hope is that most of the users out there want wifi to surf on
      the web which means that they care more for Rx traffic than for Tx.
      People who want to enable it can do so with the help of a module
      parameter."
      
      On top of that...
      
      Dan Carpenter fixes a typo/thinko in ath5k.
      
      Olivier Langlois fixes a couple of rtlwifi issues, one which leaves
      IRQs disabled too long (causing a variety of problems elsewhere),
      and one which fixes an incorrect return code when failing to enable
      the NIC.
      
      Russell King fixes a NULL pointer dereference in hostap.
      
      Stanislaw Gruszka fixes a DMA coherence issue in the rtl8187 driver.
      
      Please let me know if there are problems!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sctp: fix sctp_connectx abi for ia32 emulation/compat mode · ffd59393
      Daniel Borkmann authored
      SCTP's sctp_connectx() abi breaks for 64bit kernels compiled with 32bit
      emulation (e.g. ia32 emulation or x86_x32). Due to internal usage of
      'struct sctp_getaddrs_old' which includes a struct sockaddr pointer,
      sizeof(param) check will always fail in kernel as the structure in
      64bit kernel space is 4bytes larger than for user binaries compiled
      in 32bit mode. Thus, applications making use of sctp_connectx() won't
      be able to run under such circumstances.
      
      Introduce a compat interface in the kernel to deal with such
      situations by using a 'struct compat_sctp_getaddrs_old' structure
      where user data is copied into it, and then successively transformed
      into a 'struct sctp_getaddrs_old' structure with the help of
      compat_ptr(). That fixes sctp_connectx() abi without any changes
      needed in user space, and lets the SCTP test suite pass when compiled
      in 32bit and run on 64bit kernels.
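
The 4-byte size difference and the widening step can be illustrated with a small user-space sketch. It is not the kernel code: the field layout (a 32-bit association id, a 32-bit address count, then a pointer) is an assumption made for illustration, and `widen_compat` is a hypothetical stand-in for the role compat_ptr() plays in the kernel.

```python
import struct

# Assumed layout of 'struct sctp_getaddrs_old' (illustration only):
# 32-bit association id, 32-bit address count, pointer to the addresses.
NATIVE_64 = "=iiQ"  # pointer is 8 bytes as seen by a 64-bit kernel
COMPAT_32 = "=iiI"  # pointer is 4 bytes as seen by a 32-bit user binary


def widen_compat(buf: bytes) -> bytes:
    """Re-pack the 32-bit layout into the 64-bit layout.

    Hypothetical helper: it widens the 32-bit user pointer, which is the
    job compat_ptr() does in the kernel.
    """
    assoc_id, addr_num, addrs32 = struct.unpack(COMPAT_32, buf)
    return struct.pack(NATIVE_64, assoc_id, addr_num, addrs32)


# What a 32-bit application would hand to the kernel (made-up values).
user_param = struct.pack(COMPAT_32, 7, 2, 0x0804A000)

# The sizeof() mismatch that makes the length check fail.
size_gap = struct.calcsize(NATIVE_64) - struct.calcsize(COMPAT_32)

kernel_param = widen_compat(user_param)
```

A length check written against the 64-bit layout rejects `user_param` because it is `size_gap` bytes short; after widening, the same fields are available in the native layout.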
      
      Fixes: f9c67811 ("sctp: Fix regression introduced by new sctp_connectx api")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge · 7ffb0d31
      David S. Miller authored
      Included changes:
      - fix soft-interface MTU computation
      - fix bogus pointer mangling when parsing the TT-TVLV
        container. This bug led to a wrong memory access.
      - fix memory leak by properly releasing the VLAN object
        after CRC check
      - properly check pskb_may_pull() return value
      - avoid potential race condition while adding new neighbour
      - fix potential memory leak by removing all the references
        to the orig_node object in case of initialization failure
      - fix the TT CRC computation by ensuring that every node uses
        the same byte order when hosts with different endianness are
        part of the same network
      - fix severe memory leak by freeing skb after a successful
        TVLV parsing
      - avoid potential double free when orig_node initialization
        fails
      - fix potential kernel paging error caused by the usage of
        the old value of skb->data after skb reallocation
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 17 Feb, 2014 24 commits
    • ipv4: fix counter in_slow_tot · a6254864
      Duan Jiong authored
      Since commit 89aef892 ("ipv4: Delete routing cache."), the counter
      in_slow_tot has not worked correctly.
      
      The counter in_slow_tot is increased by one when fib_lookup() returns
      successfully in ip_route_input_slow(), but the dst struct may not actually be
      created and cached at that point, so increase in_slow_tot only after the dst
      struct is created.
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • irtty-sir.c: Do not set_termios() on irtty_close() · 3eca5299
      Tommie Gannert authored
      Issuing set_termios() from irtty_close() causes kernel Oops for
      unplugged usb-serial devices.
      
      Since no other tty_ldisc calls set_termios() on close and no tty driver
      seems to check if tty->device_data is NULL or not on entry to set_termios(),
      the only solution I can come up with is to remove the irtty_stop_receiver()
      call, which only updates termios.
      Signed-off-by: Tommie Gannert <tommie@gannert.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of... · ff95fe38
      John W. Linville authored
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
    • bonding: 802.3ad: make aggregator_identifier bond-private · 163c8ff3
      Jiri Bohac authored
      aggregator_identifier is used to assign unique aggregator identifiers
      to aggregators of a bond during device enslaving.
      
      aggregator_identifier is currently a global variable that is zeroed in
      bond_3ad_initialize().
      
      This sequence will lead to duplicate aggregator identifiers for eth1 and eth3:
      
      create bond0
      change bond0 mode to 802.3ad
      enslave eth0 to bond0 		//eth0 gets agg id 1
      enslave eth1 to bond0 		//eth1 gets agg id 2
      create bond1
      change bond1 mode to 802.3ad
      enslave eth2 to bond1		//aggregator_identifier is reset to 0
      				//eth2 gets agg id 1
      enslave eth3 to bond0 		//eth3 gets agg id 2
      
      Fix this by making aggregator_identifier private to the bond.
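
The sequence above can be replayed in a small simulation. This is a sketch and not the bonding driver: both classes are hypothetical models, and it assumes the counter is zeroed when the first slave is added to a bond, as the comments in the sequence suggest.

```python
class GlobalIdBonding:
    """Buggy model: one counter shared by all bonds, zeroed on bond init."""

    def __init__(self):
        self.aggregator_identifier = 0
        self.bonds = {}

    def enslave(self, bond, slave):
        if bond not in self.bonds:
            # First slave of a bond: the 802.3ad init zeroes the shared
            # counter, which also affects every other bond.
            self.bonds[bond] = {}
            self.aggregator_identifier = 0
        self.aggregator_identifier += 1
        self.bonds[bond][slave] = self.aggregator_identifier


class PerBondIdBonding:
    """Fixed model: every bond keeps its own counter."""

    def __init__(self):
        self.next_id = {}
        self.bonds = {}

    def enslave(self, bond, slave):
        if bond not in self.bonds:
            self.bonds[bond] = {}
            self.next_id[bond] = 0
        self.next_id[bond] += 1
        self.bonds[bond][slave] = self.next_id[bond]


# The enslave order from the commit message.
SEQUENCE = [("bond0", "eth0"), ("bond0", "eth1"),
            ("bond1", "eth2"), ("bond0", "eth3")]


def run(model):
    for bond, slave in SEQUENCE:
        model.enslave(bond, slave)
    return model.bonds


buggy = run(GlobalIdBonding())
fixed = run(PerBondIdBonding())
```

In the buggy model eth1 and eth3 end up with the same identifier inside bond0; with a per-bond counter eth3 gets the next free identifier.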
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Acked-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usbnet: remove generic hard_header_len check · eb85569f
      Emil Goode authored
      This patch removes a generic hard_header_len check from the usbnet
      module that is causing dropped packets under certain circumstances
      for devices that send rx packets that cross urb boundaries.
      
      One example is the AX88772B which occasionally sends rx packets that
      cross urb boundaries where the remaining partial packet is sent with
      no hardware header. When the buffer with a partial packet contains
      fewer octets than the value of hard_header_len, the buffer is
      discarded by the usbnet module.
      
      With AX88772B this can be reproduced by using ping with a packet
      size between 1965-1976.
      
      The bug has been reported here:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=29082
      
      This patch introduces the following changes:
      - Removes the generic hard_header_len check in the rx_complete
        function in the usbnet module.
      - Introduces an ETH_HLEN check for skbs that are not cloned from
        within a rx_fixup callback.
      - For safety a hard_header_len check is added to each rx_fixup
        callback function that could be affected by this change.
        These extra checks could possibly be removed by someone
        who has the hardware to test.
      - Removes a call to dev_kfree_skb_any() and instead utilizes the
        dev->done list to queue skbs for cleanup.
      
      The changes place full responsibility on the rx_fixup callback
      functions that clone skbs to only pass valid skbs to the
      usbnet_skb_return function.
      Signed-off-by: Emil Goode <emilgoode@gmail.com>
      Reported-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: add link local route when local addr is any · 08b44656
      Nicolas Dichtel authored
      This bug was reported by Steinar H. Gunderson and was introduced by commit
      f7cb8886 ("sit/gre6: don't try to add the same route two times").
      
      root@morgental:~# ip tunnel add foo mode gre remote 1.2.3.4 ttl 64
      root@morgental:~# ip link set foo up mtu 1468
      root@morgental:~# ip -6 route show dev foo
      fe80::/64  proto kernel  metric 256
      
      but after the above commit, no such route shows up.
      
      There is no link local route because dev->dev_addr is 0 (because local ipv4
      address is 0), hence no link local address is configured.
      
      In this scenario, the link local address is added manually: 'ip -6 addr add
      fe80::1 dev foo' and because prefix is /128, no link local route is added by the
      kernel.
      
      Even if the right thing to do is to add the link local address with a /64
      prefix, we need to restore the previous behavior to avoid breaking userspace.
      Reported-by: Steinar H. Gunderson <sesse@samfundet.no>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • batman-adv: fix potential kernel paging error for unicast transmissions · 70b271a7
      Antonio Quartulli authored
      batadv_send_skb_prepare_unicast(_4addr) might reallocate the
      skb's data. If it does then our ethhdr pointer is not valid
      anymore in batadv_send_skb_unicast(), resulting in a kernel
      paging error.
      
      Fix this by refetching the ethhdr pointer after the
      potential reallocation.
      Signed-off-by: Linus Lüssing <linus.luessing@web.de>
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
    • batman-adv: avoid double free when orig_node initialization fails · a5a5cb8c
      Antonio Quartulli authored
      In the failure path of the orig_node initialization routine
      the orig_node->bat_iv.bcast_own field is free'd twice: first
      in batadv_iv_ogm_orig_get() and then later in
      batadv_orig_node_free_rcu().
      
      Fix it by removing the kfree in batadv_iv_ogm_orig_get().
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: free skb on TVLV parsing success · 05c3c8a6
      Antonio Quartulli authored
      When the TVLV parsing routine succeeds the skb is left
      untouched, thus leading to a memory leak.
      
      Fix this by consuming the skb in case of success.
      
      Introduced by ef261577
      ("batman-adv: tvlv - basic infrastructure")
      Reported-by: Russell Senior <russell@personaltelco.net>
      Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
      Tested-by: Russell Senior <russell@personaltelco.net>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: fix TT CRC computation by ensuring byte order · a30e22ca
      Antonio Quartulli authored
      When computing the CRC on a 2-byte variable the order of
      the bytes obviously alters the final result. This means
      that computing the CRC over the same value on two archs
      having different endianness leads to different numbers.
      
      The global and local translation table CRC computation
      routine makes this mistake while processing the clients'
      VIDs. The result is a continuous CRC mismatch between
      nodes having different endianness.
      
      Fix this by converting the VID to Network Order before
      processing it. This guarantees that every node uses the same
      byte order.
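
The effect is easy to reproduce outside the kernel. The sketch below uses zlib.crc32 from the Python standard library purely as a stand-in checksum (it is not the CRC routine batman-adv uses), and the function names and the VID value are made up for illustration.

```python
import struct
import zlib


def crc_host_order(vid: int, byteorder: str) -> int:
    """Checksum the 2-byte VID as it sits in memory on a given host."""
    fmt = "<H" if byteorder == "little" else ">H"
    return zlib.crc32(struct.pack(fmt, vid))


def crc_network_order(vid: int) -> int:
    """Checksum the VID after converting it to network byte order."""
    return zlib.crc32(struct.pack("!H", vid))


VID = 0x0102  # arbitrary example value

little = crc_host_order(VID, "little")  # what a little-endian node computes
big = crc_host_order(VID, "big")        # what a big-endian node computes
agreed = crc_network_order(VID)         # same on every node after the fix
```

The two host-order results differ, while the network-order result is independent of the host and equals what the big-endian node computed all along.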
      
      Introduced by 7ea7b4a1
      ("batman-adv: make the TT CRC logic VLAN specific")
      Reported-by: Russell Senior <russell@personaltelco.net>
      Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
      Tested-by: Russell Senior <russell@personaltelco.net>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: fix potential orig_node reference leak · b2262df7
      Simon Wunderlich authored
      Since batadv_orig_node_new() sets the refcount to two, assuming that
      the calling function will use a reference for putting the orig_node into
      a hash or similar, both references must be freed if initialization of
      the orig_node fails. Otherwise that object may be leaked in that error
      case.
      Reported-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
    • batman-adv: avoid potential race condition when adding a new neighbour · 08bf0ed2
      Antonio Quartulli authored
      When adding a new neighbour it is important to atomically
      perform the following:
      - check if the neighbour already exists
      - append the neighbour to the proper list
      
      If the two operations are not performed in an atomic context
      it is possible that two concurrent insertions add the same
      neighbour twice.
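
The check-then-append pattern described above can be sketched in user space with a lock. This is only an analogy for the kernel locking: the class, its method names and the example address are hypothetical.

```python
import threading


class NeighbourList:
    """Keeps each neighbour at most once, even with concurrent inserts."""

    def __init__(self):
        self._lock = threading.Lock()
        self._neighbours = []

    def add(self, addr: str) -> bool:
        """Return True if addr was appended, False if it already existed."""
        # The existence check and the append happen under the same lock,
        # so no other thread can insert between the two steps.
        with self._lock:
            if addr in self._neighbours:
                return False
            self._neighbours.append(addr)
            return True

    def snapshot(self):
        with self._lock:
            return list(self._neighbours)


neigh = NeighbourList()
threads = [threading.Thread(target=neigh.add, args=("02:00:00:00:00:01",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the eight threads are scheduled, the address ends up in the list exactly once.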
      Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: properly check pskb_may_pull return value · f1791425
      Antonio Quartulli authored
      pskb_may_pull() returns 1 on success and 0 in case of failure,
      therefore checking for the return value being negative does
      not make sense at all.
      
      This way if the function fails we will probably read beyond the current
      skb data buffer. Fix this by doing the proper check.
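
A toy model shows why the negative check can never fire. `may_pull` below is a hypothetical stand-in that only mimics the 1/0 return convention stated above; it does not model the real skb handling.

```python
def may_pull(buf_len: int, needed: int) -> int:
    """Stand-in for pskb_may_pull(): 1 on success, 0 on failure."""
    return 1 if buf_len >= needed else 0


def buggy_check_rejects(buf_len: int, needed: int) -> bool:
    # The return value is never negative, so this never rejects anything.
    return may_pull(buf_len, needed) < 0


def fixed_check_rejects(buf_len: int, needed: int) -> bool:
    # Treat 0 as the failure case.
    return not may_pull(buf_len, needed)
```

With a 4-byte buffer and 20 bytes needed, the buggy check lets the packet through while the fixed check rejects it.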
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: release vlan object after checking the CRC · 91c2b1a9
      Antonio Quartulli authored
      There is a refcounter imbalance in the CRC checking routine
      invoked on OGM reception. A vlan object is retrieved (thus
      its refcounter is increased by one) but it is never properly
      released. This leads to a memleak because the vlan object
      will never be free'd.
      
      Fix this by releasing the vlan object after having read the
      CRC.
      Reported-by: Russell Senior <russell@personaltelco.net>
      Reported-by: Daniel <daniel@makrotopia.org>
      Reported-by: cmsv <cmsv@wirelesspt.net>
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: fix TT-TVLV parsing on OGM reception · e889241f
      Antonio Quartulli authored
      When accessing a TT-TVLV container in the OGM RX path
      the variable pointing to the list of changes to apply is
      altered by mistake.
      
      This makes the TT component read data at the wrong position
      in the OGM packet buffer.
      
      Fix it by removing the bogus pointer alteration.
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • batman-adv: fix soft-interface MTU computation · 930cd6e4
      Antonio Quartulli authored
      The current MTU computation always returns a value
      smaller than 1500bytes even if the real interfaces
      have an MTU large enough to compensate the batman-adv
      overhead.
      
      Fix the computation by properly returning the highest
      admitted value.
      
      Introduced by a19d3d85
      ("batman-adv: limit local translation table max size")
      Reported-by: Russell Senior <russell@personaltelco.net>
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
    • packet: check for ndo_select_queue during queue selection · 0fd5d57b
      Daniel Borkmann authored
      Mathias reported that on an AMD Geode LX embedded board (ALiX)
      with ath9k driver PACKET_QDISC_BYPASS, introduced in commit
      d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket
      option"), triggers a WARN_ON() coming from the driver itself
      via 066dae93 ("ath9k: rework tx queue selection and fix
      queue stopping/waking").
      
      This happened because ndo_select_queue() is not invoked from the
      direct xmit path. The ieee80211 subsystem, for example, uses it to
      set the queue and TID (similar to an 802.1d tag), which is put into
      the frame through 802.11e (WMM, QoS). If that is not set, the
      pending frame counter of e.g. ath9k can get messed up.
      
      So the WARN_ON() in ath9k is absolutely legitimate. Generally,
      the hw queue selection in ieee80211 depends on the type of
      traffic, and priorities are set according to ieee80211_ac_numbers
      mapping; working in a similar way as DiffServ only on a lower
      layer, so that the AP can favour frames that have "real-time"
      requirements like voice or video data frames.
      
      Therefore, check for presence of ndo_select_queue() in netdev
      ops and, if available, invoke it with a fallback handler to
      __packet_pick_tx_queue(), so that drivers such as bnx2x, ixgbe,
      or mlx4 can still select a hw queue for transmission in
      relation to the current CPU while e.g. the ieee80211 subsystem
      can make its own choices.
      Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: move netdev_cap_txqueue for shared usage to header · b9507bda
      Daniel Borkmann authored
      In order to allow users to invoke netdev_cap_txqueue, it needs to
      be moved into netdevice.h header file. While at it, also add kernel
      doc header to document the API.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: add queue selection fallback handler for ndo_select_queue · 99932d4f
      Daniel Borkmann authored
      Add a new argument for ndo_select_queue() callback that passes a
      fallback handler. This gets invoked through netdev_pick_tx();
      the fallback handler is currently __netdev_pick_tx(), as most drivers
      invoke this function within their customized implementation
      for skbs that don't need any special handling. This fallback
      handler can then be replaced on other call-sites with different
      queue selection methods (e.g. in packet sockets, pktgen etc).
      
      This also has the nice side-effect that __netdev_pick_tx() is
      then only invoked from netdev_pick_tx() and export of that
      function to modules can be undone.
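
The idea of passing the fallback into the driver callback can be sketched as follows. All names, signatures and queue-selection rules here are hypothetical simplifications for illustration; the real callback operates on net_device and sk_buff structures and the real default selection is more involved than a modulo.

```python
def default_pick(num_queues: int, flow_hash: int) -> int:
    """Simplified stand-in for the stack's default queue choice."""
    return flow_hash % num_queues


def cpu_pick(cpu: int):
    """Build an alternative fallback that picks a queue from the CPU id,
    as a call site such as packet sockets might prefer."""
    def pick(num_queues: int, flow_hash: int) -> int:
        return cpu % num_queues
    return pick


def driver_select_queue(num_queues, flow_hash, priority, fallback):
    """Hypothetical driver callback: handle special traffic itself and
    delegate everything else to whatever fallback the caller passed in."""
    if priority is not None:
        return min(priority, num_queues - 1)
    return fallback(num_queues, flow_hash)


# Regular transmit path: the stack passes its default handler.
regular = driver_select_queue(4, 10, None, default_pick)

# Another call site: same driver callback, different fallback.
bypass = driver_select_queue(4, 10, None, cpu_pick(5))

# Traffic the driver wants to steer itself never reaches the fallback.
prioritised = driver_select_queue(4, 10, 3, default_pick)
```

The driver code stays the same in all three calls; only the caller decides which fallback applies, which is why the default handler no longer needs to be called directly from driver code.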
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net: tulip_remove_one needs to call pci_disable_device() · c321f7d7
      Ingo Molnar authored
      Otherwise the device is not completely shut down.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer · ef2820a7
      Matija Glavinic Pecotic authored
      The current (a)rwnd calculation can lead to severe performance issues and to
      associations stalling completely. These problems are described below, and a
      solution is proposed which improves lksctp's robustness under congestion.
      
      1) Sudden drop of a_rwnd and incomplete window recovery afterwards
      
      sctp_assoc_rwnd_decrease accounts only for the payload size (sctp data); the
      size of the sk_buff, which is charged against the receiver buffer, is not
      accounted in rwnd. In theory this should not be a problem, as the actual size
      of the buffer is double the amount requested on the socket (SO_RCVBUF). The
      problem is that this scales badly for data smaller than sizeof(struct sk_buff).
      E.g. in 4G (LTE) networks, the link interfacing the radio side carries a large
      portion of traffic of this size (less than 100B).
      
      An example of sudden drop and incomplete window recovery is given below. Node B
      exhibits problematic behavior. Node A initiates association and B is configured
      to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
      message in 4G (LTE) network). On B data is left in buffer by not reading socket
      in userspace.
      
      Let's examine when we hit the pressure state and declare rwnd to be 0 for the
      scenario with the parameters stated above (rwnd == 10000, chunk size == 43,
      each chunk sent in a separate sctp packet).
      
      Logic is implemented in sctp_assoc_rwnd_decrease:
      
      socket_buffer (see below) is the maximum size which can be held in the socket
      buffer (sk_rcvbuf). currently_alloced is the amount of data currently allocated
      (rx_count).
      
      A simple expression is given for which it will be examined after how many
      packets for above stated parameters we enter pressure state:
      
      We start by condition which has to be met in order to enter pressure state:
      
      	socket_buffer < currently_alloced;
      
      currently_alloced is represented as size of sctp packets received so far and not
      yet delivered to userspace. x is the number of chunks/packets (since there is no
      bundling, and each chunk is delivered in separate packet, we can observe each
      chunk also as sctp packet, and what is important here, having its own sk_buff):
      
      	socket_buffer < x*each_sctp_packet;
      
      each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
      twice the amount of initially requested size of socket buffer, which is in case
      of sctp, twice the a_rwnd requested:
      
      	2*rwnd < x*(payload+sizeof(struct sk_buff));
      
      sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
      and each payload size is 43
      
      	20000 < x(43+190);
      
      	x > 20000/233;
      
      	x ~> 84;
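
The inequality can also be evaluated directly. The sketch below is only a calculator for the simplified model in this message (one sk_buff per chunk, socket buffer equal to twice the requested rwnd); the function name is made up. Evaluated exactly, 20000/233 is about 85.8, so the model crosses the limit with the 86th packet, the same order of magnitude as the rounded figure quoted here.

```python
def packets_until_pressure(rwnd: int, payload: int, skb_size: int) -> int:
    """Smallest x with 2*rwnd < x * (payload + skb_size)."""
    socket_buffer = 2 * rwnd
    per_packet = payload + skb_size
    # Integer division gives the largest x that still fits; one more
    # packet is the first to exceed the socket buffer.
    return socket_buffer // per_packet + 1


# Parameters from the text: rwnd 10000, 43-byte payload, 190-byte sk_buff.
x = packets_until_pressure(10000, 43, 190)

# Payload bytes received by then, i.e. what rwnd has been decreased by.
payload_bytes = x * 43
```

The point of the calculation is the ratio: pressure is reached after only a few kilobytes of payload, while the advertised window has barely moved.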
      
      After ~84 messages, pressure state is entered and 0 rwnd is advertised while
      received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
      drop from 6474 to 0, as it will be now shown in example:
      
      IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
      IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
      IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
      <...>
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
      
      --> Sudden drop
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      At this point, rwnd_press stores current rwnd value so it can be later restored
      in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
      slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
      condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
      adding amount which is read from userspace. Let us observe values in above
      example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
      amount of actual sctp data currently waiting to be delivered to userspace
      is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
      only for sctp data, which is ~3500. Condition is never met, and when userspace
      reads all data, rwnd stays on 3569.
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      --> At this point userspace read everything, rwnd recovered only to 3569
      
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
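
Why the window never recovers can be seen in a simplified model of the rule described above: after pressure, rwnd restarts from 0, grows only by the payload bytes read, and restoration would only begin once rwnd reaches rwnd_press. The function below is a hypothetical sketch of that rule, not the kernel routine; the rwnd_press value of 6431 is an assumption (10000 minus the 3569 bytes seen in the trace).

```python
def drain(rwnd_press: int, buffered: int, read_size: int):
    """Read all buffered payload; return the final rwnd and whether the
    rwnd_press threshold was ever reached."""
    rwnd = 0
    reached = False
    while buffered > 0:
        chunk = min(read_size, buffered)
        buffered -= chunk
        rwnd += chunk  # only payload bytes are credited back
        reached = reached or rwnd >= rwnd_press
    return rwnd, reached


# 3569 bytes of payload are waiting; rwnd_press was saved at about 6400.
final_rwnd, restored = drain(rwnd_press=6431, buffered=3569, read_size=43)
```

Once everything has been read there is nothing left to credit, so the window stays at the amount of payload that was buffered and the threshold is never reached.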
      
      Reproduction is straightforward: it is enough for the sender to send packets of
      size less than sizeof(struct sk_buff) and for the receiver to keep them in its
      buffers.
      
      2) Minute size window for associations sharing the same socket buffer
      
      In case multiple associations share the same socket, and same socket buffer
      (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
      of the associations can permanently drop rwnd of other association(s).
      
      Situation will be typically observed as one association suddenly having rwnd
      dropped to size of last packet received and never recovering beyond that point.
      Different scenarios will lead to it, but all have in common that one of the
      associations (let it be association from 1)) nearly depleted socket buffer, and
      the other association blames socket buffer just for the amount enough to start
      the pressure. This association will enter pressure state, set rwnd_press and
      announce 0 rwnd.
      When data is read by userspace, similar situation as in 1) will occur, rwnd will
      increase just for the size read by userspace but rwnd_press will be high enough
      so that association doesn't have enough credit to reach rwnd_press and restore
      to previous state. This case is special case of 1), being worse as there is, in
      the worst case, only one packet in buffer for which size rwnd will be increased.
      Consequence is association which has very low maximum rwnd ('minute size', in
      our case down to 43B - size of packet which caused pressure) and as such
      unusable.
      
      This scenario happened frequently in the field and in labs after congestion
      (link breaks, different probabilities of packet drop, packet reordering) and
      with scenario 1) preceding. A deterministic scenario for reproduction is given
      here:
      
      From node A establish two associations on the same socket, with rcvbuf_policy
      being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
      repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
      scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
      bring it down close to 0, as in just one more packet would close it. This has as
      a consequence that association number 2 is able to receive (at least) one more
      packet which will bring it in pressure state. E.g. if association 2 had rwnd of
      10000, packet received was 43, and we enter at this point into pressure,
      rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
      increase for 43, but conditions to restore rwnd to original state, just as in
      1), will never be satisfied.
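
      The arithmetic above can be traced with a small userspace model. This is
      only a sketch of the behaviour as described in this message, not the kernel
      code: struct toy_assoc and both toy_* functions are hypothetical names, and
      restoring the whole of rwnd_press in one step is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one association's window state; field names follow the text. */
struct toy_assoc {
	unsigned int rwnd;       /* window currently advertised to the peer */
	unsigned int rwnd_press; /* window withheld when pressure was entered */
};

/* Account for len received bytes; buffer_full is nonzero when the shared
 * socket buffer is exhausted at this point. */
static void toy_rwnd_decrease(struct toy_assoc *a, unsigned int len, int buffer_full)
{
	assert(a != NULL);
	a->rwnd = (a->rwnd >= len) ? a->rwnd - len : 0;
	if (buffer_full) {
		/* Enter pressure: remember what was left and announce 0 rwnd. */
		a->rwnd_press += a->rwnd;
		a->rwnd = 0;
	}
}

/* Account for len bytes read by userspace. */
static void toy_rwnd_increase(struct toy_assoc *a, unsigned int len)
{
	assert(a != NULL);
	a->rwnd += len;
	/* Restoration starts only once rwnd has climbed back to rwnd_press
	 * (assumption: restore everything at once). */
	if (a->rwnd_press && a->rwnd >= a->rwnd_press) {
		a->rwnd += a->rwnd_press;
		a->rwnd_press = 0;
	}
}
```

      Starting from rwnd 10000, one 43B packet received under pressure leaves
      rwnd at 0 and rwnd_press at 9957. Reading that packet raises rwnd to 43,
      which is far below rwnd_press, so the restore branch is never taken.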
      
      --> Association 1, between A.y and B.12345
      
      IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
      IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
      IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Association 2, between A.z and B.12346
      
      IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
      IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
      IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
      IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Deplete socket buffer by sending messages of size 43B over association 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      
      <...>
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
      
      --> Sudden drop on 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Here userspace reads and rwnd 'recovers' to 3698; now deplete again using
          association 1 so that there is room in the buffer for only one more packet
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      --> Socket buffer is almost depleted, but there is space for one more packet;
          send it over association 2, size 43B
      
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Immediate drop
      
      IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Read everything from the socket; both associations recover up to the maximum
          rwnd they are capable of reaching. Note that association 1 recovered up to
          3698, and association 2 recovered only to 43
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      A careful reader might wonder why it is necessary to reproduce 1) prior to
      reproducing 2). It simply makes it easier to observe when to send the packet over
      association 2 that will push that association into the pressure state.
      
      Proposed solution:
      
      Both problems share the same root cause: improper scaling of the socket
      buffer with rwnd. A solution in which sizeof(sk_buff) is taken into account while
      calculating rwnd is not possible, because there is no linear
      relationship between the amount of data charged on increase/decrease and the IP
      packet in which the payload arrived. Even if such a solution were pursued,
      the complexity of the code would increase. Given the nature of the current rwnd
      handling, the slow increase (in sctp_assoc_rwnd_increase) of rwnd after the pressure
      state is entered is rational, but it gives the sender a false representation of the
      current buffer space. Furthermore, it implements an additional congestion control
      mechanism which is defined by the implementation and not by the standard.
      
      The proposed solution simplifies the whole algorithm, keeping in mind the definition from the RFC:
      
      o  Receiver Window (rwnd): This gives the sender an indication of the space
         available in the receiver's inbound buffer.
      
      Core of the proposed solution is given with these lines:
      
      sctp_assoc_rwnd_update:
      	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
      		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
      	else
      		asoc->rwnd = 0;
      
      We advertise to the sender (half of) the actual space we have. 'Half' is in
      parentheses because it depends on whether you regard the size of the socket
      buffer as SO_RCVBUF, i.e. the size visible from userspace, or as twice that
      amount, i.e. the size visible from kernelspace.
      In this way the sender is given a good approximation of our buffer space,
      regardless of the buffer policy: we always advertise what we have. The proposed
      solution fixes the described problems and removes the need for the rwnd restoration
      algorithm. Finally, as the proposed solution is a simplification, some lines of code,
      along with some bytes in struct sctp_association, are saved.
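
      As a standalone illustration of the calculation, here is a minimal userspace
      sketch. toy_rwnd_update is a hypothetical helper, not the kernel function;
      it takes the socket buffer size and the bytes currently charged to it as plain
      integers instead of reading them from the association.

```c
#include <assert.h>

/* Advertise half of the free receive buffer space, or 0 when nothing is free.
 * rcvbuf stands in for sk_rcvbuf, rx_count for the memory already charged. */
static unsigned int toy_rwnd_update(int rcvbuf, int rx_count)
{
	assert(rcvbuf >= 0 && rx_count >= 0);
	if (rcvbuf - rx_count > 0)
		return (unsigned int)(rcvbuf - rx_count) >> 1;
	return 0;
}
```

      With a 20000-byte buffer and nothing queued this yields 10000; with 19914
      bytes queued it yields 43; once the buffer is full or over-committed it
      yields 0. No per-association pressure state is kept, so there is nothing
      to restore later.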
      
      Version 2 of the patch addressed comments from Vlad. The function name was made
      more descriptive, and two parts of the code were changed: one removes the
      superfluous call to sctp_assoc_rwnd_update, since that call would not result in an
      update of rwnd, and the other reorders the code so that the call to
      sctp_assoc_rwnd_update updates rwnd. Version 3 corrected the change introduced in v2
      so that the existing function is not reordered/copied inline, but is
      called correctly. Thanks to Vlad for the suggestion.
      Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Duan Jiong's avatar
      ipv4: distinguish EHOSTUNREACH from the ENETUNREACH · cd0f0b95
      Duan Jiong authored
      Since commit 251da413 ("ipv4: Cache ip_error() routes even when not forwarding."),
      the counter IPSTATS_MIB_INADDRERRORS has not worked correctly, because the value of
      err was always set to ENETUNREACH.
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Haiyang Zhang's avatar
      hyperv: Fix the carrier status setting · 891de74d
      Haiyang Zhang authored
      Without this patch, the "cat /sys/class/net/ethN/operstate" shows
      "unknown", and "ethtool ethN" shows "Link detected: yes", when VM
      boots up with or without vNIC connected.
      
      This patch fixes the problem.
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Gerrit Renker's avatar
      dccp: re-enable debug macro · 09db3080
      Gerrit Renker authored
      dccp tfrc: revert
      
      This reverts 6aee49c5 ("dccp: make local variable static") since
      the variable tfrc_debug is referenced by the tfrc_pr_debug(fmt, ...)
      macro when TFRC debugging is enabled. If it is enabled, use of the
      macro produces a compilation error.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 14 Feb, 2014 4 commits
  4. 13 Feb, 2014 7 commits