1. 10 Mar, 2016 15 commits
    • bonding: Fix ARP monitor validation · 58176316
      Jay Vosburgh authored
      [ Upstream commit 21a75f09 ]
      
      The current logic in bond_arp_rcv will accept an incoming ARP for
      validation if (a) the receiving slave is either "active" (which includes
      the currently active slave, or the current ARP slave) or, (b) there is a
      currently active slave, and it has received an ARP since it became active.
      For case (b), the receiving slave isn't the currently active slave, and is
      receiving the original broadcast ARP request, not an ARP reply from the
      target.
      
      	This logic can fail if there is no currently active slave.  In
      this situation, the ARP probe logic cycles through all slaves, assigning
      each in turn as the "current_arp_slave" for one arp_interval, then setting
      that one as "active," and sending an ARP probe from that slave.  The
      current logic expects the ARP reply to arrive on the sending
      current_arp_slave; however, due to switch FDB updating delays, the reply
      may be directed to another slave.
      
      	This can arise if the bonding slaves and switch are working, but
      the ARP target is not responding.  When the ARP target recovers, a
      condition may result wherein the ARP target host replies faster than the
      switch can update its forwarding table, causing each ARP reply to be sent
      to the previous current_arp_slave.  This will never pass the logic in
      bond_arp_rcv, as neither of the above conditions (a) or (b) is met.
      
      	Some experimentation on a LAN shows ARP reply round trips in the
      200 usec range, but my available switches never update their FDB in less
      than 4000 usec.
      
      	This patch changes the logic in bond_arp_rcv to additionally
      accept an ARP reply for validation on any slave if there is a current ARP
      slave and it sent an ARP probe during the previous arp_interval.
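
The acceptance rule after this patch can be sketched as a small predicate. This is a minimal illustration, not the kernel code: the struct `arp_rx_ctx`, its boolean fields, and the function name `should_validate` are hypothetical stand-ins for state that bond_arp_rcv derives from the bond and its slaves.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the state bond_arp_rcv consults. */
struct arp_rx_ctx {
    bool rx_slave_is_active;      /* (a) receiving slave is active or the current ARP slave */
    bool have_active_slave;       /* (b) there is a currently active slave ... */
    bool active_rx_since_active;  /* ... and it has received an ARP since becoming active */
    bool is_arp_reply;            /* frame is an ARP reply, not the broadcast request */
    bool have_current_arp_slave;  /* (new) a slave is currently being probed ... */
    bool arp_slave_sent_recently; /* ... and it sent a probe in the previous arp_interval */
};

/* Returns true if the incoming ARP should be passed on for validation. */
bool should_validate(const struct arp_rx_ctx *c)
{
    if (c->rx_slave_is_active)
        return true;
    if (c->have_active_slave && c->active_rx_since_active)
        return true;
    /* New case: a reply may land on any slave while the switch FDB is stale. */
    if (c->is_arp_reply && c->have_current_arp_slave && c->arp_slave_sent_recently)
        return true;
    return false;
}
```

With no active slave, a reply that arrives on the previous current_arp_slave fails the first two tests and is accepted only by the third.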
      
      Fixes: aeea64ac ("bonding: don't trust arp requests unless active slave really works")
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • bpf: fix branch offset adjustment on backjumps after patching ctx expansion · 647dc288
      Daniel Borkmann authored
      [ Upstream commit a1b14d27 ]
      
      When ctx access is used, the kernel often needs to expand/rewrite
      instructions, so after that patching, branch offsets have to be
      adjusted for both forward and backward jumps in the new eBPF program,
      but for backward jumps it fails to account for the delta. For
      example, if the expansion happens exactly on the insn that sits at
      the jump target, it doesn't fix up the back jump offset.
      
      Analysis on what the check in adjust_branches() is currently doing:
      
        /* adjust offset of jmps if necessary */
        if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
        else if (i > pos && i + insn->off + 1 < pos)
          insn->off -= delta;
      
      First condition (forward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- i/insn            insns[1] <--- i/insn
        insns[2] <--- pos               insns[P] <--- pos
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- target_X          insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- target_X
                                        insns[5]
      
      First case is if we cross pos-boundary and the jump instruction was
      before pos. This is handled correctly. I.e. if i == pos, then this
      would mean our jump that we currently check was the patchlet itself
      that we just injected. Since such patchlets are self-contained and
      have no awareness of any insns before or after the patched one, the
      delta is correctly not adjusted. Also, for the second test, the case
      i + insn->off + 1 == pos means we jump to that newly patched
      instruction, so no offset adjustment is needed. That part is correct.
      
      Second condition (backward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- target_X          insns[1] <--- target_X
        insns[2] <--- pos <-- target_Y  insns[P] <--- pos <-- target_Y
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- i/insn            insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- i/insn
                                        insns[5]
      
      Second interesting case is where we cross pos-boundary and the jump
      instruction was after pos. Backward jump with i == pos would be
      impossible and pose a bug somewhere in the patchlet, so the first
      condition checking i > pos is okay only by itself. However, i +
      insn->off + 1 < pos does not always work as intended to trigger the
      adjustment. It works when jump targets would be far off where the
      delta wouldn't matter. But, for example, where the fixed insn->off
      before pointed to pos (target_Y), it now points to pos + delta, so
      that additional room needs to be taken into account for the check.
      This means that i) both tests here need to be adjusted to use pos + delta,
      and ii) for the second condition, the test needs to be <= as pos
      itself can be a target in the backjump, too.
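
The two adjustments described above (use pos + delta in both backward-jump tests, and <= in the second) can be sketched as follows. This is a simplified model under stated assumptions, not the verifier code: `struct insn` keeps only a jump flag and an offset, and `adjust_branches_sketch` is a hypothetical name. It runs on the already-patched program, where the single insn at pos has become delta + 1 insns at pos .. pos + delta.

```c
/* Simplified stand-in for an eBPF insn: only the jump offset matters here. */
struct insn {
    int is_jmp; /* non-zero if this insn is a jump */
    int off;    /* jump target index = i + off + 1 */
};

/* Fix up jump offsets after the insn at pos grew by delta insns. */
void adjust_branches_sketch(struct insn *insns, int len, int pos, int delta)
{
    for (int i = 0; i < len; i++) {
        struct insn *insn = &insns[i];

        if (!insn->is_jmp)
            continue;
        if (i < pos && i + insn->off + 1 > pos)
            insn->off += delta; /* forward jump across the patched area */
        else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
            insn->off -= delta; /* backward jump across or onto pos */
    }
}
```

For the target_Y case in the diagram (pos = 2, delta = 2), the jump formerly at insns[4] sits at index 6 with off = -3, so i + off + 1 == 4 == pos + delta. The old `< pos` test skips it; the `<= pos + delta` test rewrites off to -5, which lands on pos again.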
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net: Copy inner L3 and L4 headers as unaligned on GRE TEB · 8d260fa2
      Alexander Duyck authored
      [ Upstream commit 78565208 ]
      
      This patch corrects the unaligned accesses seen on GRE TEB tunnels when
      generating hash keys.  Specifically, it forces the use of skb_copy_bits
      when the GRE inner headers will be unaligned due to NET_IP_ALIGN being
      a non-zero value.
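
The general idea of copying a header out of the packet instead of dereferencing it in place can be shown with a userspace analogue. This sketch does not use skb_copy_bits; it substitutes plain memcpy, and `struct ports` and `read_ports` are hypothetical names chosen for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-byte field pair standing in for an inner L4 header. */
struct ports {
    uint16_t src;
    uint16_t dst;
};

/*
 * Copy the header into an aligned local object rather than casting
 * (pkt + offset) to a struct pointer; memcpy is safe at any alignment.
 * Returns 0 on success, -1 if the header would run past the packet.
 */
int read_ports(const uint8_t *pkt, size_t len, size_t offset, struct ports *out)
{
    if (offset > len || len - offset < sizeof(*out))
        return -1;
    memcpy(out, pkt + offset, sizeof(*out));
    return 0;
}
```

The copy costs a few bytes of stack but avoids a misaligned load on architectures that fault or trap on one.
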
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen · 5212c0d2
      Alexander Duyck authored
      [ Upstream commit 461547f3, since
        we lack the flow dissector flags in this release we guard the
        flow label access using a test on 'skb' being NULL ]
      
      This patch fixes an issue with unaligned accesses when using
      eth_get_headlen on a page that was DMA aligned instead of being IP aligned.
      When checking the length we don't need to look at the flow label, so we
      can reorder the checks to first check if we are supposed to gather the
      flow label and then make the call to actually get it.
      
      v2:  Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can
           cause us to check for the flow label.
      Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sctp: translate network order to host order when users get a hmacid · 680d9a57
      Xin Long authored
      [ Upstream commit 7a84bd46 ]
      
      Commit ed5a377d ("sctp: translate host order to network order when
      setting a hmacid") corrected the hmacid byte order when setting a hmacid,
      but the same issue also exists when getting a hmacid.
      
      We fix it by changing hmacids to host order when users get them with
      getsockopt.
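
A minimal sketch of the conversion on the read path, assuming the identifiers are stored as 16-bit values in network order. The function name `hmacids_to_host` is hypothetical and this is not the sctp getsockopt code itself.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Convert stored HMAC identifiers from network order to host order. */
void hmacids_to_host(const uint16_t *net_ids, uint16_t *host_ids, size_t n)
{
    for (size_t i = 0; i < n; i++)
        host_ids[i] = ntohs(net_ids[i]);
}
```

On a big-endian host ntohs is a no-op, which is why a missing conversion like this only shows up on little-endian machines.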
      
      Fixes: Commit ed5a377d ("sctp: translate host order to network order when setting a hmacid")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs · fdc6b7a4
      Siva Reddy Kallam authored
      [ Upstream commit b7d98729 ]
      
      tg3_tso_bug() can hit a condition where the entire tx ring is not big
      enough to segment the GSO packet. For example, if MSS is very small,
      gso_segs can exceed the tx ring size. When we hit the condition, it
      will cause tx timeout.
      
      tg3_tso_bug() is called to handle TSO and DMA hardware bugs.
      For TSO bugs, if tg3_tso_bug() cannot succeed, we have to drop the packet.
      For DMA bugs, we can still fall back to linearize the SKB and let the
      hardware transmit the TSO packet.
      
      This patch adds a function tg3_tso_bug_gso_check() to check if there
      are enough tx descriptors for GSO before calling tg3_tso_bug().
      The caller will then handle the error appropriately - drop or
      linearize the SKB.
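
The kind of pre-check described here might look like the sketch below. It is an illustration only: `gso_fits_tx_ring` is a hypothetical name, and the per-segment descriptor estimate `DESC_PER_SEG` is an assumed value, not something stated in this commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed worst-case number of tx descriptors per segment; illustrative only. */
#define DESC_PER_SEG 3u

/* True if the tx ring can hold every segment that software GSO would produce. */
bool gso_fits_tx_ring(uint32_t gso_segs, uint32_t tx_ring_size)
{
    return gso_segs < tx_ring_size / DESC_PER_SEG;
}
```

When the check fails, the caller decides between dropping the packet (TSO bug) and linearizing the SKB (DMA bug), as described above.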
      
      v2: Corrected patch description to avoid confusion.
      Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Acked-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net:Add sysctl_max_skb_frags · d52e872f
      Hans Westgaard Ry authored
      [ Upstream commit 5f74f82e ]
      
      Devices may have limits on the number of fragments in an skb they support.
      The current codebase uses a constant as the maximum number of fragments
      one skb can hold and use.
      When enabling scatter/gather and running traffic with many small messages,
      the codebase uses the maximum number of fragments and may thereby violate
      the max for certain devices.
      The patch introduces a global variable as the max number of fragments.
      Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • unix: correctly track in-flight fds in sending process user_struct · cb1e702e
      Hannes Frederic Sowa authored
      [ Upstream commit 415e3d3e ]
      
      The commit referenced in the Fixes tag incorrectly accounted the number
      of in-flight fds over a unix domain socket to the original opener
      of the file descriptor. This allows another process to arbitrarily
      deplete the original file opener's resource limit for the maximum of
      open files. Instead the sending process and its struct cred should
      be credited.
      
      To do so, we add a reference counted struct user_struct pointer to the
      scm_fp_list and use it to account for the number of inflight unix fds.
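
The accounting change can be modelled in a few lines. This is a simplified sketch under stated assumptions: `struct user_acct` and `struct fd_list` are hypothetical counterparts of user_struct and scm_fp_list, the helper names are invented, and reference counting of the user object is left out for brevity.

```c
/* Hypothetical, simplified counterparts of user_struct and scm_fp_list. */
struct user_acct {
    long unix_inflight;     /* fds this user currently has in flight */
};

struct fd_list {
    struct user_acct *user; /* the sender, captured when the list is built */
    int count;              /* number of fds carried by the message */
};

/* Remember who is sending; who originally opened each file is irrelevant. */
void fd_list_attach(struct fd_list *fpl, struct user_acct *sender, int count)
{
    fpl->user = sender;
    fpl->count = count;
}

/* Charge the sender when the fds go in flight. */
void fd_list_inflight(struct fd_list *fpl)
{
    fpl->user->unix_inflight += fpl->count;
}

/* Release the charge once the fds are received or discarded. */
void fd_list_notinflight(struct fd_list *fpl)
{
    fpl->user->unix_inflight -= fpl->count;
}
```

Because the charge follows the pointer stored in the list, a process that passes around someone else's descriptors depletes only its own limit.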
      
      Fixes: 712f4aad ("unix: properly account for FDs passed over unix sockets")
      Reported-by: David Herrmann <dh.herrmann@gmail.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ipv6: fix a lockdep splat · 36b4caf4
      Eric Dumazet authored
      [ Upstream commit 44c3d0c1 ]
      
      Silence lockdep false positive about rcu_dereference() being
      used in the wrong context.
      
      First one should use rcu_dereference_protected() as we own the spinlock.
      
      Second one should be a normal assignment, as no barrier is needed.
      
      Fixes: 18367681 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ipv6: addrconf: Fix recursive spin lock call · f2e892de
      subashab@codeaurora.org authored
      [ Upstream commit 16186a82 ]
      
      An RCU stall with the following backtrace was seen on a system with
      forwarding, optimistic_dad and use_optimistic set. To reproduce,
      set these flags and allow ipv6 autoconf.
      
      This occurs because the device write_lock is acquired while already
      holding the read_lock. Backtrace below:
      
      INFO: rcu_preempt self-detected stall on CPU { 1}  (t=2100 jiffies
       g=3992 c=3991 q=4471)
      <6> Task dump for CPU 1:
      <2> kworker/1:0     R  running task    12168    15   2 0x00000002
      <2> Workqueue: ipv6_addrconf addrconf_dad_work
      <6> Call trace:
      <2> [<ffffffc000084da8>] el1_irq+0x68/0xdc
      <2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30
      <2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4
      <2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4
      <2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c
      <2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70
      <2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334
      <2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc
      <2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418
      <2> [<ffffffc0000bb40c>] kthread+0xe0/0xec
      
      v2: do addrconf_dad_kick inside read lock and then acquire write
      lock for ipv6_ifa_notify as suggested by Eric
      
      Fixes: 7fd2561e ("net: ipv6: Add a sysctl to make optimistic
      addresses useful candidates")
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Erik Kline <ek@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net/ipv6: add sysctl option accept_ra_min_hop_limit · 42fd2eb6
      Hangbin Liu authored
      [ Upstream commit 8013d1d7 ]
      
      Commit 6fd99094 ("ipv6: Don't reduce hop limit for an interface")
      stopped accepting a hop limit from an RA if it is smaller than the
      current hop limit, for security reasons. But this behavior breaks the
      RFC definition.
      
      RFC 4861, 6.3.4.  Processing Received Router Advertisements
         A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
         and Retrans Timer) may contain a value denoting that it is
         unspecified.  In such cases, the parameter should be ignored and the
         host should continue using whatever value it is already using.
      
         If the received Cur Hop Limit value is non-zero, the host SHOULD set
         its CurHopLimit variable to the received value.
      
      So add sysctl option accept_ra_min_hop_limit to let users choose the
      minimum hop limit value they accept from an RA, and set the default to 1
      to meet RFC standards.
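
The resulting decision can be sketched as a pure function. This is an illustration under stated assumptions rather than the ndisc code: `apply_ra_hop_limit` is a hypothetical name, and `min_hop_limit` stands for the value of accept_ra_min_hop_limit.

```c
#include <stdint.h>

/*
 * Return the hop limit to use after processing an RA.
 * A Cur Hop Limit of 0 means "unspecified" per RFC 4861, section 6.3.4.
 */
uint8_t apply_ra_hop_limit(uint8_t current, uint8_t ra_hop_limit, int min_hop_limit)
{
    if (ra_hop_limit == 0)
        return current; /* unspecified: keep the value already in use */
    if (ra_hop_limit < min_hop_limit)
        return current; /* below the configured floor: ignore the RA value */
    return ra_hop_limit;
}
```

With the default floor of 1, every non-zero advertised value is accepted, which matches the RFC text quoted above; a higher floor makes the host ignore smaller advertised values.
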
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ipv6/udp: use sticky pktinfo egress ifindex on connect() · 598fadfa
      Paolo Abeni authored
      [ Upstream commit 1cdda918 ]
      
      Currently, the egress interface index specified via IPV6_PKTINFO
      is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7
      can be subverted when the user space application calls connect()
      before sendmsg().
      Fix it by initializing properly flowi6_oif in connect() before
      performing the route lookup.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail() · 637a054f
      Paolo Abeni authored
      [ Upstream commit 6f21c96a ]
      
      The current implementation of ip6_dst_lookup_tail basically
      ignores the egress ifindex match: if the saddr is set,
      ip6_route_output() purposefully ignores flowi6_oif, due
      to the commit d46a9d67 ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
      flag if saddr set"). If the saddr is 'any', the first route lookup
      in ip6_dst_lookup_tail fails, but upon failure a second lookup will
      be performed with saddr set, thus ignoring the ifindex constraint.
      
      This commit adds an output route lookup function variant, which
      allows the caller to specify lookup flags, and modifies
      ip6_dst_lookup_tail() to enforce the ifindex match on the second
      lookup via said helper.
      
      ip6_route_output() now becomes a static inline function built on
      top of ip6_route_output_flags(); as a side effect, out-of-tree
      modules now need a GPL license to access the output route lookup
      functionality.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • tcp: beware of alignments in tcp_get_info() · 80eb49ae
      Eric Dumazet authored
      [ Upstream commit ff5d7497 ]
      
      With some combinations of user-provided flags in a netlink command,
      it is possible to call tcp_get_info() with a buffer that is not 8-byte
      aligned.
      
      It does matter on some arches, so we need to use put_unaligned() to
      store the u64 fields.
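
What an unaligned 64-bit store amounts to can be shown with a userspace analogue. The sketch below is not the kernel's put_unaligned(); it substitutes memcpy, and the names `store_u64_unaligned` and `load_u64_unaligned` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an address with no alignment guarantee. */
void store_u64_unaligned(void *dst, uint64_t val)
{
    memcpy(dst, &val, sizeof(val));
}

/* Matching load, used here only to read the value back. */
uint64_t load_u64_unaligned(const void *src)
{
    uint64_t val;

    memcpy(&val, src, sizeof(val));
    return val;
}
```

A direct `*(uint64_t *)dst = val` through a misaligned pointer is undefined behavior in C and can trap on some architectures; the byte-wise copy avoids both.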
      
      Current iproute2 package does not trigger this particular issue.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Fixes: 977cb0ec ("tcp: add pacing_rate information into tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • switchdev: Require RTNL mutex to be held when sending FDB notifications · 6f91d5af
      Ido Schimmel authored
      [ Upstream commit 4f2c6ae5 ]
      
      When switchdev drivers process FDB notifications from the underlying
      device they resolve the netdev to which the entry points and notify
      the bridge using the switchdev notifier.
      
      However, since the RTNL mutex is not held there is nothing preventing
      the netdev from disappearing in the middle, which will cause
      br_switchdev_event() to dereference a non-existing netdev.
      
      Make switchdev drivers hold the lock at the beginning of the
      notification processing session and release it once it ends, after
      notifying the bridge.
      
      Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
      when RTNL mutex is held.
      
      Fixes: 03bf0c28 ("switchdev: introduce switchdev notifier")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
  2. 09 Mar, 2016 25 commits