1. 28 Jul, 2013 40 commits
    • bcache: Shutdown fix · 63a53870
      Kent Overstreet authored
      commit 5caa52af upstream.
      
      Stopping a cache set is supposed to make it stop attached backing
      devices, but somewhere along the way that code got lost. Fixing this
      mainly has the effect of fixing our reboot notifier.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bcache: Advertise that flushes are supported · 3fcbc176
      Kent Overstreet authored
      commit 54d12f2b upstream.
      
      Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
      out unless we say we support them...
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bcache: Fix a dumb race · c0d8f455
      Kent Overstreet authored
      commit 6aa8f1a6 upstream.
      
      In the far-too-complicated closure code - closures can have destructors,
      for probably dubious reasons; they get run after the closure is no
      longer waiting on anything but before dropping the parent ref, intended
      just for freeing whatever memory the closure is embedded in.
      
      Trouble is, when remaining goes to 0 and we've got nothing more to run -
      we also have to unlock the closure, setting remaining to -1. If there's
      a destructor, that unlock isn't doing anything - nobody could be trying
      to lock it if we're about to free it - but if the unlock _is_ needed...
      that check for a destructor was racy. Argh.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fuse: readdirplus: sanity checks · 223828d8
      Miklos Szeredi authored
      commit a28ef45c upstream.
      
      Add sanity checks before adding or updating an entry with data received
      from readdirplus.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fuse: readdirplus: fix instantiate · dc2a6c2d
      Miklos Szeredi authored
      commit 2914941e upstream.
      
      Fuse does instantiation slightly differently from NFS/CIFS which use
      d_materialise_unique().
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fuse: readdirplus: fix dentry leak · 05ac7b3a
      Niels de Vos authored
      commit 53ce9a33 upstream.
      
      In case d_lookup() returns a dentry with d_inode == NULL, the dentry is not
      returned with dput(). This results in triggering a BUG() in
      shrink_dcache_for_umount_subtree():
      
        BUG: Dentry ...{i=0,n=...} still in use (1) [unmount of fuse fuse]
      
      [SzM: need to d_drop() as well]
      Reported-by: Justin Clift <jclift@redhat.com>
      Signed-off-by: Niels de Vos <ndevos@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Tested-by: Niels de Vos <ndevos@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • RAPIDIO: IDT_GEN2: Fix build error. · df18f5f9
      Ralf Baechle authored
      commit 27f62b9f upstream.
      
        CC      drivers/rapidio/switches/idt_gen2.o
      drivers/rapidio/switches/idt_gen2.c: In function ‘idtg2_show_errlog’:
      drivers/rapidio/switches/idt_gen2.c:379:30: error: ‘PAGE_SIZE’ undeclared (first use in this function)
      drivers/rapidio/switches/idt_gen2.c:379:30: note: each undeclared identifier is reported only once for each function it appears in
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • MIPS: Oceton: Fix build error. · cf6a37e7
      Ralf Baechle authored
      commit 39205750 upstream.
      
      If CONFIG_CAVIUM_OCTEON_LOCK_L2_TLB, CONFIG_CAVIUM_OCTEON_LOCK_L2_EXCEPTION,
      CONFIG_CAVIUM_OCTEON_LOCK_L2_LOW_LEVEL_INTERRUPT and
      CONFIG_CAVIUM_OCTEON_LOCK_L2_INTERRUPT are all undefined:
      
      arch/mips/cavium-octeon/setup.c: In function ‘prom_init’:
      arch/mips/cavium-octeon/setup.c:715:12: error: unused variable ‘ebase’ [-Werror=unused-variable]
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vlan: fix a race in egress prio management · 5110890c
      Eric Dumazet authored
      [ Upstream commit 3e3aac49 ]
      
      egress_priority_map[] hash table updates are protected by rtnl,
      and we never remove elements until device is dismantled.
      
      We have to make sure that before inserting a new element in hash table,
      all its fields are committed to memory or else another cpu could
      find corrupt values and crash.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vlan: mask vlan prio bits · 37b25f3f
      Eric Dumazet authored
      [ Upstream commit d4b812de ]
      
      In commit 48cc32d3
      ("vlan: don't deliver frames for unknown vlans to protocols")
      Florian made sure we set pkt_type to PACKET_OTHERHOST
      if the vlan id is set and we could find a vlan device for this
      particular id.
      
      But we also have a problem if prio bits are set.
      
      Steinar reported an issue on a router receiving IPv6 frames with a
      vlan tag of 4000 (id 0, prio 2), and tunneled into a sit device,
      because skb->vlan_tci is set.
      
      Forwarded frame is completely corrupted : We can see (8100:4000)
      being inserted in the middle of IPv6 source address :
      
      16:48:00.780413 IP6 2001:16d8:8100:4000:ee1c:0:9d9:bc87 >
      9f94:4d95:2001:67c:29f4::: ICMP6, unknown icmp6 type (0), length 64
             0x0000:  0000 0029 8000 c7c3 7103 0001 a0ae e651
             0x0010:  0000 0000 ccce 0b00 0000 0000 1011 1213
             0x0020:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223
             0x0030:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233
      
      It seems we are not really ready to properly cope with this right now.
      
      We can probably do better in future kernels :
      vlan_get_ingress_priority() should be a netdev property instead of
      a per vlan_dev one.
      
      For stable kernels, let's clear vlan_tci to fix the bugs.
      Reported-by: Steinar H. Gunderson <sesse@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
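
The tag value quoted in the report above can be decoded by hand. The following Python sketch is only an illustration, not kernel code. It assumes the standard 802.1Q layout of the 16-bit TCI field (3 priority bits, 1 DEI/CFI bit, 12 VLAN ID bits), and `decode_tci` is a hypothetical helper name. It shows why a tag of 0x4000 corresponds to id 0 with prio 2, so a non-zero vlan_tci can remain even when no VLAN id is present.

```python
VLAN_PRIO_SHIFT = 13      # priority occupies the top 3 bits of the TCI
VLAN_VID_MASK = 0x0FFF    # VLAN ID occupies the low 12 bits of the TCI


def decode_tci(tci):
    """Split a 16-bit 802.1Q TCI value into (vlan_id, priority)."""
    vlan_id = tci & VLAN_VID_MASK
    priority = (tci >> VLAN_PRIO_SHIFT) & 0x7
    return vlan_id, priority


# The frame from the report: TPID 0x8100 followed by TCI 0x4000.
vlan_id, priority = decode_tci(0x4000)
print(vlan_id, priority)
```

A TCI of 0 decodes to `(0, 0)`, which is the state the stable fix restores by clearing vlan_tci.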
    • macvtap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS · 7d9e6dd8
      Jason Wang authored
      [ Upstream commit ece793fc ]
      
      We try to linearize part of the skb when the number of iov is greater than
      MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
      one page, so zerocopy_sg_fromiovec() may still fail and may break the guest
      network.
      
      Solve this problem by calculating the pages needed for iov before trying to do
      zerocopy, and switching to copy instead of zerocopy if it needs more than
      MAX_SKB_FRAGS.
      
      This is done by introducing a new helper to count the pages for iov, and
      calling uarg->callback() manually when switching from zerocopy to copy to notify
      vhost.
      
      We can do further optimization on top.
      
      This bug was introduced by commit b92946e2
      (macvtap: zerocopy: validate vectors before building skb).
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
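
The page-counting idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the kernel helper itself: `iov_pages` and `can_zerocopy` are hypothetical names, the page size is assumed to be 4 KiB, and `MAX_SKB_FRAGS = 17` is assumed as the usual value for that page size. Each segment is a `(base_address, length)` pair.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
MAX_SKB_FRAGS = 17              # assumption: usual value with 4 KiB pages


def iov_pages(iov):
    """Count the pages touched by a list of (base_address, length) segments."""
    pages = 0
    for base, length in iov:
        if length == 0:
            continue            # an empty segment touches no page
        offset = base & (PAGE_SIZE - 1)
        # Round the span [offset, offset + length) up to whole pages.
        pages += (offset + length + PAGE_SIZE - 1) >> PAGE_SHIFT
    return pages


def can_zerocopy(iov):
    """Zerocopy is only attempted when every page fits in one frag slot."""
    return iov_pages(iov) <= MAX_SKB_FRAGS


# Two 64 KiB vectors: only 2 iov entries, but 32 pages.
big = [(0, 65536), (0x100000, 65536)]
print(len(big), iov_pages(big), can_zerocopy(big))
```

The example shows the case the message describes: the number of vectors is well below MAX_SKB_FRAGS, yet the page count is not, so counting vectors alone is not a sufficient check.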
    • tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS · 05464d21
      Jason Wang authored
      [ Upstream commit 88529176 ]
      
      We try to linearize part of the skb when the number of iov is greater than
      MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
      one page, so zerocopy_sg_fromiovec() may still fail and may break the guest
      network.
      
      Solve this problem by calculating the pages needed for iov before trying to do
      zerocopy, and switching to copy instead of zerocopy if it needs more than
      MAX_SKB_FRAGS.
      
      This is done by introducing a new helper to count the pages for iov, and
      calling uarg->callback() manually when switching from zerocopy to copy to notify
      vhost.
      
      We can do further optimization on top.
      
      The bug was introduced by commit 0690899b
      (tun: experimental zero copy tx support)
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • pkt_sched: sch_qfq: remove a source of high packet delay/jitter · c1d220fb
      Paolo Valente authored
      [ Upstream commit 87f40dd6 ]
      
      QFQ+ inherits from QFQ a design choice that may cause a high packet
      delay/jitter and a severe short-term unfairness. Like QFQ, QFQ+ uses a
      special quantity, the system virtual time, to track the service
      provided by the ideal system it approximates. When a packet is
      dequeued, this quantity must be incremented by the size of the packet,
      divided by the sum of the weights of the aggregates waiting to be
      served. Tracking this sum correctly is a non-trivial task, because, to
      preserve tight service guarantees, the decrement of this sum must be
      delayed in a special way [1]: this sum can be decremented only after
      its value would also decrease in the ideal system approximated by
      QFQ+. For efficiency, QFQ+ keeps track only of the 'instantaneous'
      weight sum, increased and decreased immediately as the weight of an
      aggregate changes, and as an aggregate is created or destroyed (which,
      in its turn, happens as a consequence of some class being
      created/destroyed/changed). However, to avoid the problems caused to
      service guarantees by these immediate decreases, QFQ+ increments the
      system virtual time using the maximum value allowed for the weight
      sum, 2^10, in place of the dynamic, instantaneous value. The
      instantaneous value of the weight sum is used only to check whether a
      request of weight increase or a class creation can be satisfied.
      
      Unfortunately, the problems caused by this choice are worse than the
      temporary degradation of the service guarantees that may occur, when a
      class is changed or destroyed, if the instantaneous value of the
      weight sum was used to update the system virtual time. In fact, the
      fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is
      equal to the ratio between the weight of the aggregate and the sum of
      the weights of the competing aggregates. The packet delay guaranteed
      to the aggregate is instead inversely proportional to the guaranteed
      bandwidth. By using the maximum possible value, and not the actual
      value of the weight sum, QFQ+ provides each aggregate with the worst
      possible service guarantees, and not with service guarantees related
      to the actual set of competing aggregates. To see the consequences of
      this fact, consider the following simple example.
      
      Suppose that only the following aggregates are backlogged, i.e., that
      only the classes in the following aggregates have packets to transmit:
      one aggregate with weight 10, say A, and ten aggregates with weight 1,
      say B1, B2, ..., B10. In particular, suppose that these aggregates are
      always backlogged. Given the weight distribution, the smoothest and
      fairest service order would be:
      A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...
      
      QFQ+ would provide exactly this optimal service if it used the actual
      value for the weight sum instead of the maximum possible value, i.e.,
      11 instead of 2^10. In contrast, since QFQ+ uses the latter value, it
      serves aggregates as follows (easy to prove and to reproduce
      experimentally):
      A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...
      
      By replacing 10 with N in the above example, and by increasing N, one
      can increase at will the maximum packet delay and the jitter
      experienced by the classes in aggregate A.
      
      This patch addresses this issue by just using the above
      'instantaneous' value of the weight sum, instead of the maximum
      possible value, when updating the system virtual time.  After the
      instantaneous weight sum is decreased, QFQ+ may deviate from the ideal
      service for a time interval in the order of the time to serve one
      maximum-size packet for each backlogged class. The worst-case extent
      of the deviation exhibited by QFQ+ during this time interval [1] is
      basically the same as of the deviation described above (but, without
      this patch, QFQ+ suffers from such a deviation all the time). Finally,
      this patch modifies the comment to the function qfq_slot_insert, to
      make it coherent with the fact that the weight sum used by QFQ+ can
      now be lower than the maximum possible value.
      
      [1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix",
      Proceedings of AAA-IDEA'05, June 2005.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
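
The effect of the divisor on the service order can be reproduced with a small Python model. This is a sketch under loud assumptions and not the QFQ+ code: it is a simplified WF2Q+-style scheduler with unit-size packets, always-backlogged aggregates, exact rational timestamps, and ties broken by insertion order. The function name `service_order` is hypothetical. The "actual" weight sum is taken as the sum of the weights listed in the example (one aggregate of weight 10 plus ten of weight 1).

```python
from fractions import Fraction


def service_order(weights, wsum, n_packets):
    """Return the order in which aggregates are served.

    weights: dict name -> weight; wsum: divisor used to advance the
    system virtual time; every packet has size 1.
    """
    start = {name: Fraction(0) for name in weights}
    finish = {name: Fraction(1, w) for name, w in weights.items()}
    vtime = Fraction(0)
    order = []
    for _ in range(n_packets):
        # If nothing is eligible, jump virtual time to the earliest start.
        vtime = max(vtime, min(start.values()))
        eligible = [name for name in weights if start[name] <= vtime]
        # Among eligible aggregates, serve the smallest finish time.
        chosen = min(eligible, key=lambda name: finish[name])
        order.append(chosen)
        start[chosen] = finish[chosen]
        finish[chosen] = start[chosen] + Fraction(1, weights[chosen])
        # Virtual time advances by packet size divided by the weight sum.
        vtime += Fraction(1, wsum)
    return order


weights = {"A": 10, **{f"B{i}": 1 for i in range(1, 11)}}
smooth = service_order(weights, sum(weights.values()), 22)
bursty = service_order(weights, 2**10, 32)
print(" ".join(smooth))
print(" ".join(bursty))
```

Under this model the first line alternates A with B1, B2, ... B10, while the second serves A once, then B1 to B10, then A ten times in a row, matching the two patterns quoted in the message.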
    • hyperv: Fix the NETIF_F_SG flag setting in netvsc · 98bec4a1
      Haiyang Zhang authored
      [ Upstream commit f4570820 ]
      
      SG mode is not currently supported by netvsc, so remove this flag for now.
      Otherwise, it will be unconditionally enabled by commit ec5f0615
          "Kill link between CSUM and SG features"
      Previously, the SG feature was disabled because CSUM is not set here.
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • be2net: Fix to avoid hardware workaround when not needed · e0ca176c
      Sarveshwar Bandi authored
      [ Upstream commit 52fe29e4 ]
      
      Hardware workaround requesting hardware to skip vlan insertion is necessary
      only when umc or qnq is enabled. Enabling this workaround in other scenarios
      could cause controller to stall.
      Signed-off-by: Sarveshwar Bandi <sarveshwar.bandi@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv4: set transport header earlier · b3923f82
      Eric Dumazet authored
      [ Upstream commit 21d1196a ]
      
      commit 45f00f99 ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
      performance regression for non GRO traffic, basically disabling
      IP early demux.
      
      IPv6 stack resets transport header in ip6_rcv() before calling
      IP early demux in ip6_rcv_finish(), while IPv4 does this only in
      ip_local_deliver_finish(), _after_ IP early demux.
      
      GRO traffic happened to enable IP early demux because transport header
      is also set in inet_gro_receive()
      
      Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
      same : transport_header should be set in ip_rcv() instead of
      ip_local_deliver_finish()
      
      ip_local_deliver_finish() can also use skb_network_header_len() which is
      faster than ip_hdrlen()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • atl1e: unmap partially mapped skb on dma error and free skb · da7e35ce
      Neil Horman authored
      [ Upstream commit 584ec435 ]
      
      Ben Hutchings pointed out that my recent update to atl1e
      in commit 352900b5
      ("atl1e: fix dma mapping warnings") was missing a bit of code.
      
      Specifically it reset the hardware tx ring to its original state when
      we hit a dma error, but didn't unmap any existing mappings from the
      operation.  This patch fixes that up.  It also remembers to free the
      skb in the event that an error occurs, so we don't leak.  Untested, as
      I don't have hardware.  I think it's pretty straightforward, but please
      review closely.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Jay Cliburn <jcliburn@gmail.com>
      CC: Chris Snook <chris.snook@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • atl1e: fix dma mapping warnings · dc419b2d
      Neil Horman authored
      [ Upstream commit 352900b5 ]
      
      Recently had this backtrace reported:
      WARNING: at lib/dma-debug.c:937 check_unmap+0x47d/0x930()
      Hardware name: System Product Name
      ATL1E 0000:02:00.0: DMA-API: device driver failed to check map error[device
      address=0x00000000cbfd1000] [size=90 bytes] [mapped as single]
      Modules linked in: xt_conntrack nf_conntrack ebtable_filter ebtables
      ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek iTCO_wdt
      iTCO_vendor_support snd_hda_intel acpi_cpufreq mperf coretemp btrfs zlib_deflate
      snd_hda_codec snd_hwdep microcode raid6_pq libcrc32c snd_seq usblp serio_raw xor
      snd_seq_device joydev snd_pcm snd_page_alloc snd_timer snd lpc_ich i2c_i801
      soundcore mfd_core atl1e asus_atk0110 ata_generic pata_acpi radeon i2c_algo_bit
      drm_kms_helper ttm drm i2c_core pata_marvell uinput
      Pid: 314, comm: systemd-journal Not tainted 3.9.0-0.rc6.git2.3.fc19.x86_64 #1
      Call Trace:
       <IRQ>  [<ffffffff81069106>] warn_slowpath_common+0x66/0x80
       [<ffffffff8106916c>] warn_slowpath_fmt+0x4c/0x50
       [<ffffffff8138151d>] check_unmap+0x47d/0x930
       [<ffffffff810ad048>] ? sched_clock_cpu+0xa8/0x100
       [<ffffffff81381a2f>] debug_dma_unmap_page+0x5f/0x70
       [<ffffffff8137ce30>] ? unmap_single+0x20/0x30
       [<ffffffffa01569a1>] atl1e_intr+0x3a1/0x5b0 [atl1e]
       [<ffffffff810d53fd>] ? trace_hardirqs_off+0xd/0x10
       [<ffffffff81119636>] handle_irq_event_percpu+0x56/0x390
       [<ffffffff811199ad>] handle_irq_event+0x3d/0x60
       [<ffffffff8111cb6a>] handle_fasteoi_irq+0x5a/0x100
       [<ffffffff8101c36f>] handle_irq+0xbf/0x150
       [<ffffffff811dcb2f>] ? file_sb_list_del+0x3f/0x50
       [<ffffffff81073b10>] ? irq_enter+0x50/0xa0
       [<ffffffff8172738d>] do_IRQ+0x4d/0xc0
       [<ffffffff811dcb2f>] ? file_sb_list_del+0x3f/0x50
       [<ffffffff8171c6b2>] common_interrupt+0x72/0x72
       <EOI>  [<ffffffff810db5b2>] ? lock_release+0xc2/0x310
       [<ffffffff8109ea04>] lg_local_unlock_cpu+0x24/0x50
       [<ffffffff811dcb2f>] file_sb_list_del+0x3f/0x50
       [<ffffffff811dcb6d>] fput+0x2d/0xc0
       [<ffffffff811d8ea1>] filp_close+0x61/0x90
       [<ffffffff811fae4d>] __close_fd+0x8d/0x150
       [<ffffffff811d8ef0>] sys_close+0x20/0x50
       [<ffffffff81725699>] system_call_fastpath+0x16/0x1b
      
      The usual straightforward failure to check for dma_mapping_error after a map
      operation is completed.
      
      This patch should fix it, the reporter wandered off after filing this bz:
      https://bugzilla.redhat.com/show_bug.cgi?id=954170
      
      and I don't have hardware to test, but the fix is pretty straightforward, so I
      figured I'd post it for review.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Jay Cliburn <jcliburn@gmail.com>
      CC: Chris Snook <chris.snook@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: only static routes qualify for equal cost multipathing · d6245ef7
      Hannes Frederic Sowa authored
      [ Upstream commit 307f2fb9 ]
      
      Static routes in this case are non-expiring routes which did not get
      configured by autoconf or by icmpv6 redirects.
      
      To make sure we actually get an ecmp route while searching for the first
      one in this fib6_node's leafs, also make sure it matches the ecmp route
      assumptions.
      
      v2:
      a) Removed RTF_EXPIRE check in dst.from chain. The check of RTF_ADDRCONF
         already ensures that this route, even if added again without
         RTF_EXPIRES (in case of a RA announcement with infinite timeout),
         does not cause the rt6i_nsiblings logic to go wrong if a later RA
         updates the expiration time later.
      
      v3:
      a) Allow RTF_EXPIRES routes to enter the ecmp route set. We have to do so,
         because a pmtu event could update the RTF_EXPIRES flag and we would
         not count this route, if another route joins this set. We now filter
         only for RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC, which are flags that
         don't get changed after rt6_info construction.
      
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
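
The flag filter described in v3 above can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the kernel function: `qualifies_for_ecmp` is a hypothetical name, the flag values are reproduced from memory of the Linux UAPI headers and should be checked against them, and the rule "of the three filtered flags, only RTF_GATEWAY may be set" is an assumed reading of the message.

```python
# Flag values reproduced from memory of the Linux UAPI headers (assumption).
RTF_GATEWAY = 0x0002
RTF_DYNAMIC = 0x0010
RTF_ADDRCONF = 0x00040000
RTF_EXPIRES = 0x00400000

# Only these flags take part in the decision; RTF_EXPIRES is ignored on purpose.
ECMP_FILTER = RTF_GATEWAY | RTF_ADDRCONF | RTF_DYNAMIC


def qualifies_for_ecmp(flags):
    """True for a gateway route not created by autoconf or a redirect."""
    return (flags & ECMP_FILTER) == RTF_GATEWAY


print(qualifies_for_ecmp(RTF_GATEWAY))                 # static gateway route
print(qualifies_for_ecmp(RTF_GATEWAY | RTF_EXPIRES))   # expiry set later, e.g. by pmtu
print(qualifies_for_ecmp(RTF_GATEWAY | RTF_ADDRCONF))  # learned from an RA
```

Because RTF_EXPIRES is outside the mask, a later change of that flag does not alter whether a route counts as a sibling, which is the property the v3 note asks for.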
    • gre: Fix MTU sizing check for gretap tunnels · 6afbcb59
      Alexander Duyck authored
      [ Upstream commit 8c91e162 ]
      
      This change fixes an MTU sizing issue seen with gretap tunnels when non-gso
      packets are sent from the interface.
      
      In my case I was able to reproduce the issue by simply sending a ping of
      1421 bytes with the gretap interface created on a device with a standard
      1500 mtu.
      
      This fix is based on the fact that the tunnel mtu is already adjusted by
      dev->hard_header_len so it would make sense that any packets being compared
      against that mtu should also be adjusted by hard_header_len and the tunnel
      header instead of just the tunnel header.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Reported-by: Cong Wang <amwang@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ifb: fix oops when loading the ifb failed · f84ddbc5
      dingtianhong authored
      [ Upstream commit f2966cd5 ]
      
      If __rtnl_link_register() fails when loading ifb, the init code takes the
      wrong path and oopses, so fix it just like dummy.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dummy: fix oops when loading the dummy failed · 629b1d3d
      dingtianhong authored
      [ Upstream commit 2c8a0189 ]
      
      We rename the dummy in modprobe.conf like this:
      
      install dummy0 /sbin/modprobe -o dummy0 --ignore-install dummy
      install dummy1 /sbin/modprobe -o dummy1 --ignore-install dummy
      
      We got oops when we run the command:
      
      modprobe dummy0
      modprobe dummy1
      
      ------------[ cut here ]------------
      
      [ 3302.187584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [ 3302.195411] IP: [<ffffffff813fe62a>] __rtnl_link_unregister+0x9a/0xd0
      [ 3302.201844] PGD 85c94a067 PUD 8517bd067 PMD 0
      [ 3302.206305] Oops: 0002 [#1] SMP
      [ 3302.299737] task: ffff88105ccea300 ti: ffff880eba4a0000 task.ti: ffff880eba4a0000
      [ 3302.307186] RIP: 0010:[<ffffffff813fe62a>]  [<ffffffff813fe62a>] __rtnl_link_unregister+0x9a/0xd0
      [ 3302.316044] RSP: 0018:ffff880eba4a1dd8  EFLAGS: 00010246
      [ 3302.321332] RAX: 0000000000000000 RBX: ffffffff81a9d738 RCX: 0000000000000002
      [ 3302.328436] RDX: 0000000000000000 RSI: ffffffffa04d602c RDI: ffff880eba4a1dd8
      [ 3302.335541] RBP: ffff880eba4a1e18 R08: dead000000200200 R09: dead000000100100
      [ 3302.342644] R10: 0000000000000080 R11: 0000000000000003 R12: ffffffff81a9d788
      [ 3302.349748] R13: ffffffffa04d7020 R14: ffffffff81a9d670 R15: ffff880eba4a1dd8
      [ 3302.364910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3302.370630] CR2: 0000000000000008 CR3: 000000085e15e000 CR4: 00000000000427e0
      [ 3302.377734] DR0: 0000000000000003 DR1: 00000000000000b0 DR2: 0000000000000001
      [ 3302.384838] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 3302.391940] Stack:
      [ 3302.393944]  ffff880eba4a1dd8 ffff880eba4a1dd8 ffff880eba4a1e18 ffffffffa04d70c0
      [ 3302.401350]  00000000ffffffef ffffffffa01a8000 0000000000000000 ffffffff816111c8
      [ 3302.408758]  ffff880eba4a1e48 ffffffffa01a80be ffff880eba4a1e48 ffffffffa04d70c0
      [ 3302.416164] Call Trace:
      [ 3302.418605]  [<ffffffffa01a8000>] ? 0xffffffffa01a7fff
      [ 3302.423727]  [<ffffffffa01a80be>] dummy_init_module+0xbe/0x1000 [dummy0]
      [ 3302.430405]  [<ffffffffa01a8000>] ? 0xffffffffa01a7fff
      [ 3302.435535]  [<ffffffff81000322>] do_one_initcall+0x152/0x1b0
      [ 3302.441263]  [<ffffffff810ab24b>] do_init_module+0x7b/0x200
      [ 3302.446824]  [<ffffffff810ad3d2>] load_module+0x4e2/0x530
      [ 3302.452215]  [<ffffffff8127ae40>] ? ddebug_dyndbg_boot_param_cb+0x60/0x60
      [ 3302.458979]  [<ffffffff810ad5f1>] SyS_init_module+0xd1/0x130
      [ 3302.464627]  [<ffffffff814b9652>] system_call_fastpath+0x16/0x1b
      [ 3302.490090] RIP  [<ffffffff813fe62a>] __rtnl_link_unregister+0x9a/0xd0
      [ 3302.496607]  RSP <ffff880eba4a1dd8>
      [ 3302.500084] CR2: 0000000000000008
      [ 3302.503466] ---[ end trace 8342d49cd49f78ed ]---
      
      The reason is that when loading dummy, if __rtnl_link_register() fails,
      init_module should return and avoid taking the wrong path.
      Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF · b4f1489e
      Hannes Frederic Sowa authored
      [ Upstream commit afc154e9 ]
      
      This is a follow-up patch to 3630d400
      ("ipv6: rt6_check_neigh should successfully verify neigh if no NUD
      information are available").
      
      Since the removal of rt->n in rt6_info we can end up with a dst ==
      NULL in rt6_check_neigh. In case the kernel is not compiled with
      CONFIG_IPV6_ROUTER_PREF we should also select a route with unknown
      NUD state but we must not avoid doing round robin selection on routes
      with the same target. So introduce and pass down a boolean ``do_rr'' to
      indicate when we should update rt->rr_ptr. As soon as no route is valid
      we do backtracking and do a lookup on a higher level in the fib trie.
      
      v2:
      a) Improved rt6_check_neigh logic (no need to create neighbour there)
         and documented return values.
      
      v3:
      a) Introduce enum rt6_nud_state to get rid of the magic numbers
         (thanks to David Miller).
      b) Update and shorten commit message a bit to actually reflect
         the source.
      Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • alx: fix lockdep annotation · 61b6f128
      Maarten Lankhorst authored
      [ Upstream commit a8798a5c ]
      
      Move spin_lock_init to be called before the spinlocks are used, preventing a lockdep splat.
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • 9p: fix off by one causing access violations and memory corruption · 53effe16
      Sasha Levin authored
      [ Upstream commit 110ecd69 ]
      
      p9_release_pages() would attempt to dereference one value past the end of
      pages[]. This would cause the following crashes:
      
      [ 6293.171817] BUG: unable to handle kernel paging request at ffff8807c96f3000
      [ 6293.174146] IP: [<ffffffff8412793b>] p9_release_pages+0x3b/0x60
      [ 6293.176447] PGD 79c5067 PUD 82c1e3067 PMD 82c197067 PTE 80000007c96f3060
      [ 6293.180060] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 6293.180060] Modules linked in:
      [ 6293.180060] CPU: 62 PID: 174043 Comm: modprobe Tainted: G        W    3.10.0-next-20130710-sasha #3954
      [ 6293.180060] task: ffff8807b803b000 ti: ffff880787dde000 task.ti: ffff880787dde000
      [ 6293.180060] RIP: 0010:[<ffffffff8412793b>]  [<ffffffff8412793b>] p9_release_pages+0x3b/0x60
      [ 6293.214316] RSP: 0000:ffff880787ddfc28  EFLAGS: 00010202
      [ 6293.214316] RAX: 0000000000000001 RBX: ffff8807c96f2ff8 RCX: 0000000000000000
      [ 6293.222017] RDX: ffff8807b803b000 RSI: 0000000000000001 RDI: ffffea001c7e3d40
      [ 6293.222017] RBP: ffff880787ddfc48 R08: 0000000000000000 R09: 0000000000000000
      [ 6293.222017] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
      [ 6293.222017] R13: 0000000000000001 R14: ffff8807cc50c070 R15: ffff8807cc50c070
      [ 6293.222017] FS:  00007f572641d700(0000) GS:ffff8807f3600000(0000) knlGS:0000000000000000
      [ 6293.256784] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 6293.256784] CR2: ffff8807c96f3000 CR3: 00000007c8e81000 CR4: 00000000000006e0
      [ 6293.256784] Stack:
      [ 6293.256784]  ffff880787ddfcc8 ffff880787ddfcc8 0000000000000000 ffff880787ddfcc8
      [ 6293.256784]  ffff880787ddfd48 ffffffff84128be8 ffff880700000002 0000000000000001
      [ 6293.256784]  ffff8807b803b000 ffff880787ddfce0 0000100000000000 0000000000000000
      [ 6293.256784] Call Trace:
      [ 6293.256784]  [<ffffffff84128be8>] p9_virtio_zc_request+0x598/0x630
      [ 6293.256784]  [<ffffffff8115c610>] ? wake_up_bit+0x40/0x40
      [ 6293.256784]  [<ffffffff841209b1>] p9_client_zc_rpc+0x111/0x3a0
      [ 6293.256784]  [<ffffffff81174b78>] ? sched_clock_cpu+0x108/0x120
      [ 6293.256784]  [<ffffffff84122a21>] p9_client_read+0xe1/0x2c0
      [ 6293.256784]  [<ffffffff81708a90>] v9fs_file_read+0x90/0xc0
      [ 6293.256784]  [<ffffffff812bd073>] vfs_read+0xc3/0x130
      [ 6293.256784]  [<ffffffff811a78bd>] ? trace_hardirqs_on+0xd/0x10
      [ 6293.256784]  [<ffffffff812bd5a2>] SyS_read+0x62/0xa0
      [ 6293.256784]  [<ffffffff841a1a00>] tracesys+0xdd/0xe2
      [ 6293.256784] Code: 66 90 48 89 fb 41 89 f5 48 8b 3f 48 85 ff 74 29 85 f6 74 25 45 31 e4 66 0f 1f 84 00 00 00 00 00 e8 eb 14 12 fd 41 ff c4 49 63 c4 <48> 8b 3c c3 48 85 ff 74 05 45 39 e5 75 e7 48 83 c4 08 5b 41 5c
      [ 6293.256784] RIP  [<ffffffff8412793b>] p9_release_pages+0x3b/0x60
      [ 6293.256784]  RSP <ffff880787ddfc28>
      [ 6293.256784] CR2: ffff8807c96f3000
      [ 6293.256784] ---[ end trace 50822ee72cd360fc ]---
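The off-by-one described above can be illustrated with a small userspace sketch. It is not the kernel code: `fake_page`, `fake_put_page` and `release_pages_bounded` are hypothetical names, and the only point shown is that the index is compared against the count before the array slot is read, so the slot one past the end is never touched.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page; only a reference count matters here. */
struct fake_page {
	int refcount;
};

/* Hypothetical stand-in for put_page(): drop one reference. */
static void fake_put_page(struct fake_page *page)
{
	page->refcount--;
}

/*
 * Release up to nr_pages entries of pages[].
 * The loop condition checks i against nr_pages before pages[i] is read,
 * so pages[nr_pages] is never dereferenced.
 * Returns the number of pages that were released.
 */
static int release_pages_bounded(struct fake_page **pages, int nr_pages)
{
	int released = 0;

	for (int i = 0; i < nr_pages; i++) {
		if (pages[i]) {
			fake_put_page(pages[i]);
			released++;
		}
	}
	return released;
}
```

A loop that reads the slot first and checks the count afterwards would perform one extra read on its final iteration, which matches the fault address sitting exactly at a page boundary in the trace above.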
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      53effe16
    • Hannes Frederic Sowa's avatar
      ipv6: in case of link failure remove route directly instead of letting it expire · a025e28a
      Hannes Frederic Sowa authored
      [ Upstream commit 1eb4f758 ]
      
      We could end up expiring a route which is part of an ecmp route set. Doing
      so would invalidate the rt->rt6i_nsiblings calculations and could provoke
      the following panic:
      
      [   80.144667] ------------[ cut here ]------------
      [   80.145172] kernel BUG at net/ipv6/ip6_fib.c:733!
      [   80.145172] invalid opcode: 0000 [#1] SMP
      [   80.145172] Modules linked in: 8021q nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables
      +snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer virtio_balloon snd soundcore i2c_piix4 i2c_core virtio_net virtio_blk
      [   80.145172] CPU: 1 PID: 786 Comm: ping6 Not tainted 3.10.0+ #118
      [   80.145172] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [   80.145172] task: ffff880117fa0000 ti: ffff880118770000 task.ti: ffff880118770000
      [   80.145172] RIP: 0010:[<ffffffff815f3b5d>]  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
      [   80.145172] RSP: 0018:ffff880118771798  EFLAGS: 00010202
      [   80.145172] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011350e480
      [   80.145172] RDX: ffff88011350e238 RSI: 0000000000000004 RDI: ffff88011350f738
      [   80.145172] RBP: ffff880118771848 R08: ffff880117903280 R09: 0000000000000001
      [   80.145172] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011350f680
      [   80.145172] R13: ffff880117903280 R14: ffff880118771890 R15: ffff88011350ef90
      [   80.145172] FS:  00007f02b5127740(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
      [   80.145172] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   80.145172] CR2: 00007f981322a000 CR3: 00000001181b1000 CR4: 00000000000006e0
      [   80.145172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   80.145172] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   80.145172] Stack:
      [   80.145172]  0000000000000001 ffff880100000000 ffff880100000000 ffff880117903280
      [   80.145172]  0000000000000000 ffff880119a4cf00 0000000000000400 00000000000007fa
      [   80.145172]  0000000000000000 0000000000000000 0000000000000000 ffff88011350f680
      [   80.145172] Call Trace:
      [   80.145172]  [<ffffffff815eeceb>] ? rt6_bind_peer+0x4b/0x90
      [   80.145172]  [<ffffffff815ed985>] __ip6_ins_rt+0x45/0x70
      [   80.145172]  [<ffffffff815eee35>] ip6_ins_rt+0x35/0x40
      [   80.145172]  [<ffffffff815ef1e4>] ip6_pol_route.isra.44+0x3a4/0x4b0
      [   80.145172]  [<ffffffff815ef34a>] ip6_pol_route_output+0x2a/0x30
      [   80.145172]  [<ffffffff81616077>] fib6_rule_action+0xd7/0x210
      [   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
      [   80.145172]  [<ffffffff81553026>] fib_rules_lookup+0xc6/0x140
      [   80.145172]  [<ffffffff81616374>] fib6_rule_lookup+0x44/0x80
      [   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
      [   80.145172]  [<ffffffff815edea3>] ip6_route_output+0x73/0xb0
      [   80.145172]  [<ffffffff815dfdf3>] ip6_dst_lookup_tail+0x2c3/0x2e0
      [   80.145172]  [<ffffffff813007b1>] ? list_del+0x11/0x40
      [   80.145172]  [<ffffffff81082a4c>] ? remove_wait_queue+0x3c/0x50
      [   80.145172]  [<ffffffff815dfe4d>] ip6_dst_lookup_flow+0x3d/0xa0
      [   80.145172]  [<ffffffff815fda77>] rawv6_sendmsg+0x267/0xc20
      [   80.145172]  [<ffffffff815a8a83>] inet_sendmsg+0x63/0xb0
      [   80.145172]  [<ffffffff8128eb93>] ? selinux_socket_sendmsg+0x23/0x30
      [   80.145172]  [<ffffffff815218d6>] sock_sendmsg+0xa6/0xd0
      [   80.145172]  [<ffffffff81524a68>] SYSC_sendto+0x128/0x180
      [   80.145172]  [<ffffffff8109825c>] ? update_curr+0xec/0x170
      [   80.145172]  [<ffffffff81041d09>] ? kvm_clock_get_cycles+0x9/0x10
      [   80.145172]  [<ffffffff810afd1e>] ? __getnstimeofday+0x3e/0xd0
      [   80.145172]  [<ffffffff8152509e>] SyS_sendto+0xe/0x10
      [   80.145172]  [<ffffffff8164efd9>] system_call_fastpath+0x16/0x1b
      [   80.145172] Code: fe ff ff 41 f6 45 2a 06 0f 85 ca fe ff ff 49 8b 7e 08 4c 89 ee e8 94 ef ff ff e9 b9 fe ff ff 48 8b 82 28 05 00 00 e9 01 ff ff ff <0f> 0b 49 8b 54 24 30 0d 00 00 40 00 89 83 14 01 00 00 48 89 53
      [   80.145172] RIP  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
      [   80.145172]  RSP <ffff880118771798>
      [   80.387413] ---[ end trace 02f20b7a8b81ed95 ]---
      [   80.390154] Kernel panic - not syncing: Fatal exception in interrupt
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a025e28a
    • Jason Wang's avatar
      macvtap: correctly linearize skb when zerocopy is used · bd31fdd2
      Jason Wang authored
      [ Upstream commit 61d46bf9 ]
      
      Userspace may produce vectors with more than MAX_SKB_FRAGS segments. When
      we linearize part of the skb so that the rest of the iov fits in the
      frags, we need to count copylen into linear when calling
      macvtap_alloc_skb() instead of partly counting it into data_len. Otherwise
      zerocopy_sg_from_iovec() breaks, since its inner counter assumes that
      nr_frags is zero at the beginning. This causes nr_frags to be increased
      wrongly without the correct frags being set.
      
      This bug was introduced by b92946e2
      (macvtap: zerocopy: validate vectors before building skb).
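The accounting described above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the driver code: `SKETCH_MAX_FRAGS`, `sketch_iovec`, `sketch_copylen` and `sketch_alloc_sizes` are hypothetical names, the fragment limit is an arbitrary small number, and header handling is ignored. It only shows that passing the full copy length as the linear size leaves nothing pre-accounted to fragments.

```c
#include <stddef.h>

/* Hypothetical small limit for illustration; not the kernel's MAX_SKB_FRAGS value. */
#define SKETCH_MAX_FRAGS 4

struct sketch_iovec {
	size_t len;
};

/*
 * Bytes that must be copied so that the remaining segments fit into at most
 * SKETCH_MAX_FRAGS fragments: the total length of the first
 * (count - SKETCH_MAX_FRAGS) segments, or 0 if everything already fits.
 */
static size_t sketch_copylen(const struct sketch_iovec *iov, size_t count)
{
	size_t copylen = 0;

	if (count > SKETCH_MAX_FRAGS) {
		for (size_t i = 0; i < count - SKETCH_MAX_FRAGS; i++)
			copylen += iov[i].len;
	}
	return copylen;
}

struct sketch_alloc {
	size_t linear;   /* bytes placed in the linear (head) area */
	size_t data_len; /* bytes pre-accounted to fragments at allocation time */
};

/* Model of an allocator that puts whatever exceeds `linear` into fragments. */
static struct sketch_alloc sketch_alloc_sizes(size_t len, size_t linear)
{
	struct sketch_alloc a;

	if (linear > len)
		linear = len;
	a.linear = linear;
	a.data_len = len - linear;
	return a;
}
```

With six 100-byte segments, `sketch_copylen()` yields 200; calling `sketch_alloc_sizes(200, 200)` leaves `data_len` at 0, whereas a smaller linear size would start the buffer with fragment bytes already counted.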
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bd31fdd2
    • Jason Wang's avatar
      tuntap: correctly linearize skb when zerocopy is used · d09ec76a
      Jason Wang authored
      [ Upstream commit 3dd5c330 ]
      
      Userspace may produce vectors with more than MAX_SKB_FRAGS segments. When
      we linearize part of the skb so that the rest of the iov fits in the
      frags, we need to count copylen into linear when calling tun_alloc_skb()
      instead of partly counting it into data_len. Otherwise
      zerocopy_sg_from_iovec() breaks, since its inner counter assumes that
      nr_frags is zero at the beginning. This causes nr_frags to be increased
      wrongly without the correct frags being set.
      
      This bug was introduced by 0690899b
      (tun: experimental zero copy tx support)
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d09ec76a
    • dingtianhong's avatar
      ifb: fix rcu_sched self-detected stalls · 6683151a
      dingtianhong authored
      [ Upstream commit 440d57bc ]
      
      Commit 16b0dc29 (dummy: fix rcu_sched self-detected stalls) by
      Eric Dumazet fixed this problem in dummy, but ifb has the same
      problem.
      
      Trying to "modprobe ifb numifbs=30000" triggers :
      
      INFO: rcu_sched self-detected stall on CPU
      
      After this splat, RTNL is locked and reboot is needed.
      
      We must call cond_resched() to avoid this, even holding RTNL.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6683151a
    • Dave Kleikamp's avatar
      sunvnet: vnet_port_remove must call unregister_netdev · c51a7a30
      Dave Kleikamp authored
      [ Upstream commit aabb9875 ]
      
      The missing call to unregister_netdev() leaves the interface active
      after the driver is unloaded by rmmod.
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c51a7a30
    • Michael S. Tsirkin's avatar
      vhost-net: fix use-after-free in vhost_net_flush · f5ce1d25
      Michael S. Tsirkin authored
      [ Upstream commit c38e39c3 ]
      
      vhost_net_ubuf_put_and_wait has a confusing name:
      it will actually also free its argument.
      Thus since commit 1280c27f
          "vhost-net: flush outstanding DMAs on memory change"
      vhost_net_flush tries to use the argument after passing it
      to vhost_net_ubuf_put_and_wait, which results
      in a use-after-free.
      To fix, don't free the argument in vhost_net_ubuf_put_and_wait, and
      add a new API for callers that want to free ubufs.
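The split between "put and wait" and "put, wait and free" can be sketched in userspace like this. It is a single-threaded model with hypothetical names (`ubuf_ref`, `ubuf_alloc`, `ubuf_put_and_wait`, `ubuf_put_wait_and_free`, `sketch_flush`); the real code waits for outstanding references to drain, which the sketch only notes in a comment.

```c
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for the ubufs structure. */
struct ubuf_ref {
	int refcount; /* outstanding references, including the owner's */
};

static struct ubuf_ref *ubuf_alloc(void)
{
	struct ubuf_ref *u = malloc(sizeof(*u));

	if (u)
		u->refcount = 1;
	return u;
}

/*
 * Drop the owner's reference. Real code would block here until the count
 * reaches zero; this single-threaded sketch only drops it.
 * The object is NOT freed, so the caller may keep using it.
 */
static void ubuf_put_and_wait(struct ubuf_ref *u)
{
	u->refcount--;
}

/* Separate entry point for callers that are done with the object. */
static void ubuf_put_wait_and_free(struct ubuf_ref *u)
{
	ubuf_put_and_wait(u);
	free(u);
}

/* A flush drains references, then re-arms the same object for further use. */
static void sketch_flush(struct ubuf_ref *u)
{
	ubuf_put_and_wait(u);
	u->refcount = 1; /* safe only because the object was not freed above */
}
```

Keeping the freeing variant under a distinct name makes ownership visible at the call site: a flush path uses the non-freeing call, a teardown path uses the freeing one.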
      Acked-by: Asias He <asias@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5ce1d25
    • Michael S. Tsirkin's avatar
      virtio_net: fix race in RX VQ processing · 2b0e8a4f
      Michael S. Tsirkin authored
      [ Upstream commit cbdadbbf ]
      
      virtio net called virtqueue_enable_cb on RX path after napi_complete, so
      with NAPI_STATE_SCHED clear - outside the implicit napi lock.
      This violates the requirement to synchronize virtqueue_enable_cb wrt
      virtqueue_add_buf.  In particular, used event can move backwards,
      causing us to lose interrupts.
      In a debug build, this can trigger panic within START_USE.
      
      Jason Wang reports that he can trigger the races artificially,
      by adding udelay() in virtqueue_enable_cb() after virtio_mb().
      
      However, we must call napi_complete to clear NAPI_STATE_SCHED before
      polling the virtqueue for used buffers, otherwise napi_schedule_prep in
      a callback will fail, causing us to lose RX events.
      
      To fix, call virtqueue_enable_cb_prepare with NAPI_STATE_SCHED
      set (under napi lock), later call virtqueue_poll with
      NAPI_STATE_SCHED clear (outside the lock).
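The two-step pattern described above (prepare under the lock, poll outside it) can be modelled in userspace roughly as below. The structure and function names are hypothetical, and the sketch omits the memory barrier a real implementation would need between enabling callbacks and reading the used index.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced model of a virtqueue. */
struct sketch_vq {
	uint16_t used_idx;      /* advanced by the device side */
	uint16_t last_used_idx; /* next used entry the driver will consume */
	bool callbacks_enabled;
};

/*
 * Step 1: called while the caller still holds its lock.
 * Enables callbacks and returns a snapshot of the driver's position.
 */
static unsigned sketch_enable_cb_prepare(struct sketch_vq *vq)
{
	vq->callbacks_enabled = true;
	return vq->last_used_idx;
}

/*
 * Step 2: may run after the lock is dropped, because it modifies nothing.
 * It only compares the snapshot with the current used index and reports
 * whether buffers were used in the meantime.
 */
static bool sketch_poll(const struct sketch_vq *vq, unsigned snapshot)
{
	return (uint16_t)snapshot != vq->used_idx;
}
```

If `sketch_poll()` returns true, the caller knows more work arrived after callbacks were enabled and can reschedule processing instead of relying on an interrupt that may already have been missed.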
      Reported-by: Jason Wang <jasowang@redhat.com>
      Tested-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2b0e8a4f
    • Michael S. Tsirkin's avatar
      virtio: support unlocked queue poll · c23b1ece
      Michael S. Tsirkin authored
      [ Upstream commit cc229884 ]
      
      This adds a way to check ring empty state after enable_cb outside any
      locks. Will be used by virtio_net.
      
      Note: there's room for more optimization: caller is likely to have a
      memory barrier already, which means we might be able to get rid of a
      barrier here.  Deferring this optimization until we do some
      benchmarking.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c23b1ece
    • Jongsung Kim's avatar
    • Ben Hutchings's avatar
      sfc: Fix memory leak when discarding scattered packets · 1c6d3d1d
      Ben Hutchings authored
      [ Upstream commit 734d4e15 ]
      
      Commit 2768935a ('sfc: reuse pages to avoid DMA mapping/unmapping
      costs') did not fully take account of DMA scattering which was
      introduced immediately before.  If a received packet is invalid and
      must be discarded, we only drop a reference to the first buffer's
      page, but we need to drop a reference for each buffer the packet
      used.
      
      I think this bug was missed partly because efx_recycle_rx_buffers()
      was not renamed and so no longer does what its name says.  It does not
      change the state of buffers, but only prepares the underlying pages
      for recycling.  Rename it accordingly.
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c6d3d1d
    • Hannes Frederic Sowa's avatar
      ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available · c3a54912
      Hannes Frederic Sowa authored
      [ Upstream commit 3630d400 ]
      
      After the removal of rt->n we do not create a neighbour entry at route
      insertion time (rt6_bind_neighbour is gone). As long as no neighbour is
      created because of "useful traffic" we skip this routing entry because
      rt6_check_neigh cannot pick up a valid neighbour (neigh == NULL) and
      thus returns false.
      
      This change was introduced by commit
      887c95cc ("ipv6: Complete neighbour
      entry removal from dst_entry.")
      
      To quote RFC4191:
      "If the host has no information about the router's reachability, then
      the host assumes the router is reachable."
      
      and also:
      "A host MUST NOT probe a router's reachability in the absence of useful
      traffic that the host would have sent to the router if it were reachable."
      
      So, just assume the router is reachable and let rt6_probe do the
      rest. We don't need to create a neighbour at route insertion time.
      
      If we don't compile with CONFIG_IPV6_ROUTER_PREF (RFC4191 support)
      a neighbour is only valid if its nud_state is NUD_VALID. I did not find
      any references that we should probe the router on route insertion time
      via the other RFCs. So skip this route in that case.
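The decision described above for a route with no neighbour entry can be sketched as a small userspace function. The names `sketch_neigh` and `sketch_check_neigh` are hypothetical, the build-time option is modelled as a plain boolean parameter, and an existing neighbour is reduced to a single "valid" flag, which is simpler than the kernel's state handling.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced neighbour entry: only "is its NUD state valid?". */
struct sketch_neigh {
	bool nud_valid;
};

/*
 * Decide whether a route's next hop should be treated as usable.
 * - With a neighbour entry, this sketch simply uses its validity.
 * - Without one, assume reachability only when router-preference support
 *   is enabled (modelled here as a runtime flag); otherwise skip the route.
 */
static bool sketch_check_neigh(const struct sketch_neigh *neigh,
			       bool router_pref_enabled)
{
	if (neigh)
		return neigh->nud_valid;
	return router_pref_enabled;
}
```

So `sketch_check_neigh(NULL, true)` accepts the route and leaves probing to later traffic, while `sketch_check_neigh(NULL, false)` rejects it.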
      
      v2:
      a) use IS_ENABLED instead of #ifdefs (thanks to Sergei Shtylyov)
      Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c3a54912
    • Hannes Frederic Sowa's avatar
      ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size · 7852c5bf
      Hannes Frederic Sowa authored
      [ Upstream commit 75a493e6 ]
      
      If the socket had an IPV6_MTU value set, ip6_append_data_mtu lost track
      of this when appending the second frame on a corked socket. This results
      in the following splat:
      
      [37598.993962] ------------[ cut here ]------------
      [37598.994008] kernel BUG at net/core/skbuff.c:2064!
      [37598.994008] invalid opcode: 0000 [#1] SMP
      [37598.994008] Modules linked in: tcp_lp uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media vfat fat usb_storage fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat
      +nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi
      +scsi_transport_iscsi rfcomm bnep iTCO_wdt iTCO_vendor_support snd_hda_codec_conexant arc4 iwldvm mac80211 snd_hda_intel acpi_cpufreq mperf coretemp snd_hda_codec microcode cdc_wdm cdc_acm
      [37598.994008]  snd_hwdep cdc_ether snd_seq snd_seq_device usbnet mii joydev btusb snd_pcm bluetooth i2c_i801 e1000e lpc_ich mfd_core ptp iwlwifi pps_core snd_page_alloc mei cfg80211 snd_timer thinkpad_acpi snd tpm_tis soundcore rfkill tpm tpm_bios vhost_net tun macvtap macvlan kvm_intel kvm uinput binfmt_misc
      +dm_crypt i915 i2c_algo_bit drm_kms_helper drm i2c_core wmi video
      [37598.994008] CPU 0
      [37598.994008] Pid: 27320, comm: t2 Not tainted 3.9.6-200.fc18.x86_64 #1 LENOVO 27744PG/27744PG
      [37598.994008] RIP: 0010:[<ffffffff815443a5>]  [<ffffffff815443a5>] skb_copy_and_csum_bits+0x325/0x330
      [37598.994008] RSP: 0018:ffff88003670da18  EFLAGS: 00010202
      [37598.994008] RAX: ffff88018105c018 RBX: 0000000000000004 RCX: 00000000000006c0
      [37598.994008] RDX: ffff88018105a6c0 RSI: ffff88018105a000 RDI: ffff8801e1b0aa00
      [37598.994008] RBP: ffff88003670da78 R08: 0000000000000000 R09: ffff88018105c040
      [37598.994008] R10: ffff8801e1b0aa00 R11: 0000000000000000 R12: 000000000000fff8
      [37598.994008] R13: 00000000000004fc R14: 00000000ffff0504 R15: 0000000000000000
      [37598.994008] FS:  00007f28eea59740(0000) GS:ffff88023bc00000(0000) knlGS:0000000000000000
      [37598.994008] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [37598.994008] CR2: 0000003d935789e0 CR3: 00000000365cb000 CR4: 00000000000407f0
      [37598.994008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [37598.994008] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [37598.994008] Process t2 (pid: 27320, threadinfo ffff88003670c000, task ffff88022c162ee0)
      [37598.994008] Stack:
      [37598.994008]  ffff88022e098a00 ffff88020f973fc0 0000000000000008 00000000000004c8
      [37598.994008]  ffff88020f973fc0 00000000000004c4 ffff88003670da78 ffff8801e1b0a200
      [37598.994008]  0000000000000018 00000000000004c8 ffff88020f973fc0 00000000000004c4
      [37598.994008] Call Trace:
      [37598.994008]  [<ffffffff815fc21f>] ip6_append_data+0xccf/0xfe0
      [37598.994008]  [<ffffffff8158d9f0>] ? ip_copy_metadata+0x1a0/0x1a0
      [37598.994008]  [<ffffffff81661f66>] ? _raw_spin_lock_bh+0x16/0x40
      [37598.994008]  [<ffffffff8161548d>] udpv6_sendmsg+0x1ed/0xc10
      [37598.994008]  [<ffffffff812a2845>] ? sock_has_perm+0x75/0x90
      [37598.994008]  [<ffffffff815c3693>] inet_sendmsg+0x63/0xb0
      [37598.994008]  [<ffffffff812a2973>] ? selinux_socket_sendmsg+0x23/0x30
      [37598.994008]  [<ffffffff8153a450>] sock_sendmsg+0xb0/0xe0
      [37598.994008]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
      [37598.994008]  [<ffffffff8153d97d>] sys_sendto+0x12d/0x180
      [37598.994008]  [<ffffffff810dfb64>] ? __audit_syscall_entry+0x94/0xf0
      [37598.994008]  [<ffffffff81020ed1>] ? syscall_trace_enter+0x231/0x240
      [37598.994008]  [<ffffffff8166a7e7>] tracesys+0xdd/0xe2
      [37598.994008] Code: fe 07 00 00 48 c7 c7 04 28 a6 81 89 45 a0 4c 89 4d b8 44 89 5d a8 e8 1b ac b1 ff 44 8b 5d a8 4c 8b 4d b8 8b 45 a0 e9 cf fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 48
      [37598.994008] RIP  [<ffffffff815443a5>] skb_copy_and_csum_bits+0x325/0x330
      [37598.994008]  RSP <ffff88003670da18>
      [37599.007323] ---[ end trace d69f6a17f8ac8eee ]---
      
      While there, also check if path mtu discovery is activated for this
      socket. The logic was adapted from ip6_append_data when first writing
      on the corked socket.
      
      This bug was introduced with commit
      0c183379 ("ipv6: fix incorrect ipsec
      fragment").
      
      v2:
      a) Replace IPV6_PMTU_DISC_DO with IPV6_PMTUDISC_PROBE.
      b) Don't pass ipv6_pinfo to ip6_append_data_mtu (suggestion by Gao
         feng, thanks!).
      c) Change mtu to unsigned int, else we get a warning about
         non-matching types because of the min()-macro type-check.
      Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7852c5bf
    • Hannes Frederic Sowa's avatar
      ipv6: call udp_push_pending_frames when uncorking a socket with AF_INET pending data · 07243c3d
      Hannes Frederic Sowa authored
      [ Upstream commit 8822b64a ]
      
      We accidentally call down to ip6_push_pending_frames when uncorking
      pending AF_INET data on an ipv6 socket. This results in the following
      splat (from Dave Jones):
      
      skbuff: skb_under_panic: text:ffffffff816765f6 len:48 put:40 head:ffff88013deb6df0 data:ffff88013deb6dec tail:0x2c end:0xc0 dev:<NULL>
      ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:126!
      invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: dccp_ipv4 dccp 8021q garp bridge stp dlci mpoa snd_seq_dummy sctp fuse hidp tun bnep nfnetlink scsi_transport_iscsi rfcomm can_raw can_bcm af_802154 appletalk caif_socket can caif ipt_ULOG x25 rose af_key pppoe pppox ipx phonet irda llc2 ppp_generic slhc p8023 psnap p8022 llc crc_ccitt atm bluetooth
      +netrom ax25 nfc rfkill rds af_rxrpc coretemp hwmon kvm_intel kvm crc32c_intel snd_hda_codec_realtek ghash_clmulni_intel microcode pcspkr snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep usb_debug snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
      CPU: 2 PID: 8095 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #37
      task: ffff8801f52c2520 ti: ffff8801e6430000 task.ti: ffff8801e6430000
      RIP: 0010:[<ffffffff816e759c>]  [<ffffffff816e759c>] skb_panic+0x63/0x65
      RSP: 0018:ffff8801e6431de8  EFLAGS: 00010282
      RAX: 0000000000000086 RBX: ffff8802353d3cc0 RCX: 0000000000000006
      RDX: 0000000000003b90 RSI: ffff8801f52c2ca0 RDI: ffff8801f52c2520
      RBP: ffff8801e6431e08 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: ffff88022ea0c800
      R13: ffff88022ea0cdf8 R14: ffff8802353ecb40 R15: ffffffff81cc7800
      FS:  00007f5720a10740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000005862000 CR3: 000000022843c000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Stack:
       ffff88013deb6dec 000000000000002c 00000000000000c0 ffffffff81a3f6e4
       ffff8801e6431e18 ffffffff8159a9aa ffff8801e6431e90 ffffffff816765f6
       ffffffff810b756b 0000000700000002 ffff8801e6431e40 0000fea9292aa8c0
      Call Trace:
       [<ffffffff8159a9aa>] skb_push+0x3a/0x40
       [<ffffffff816765f6>] ip6_push_pending_frames+0x1f6/0x4d0
       [<ffffffff810b756b>] ? mark_held_locks+0xbb/0x140
       [<ffffffff81694919>] udp_v6_push_pending_frames+0x2b9/0x3d0
       [<ffffffff81694660>] ? udplite_getfrag+0x20/0x20
       [<ffffffff8162092a>] udp_lib_setsockopt+0x1aa/0x1f0
       [<ffffffff811cc5e7>] ? fget_light+0x387/0x4f0
       [<ffffffff816958a4>] udpv6_setsockopt+0x34/0x40
       [<ffffffff815949f4>] sock_common_setsockopt+0x14/0x20
       [<ffffffff81593c31>] SyS_setsockopt+0x71/0xd0
       [<ffffffff816f5d54>] tracesys+0xdd/0xe2
      Code: 00 00 48 89 44 24 10 8b 87 d8 00 00 00 48 89 44 24 08 48 8b 87 e8 00 00 00 48 c7 c7 c0 04 aa 81 48 89 04 24 31 c0 e8 e1 7e ff ff <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55
      RIP  [<ffffffff816e759c>] skb_panic+0x63/0x65
       RSP <ffff8801e6431de8>
      
      This patch adds a check whether the pending data is of address family
      AF_INET and, if so, calls udp_push_pending_frames directly from
      udp_v6_push_pending_frames.
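The dispatch described above can be sketched in userspace as follows. `sketch_udp_sock`, `sketch_path` and `sketch_v6_push_pending` are hypothetical names; the sketch only returns which path would be taken rather than sending anything.

```c
#include <sys/socket.h> /* AF_INET, AF_INET6 */

/* Which push routine the sketch would hand the pending data to. */
enum sketch_path {
	SKETCH_PATH_V4,
	SKETCH_PATH_V6
};

/* Hypothetical reduced socket state: address family of the corked data. */
struct sketch_udp_sock {
	int pending;
};

/*
 * Model of the IPv6 push entry point: if the corked data was queued as
 * AF_INET (for example through a v4-mapped destination), hand it to the
 * IPv4 path before doing any IPv6-specific work.
 */
static enum sketch_path sketch_v6_push_pending(const struct sketch_udp_sock *up)
{
	if (up->pending == AF_INET)
		return SKETCH_PATH_V4;
	return SKETCH_PATH_V6;
}
```

Doing the family check first means no IPv6 header is ever pushed onto a buffer that was sized for IPv4, which is the situation the skb_under_panic above reports.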
      
      This bug was found by Dave Jones with trinity.
      
      (Also move the initialization of fl6 below the AF_INET check, even if
      not strictly necessary.)
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      07243c3d
    • Cong Wang's avatar
      ipip: fix a regression in ioctl · 0e7eadef
      Cong Wang authored
      [ Upstream commit 3b7b514f ]
      
      This is a regression introduced by
      commit fd58156e (IPIP: Use ip-tunneling code.)
      
      Similar to the GRE tunnel, previously we only checked the parameters
      for SIOCADDTUNNEL and SIOCCHGTUNNEL; after that commit, the
      check is applied to all commands.
      
      So, just check for SIOCADDTUNNEL and SIOCCHGTUNNEL.
      
      Also, the check for i_key, o_key etc., which did not exist before,
      is suspicious too; reset them before passing
      to ip_tunnel_ioctl().
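A userspace sketch of the command-specific validation described above. The command enum, the parameter structure and the particular fields checked are all hypothetical simplifications; the sketch only shows that validation runs for the add and change commands, and that the key fields are cleared before the request would be passed on.

```c
#include <errno.h>

/* Hypothetical command set standing in for the tunnel ioctl numbers. */
enum sketch_cmd {
	SKETCH_GET,
	SKETCH_ADD,
	SKETCH_CHG,
	SKETCH_DEL
};

/* Hypothetical reduced tunnel parameters. */
struct sketch_parm {
	int version;   /* IP version in the template header */
	int ihl;       /* header length in 32-bit words */
	unsigned i_key;
	unsigned o_key;
};

/*
 * Validate parameters only for commands that create or modify a tunnel,
 * then clear the key fields, which this tunnel type does not use.
 * Returns 0 on success or -EINVAL for a rejected add/change request.
 */
static int sketch_tunnel_ioctl(struct sketch_parm *p, enum sketch_cmd cmd)
{
	if (cmd == SKETCH_ADD || cmd == SKETCH_CHG) {
		if (p->version != 4 || p->ihl != 5)
			return -EINVAL;
	}

	p->i_key = 0;
	p->o_key = 0;
	return 0; /* a real handler would now call the shared ioctl code */
}
```

A get or delete request with an unfilled header therefore succeeds, while the same parameters are rejected for add or change.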
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e7eadef
    • Wei Yongjun's avatar
      l2tp: add missing .owner to struct pppox_proto · ed7f614a
      Wei Yongjun authored
      [ Upstream commit e1558a93 ]
      
      Add missing .owner of struct pppox_proto. This prevents the
      module from being removed from underneath its users.
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed7f614a