1. 14 Aug, 2014 40 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.14.17 · 946de0e6
      Greg Kroah-Hartman authored
      946de0e6
    • Dave Chinner's avatar
      xfs: log vector rounding leaks log space · 23647e98
      Dave Chinner authored
      commit 110dc24a upstream.
      
      The addition of direct formatting of log items into the CIL
      linear buffer added alignment restrictions that the start of each
      vector needed to be 64 bit aligned. Hence padding was added in
      xlog_finish_iovec() to round up the vector length to ensure the next
      vector started with the correct alignment.
      
      This adds a small number of bytes to the size of
      the linear buffer that is otherwise unused. The issue is that we
      then use the linear buffer size to determine the log space used by
      the log item, and this includes the unused space. Hence when we
      account for space used by the log item, it's more than is actually
      written into the iclogs, and hence we slowly leak this space.
      
      This results on log hangs when reserving space, with threads getting
      stuck with these stack traces:
      
      Call Trace:
      [<ffffffff81d15989>] schedule+0x29/0x70
      [<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0
      [<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140
      [<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220
      [<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310
      .....
      
      The 4 bytes is significant. Brain Foster did all the hard work in
      tracking down a reproducable leak to inode chunk allocation (it went
      away with the ikeep mount option). His rough numbers were that
      creating 50,000 inodes leaked 11 log blocks. This turns out to be
      roughly 800 inode chunks or 1600 inode cluster buffers. That
      works out at roughly 4 bytes per cluster buffer logged, and at that
      I started looking for a 4 byte leak in the buffer logging code.
      
      What I found was that a struct xfs_buf_log_format structure for an
      inode cluster buffer is 28 bytes in length. This gets rounded up to
      32 bytes, but the vector length remains 28 bytes. Hence the CIL
      ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
      for that vector rather than 28 bytes which are written into the log.
      
      The fix for this problem is to separately track the bytes used by
      the log vectors in the item and use that instead of the buffer
      length when accounting for the log space that will be used by the
      formatted log item.
      
      Again, thanks to Brian Foster for doing all the hard work and long
      hours to isolate this leak and make finding the bug relatively
      simple.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Cc: Bill <billstuff2001@sbcglobal.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23647e98
    • Andrey Utkin's avatar
      arch/sparc/math-emu/math_32.c: drop stray break operator · a7129df1
      Andrey Utkin authored
      [ Upstream commit 093758e3 ]
      
      This commit is a guesswork, but it seems to make sense to drop this
      break, as otherwise the following line is never executed and becomes
      dead code. And that following line actually saves the result of
      local calculation by the pointer given in function argument. So the
      proposed change makes sense if this code in the whole makes sense (but I
      am unable to analyze it in the whole).
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641Reported-by: default avatarDavid Binderman <dcb314@hotmail.com>
      Signed-off-by: default avatarAndrey Utkin <andrey.krieger.utkin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7129df1
    • Sowmini Varadhan's avatar
      sparc64: ldc_connect() should not return EINVAL when handshake is in progress. · bb4c9762
      Sowmini Varadhan authored
      [ Upstream commit 4ec1b010 ]
      
      The LDC handshake could have been asynchronously triggered
      after ldc_bind() enables the ldc_rx() receive interrupt-handler
      (and thus intercepts incoming control packets)
      and before vio_port_up() calls ldc_connect(). If that is the case,
      ldc_connect() should return 0 and let the state-machine
      progress.
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: default avatarKarl Volz <karl.volz@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb4c9762
    • Christopher Alexander Tobias Schulze's avatar
      sunsab: Fix detection of BREAK on sunsab serial console · f8cfa5d8
      Christopher Alexander Tobias Schulze authored
      [ Upstream commit fe418231 ]
      
      Fix detection of BREAK on sunsab serial console: BREAK detection was only
      performed when there were also serial characters received simultaneously.
      To handle all BREAKs correctly, the check for BREAK and the corresponding
      call to uart_handle_break() must also be done if count == 0, therefore
      duplicate this code fragment and pull it out of the loop over the received
      characters.
      
      Patch applies to 3.16-rc6.
      Signed-off-by: default avatarChristopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8cfa5d8
    • Christopher Alexander Tobias Schulze's avatar
      bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000 · eb191d89
      Christopher Alexander Tobias Schulze authored
      [ Upstream commit 5cdceab3 ]
      
      Fix regression in bbc i2c temperature and fan control on some Sun systems
      that causes the driver to refuse to load due to the bbc_i2c_bussel resource not
      being present on the (second) i2c bus where the temperature sensors and fan
      control are located. (The check for the number of resources was removed when
      the driver was ported to a pure OF driver in mid 2008.)
      Signed-off-by: default avatarChristopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb191d89
    • David S. Miller's avatar
      sparc64: Guard against flushing openfirmware mappings. · 49e6ec05
      David S. Miller authored
      [ Upstream commit 4ca9a237 ]
      
      Based almost entirely upon a patch by Christopher Alexander Tobias
      Schulze.
      
      In commit db64fe02 ("mm: rewrite vmap
      layer") lazy VMAP tlb flushing was added to the vmalloc layer.  This
      causes problems on sparc64.
      
      Sparc64 has two VMAP mapped regions and they are not contiguous with
      eachother.  First we have the malloc mapping area, then another
      unrelated region, then the vmalloc region.
      
      This "another unrelated region" is where the firmware is mapped.
      
      If the lazy TLB flushing logic in the vmalloc code triggers after
      we've had both a module unload and a vfree or similar, it will pass an
      address range that goes from somewhere inside the malloc region to
      somewhere inside the vmalloc region, and thus covering the
      openfirmware area entirely.
      
      The sparc64 kernel learns about openfirmware's dynamic mappings in
      this region early in the boot, and then services TLB misses in this
      area.  But openfirmware has some locked TLB entries which are not
      mentioned in those dynamic mappings and we should thus not disturb
      them.
      
      These huge lazy TLB flush ranges causes those openfirmware locked TLB
      entries to be removed, resulting in all kinds of problems including
      hard hangs and crashes during reboot/reset.
      
      Besides causing problems like this, such huge TLB flush ranges are
      also incredibly inefficient.  A plea has been made with the author of
      the VMAP lazy TLB flushing code, but for now we'll put a safety guard
      into our flush_tlb_kernel_range() implementation.
      
      Since the implementation has become non-trivial, stop defining it as a
      macro and instead make it a function in a C source file.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      49e6ec05
    • David S. Miller's avatar
      sparc64: Do not insert non-valid PTEs into the TSB hash table. · b4a8febe
      David S. Miller authored
      [ Upstream commit 18f38132 ]
      
      The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
      only be called when the PTE being installed will be accessible by the user.
      
      This is not true for code paths originating from remove_migration_pte().
      
      There are dire consequences for placing a non-valid PTE into the TSB.  The TLB
      miss frramework assumes thatwhen a TSB entry matches we can just load it into
      the TLB and return from the TLB miss trap.
      
      So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
      and over, never satisfying the miss.
      
      Just exit early from update_mmu_cache() and friends in this situation.
      
      Based upon a report and patch from Christopher Alexander Tobias Schulze.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4a8febe
    • David S. Miller's avatar
      sparc64: Add membar to Niagara2 memcpy code. · a8973758
      David S. Miller authored
      [ Upstream commit 5aa4ecfd ]
      
      This is the prevent previous stores from overlapping the block stores
      done by the memcpy loop.
      
      Based upon a glibc patch by Jose E. Marchesi
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8973758
    • David S. Miller's avatar
      sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus. · c184f95b
      David S. Miller authored
      [ Upstream commit b18eb2d7 ]
      
      Access to the TSB hash tables during TLB misses requires that there be
      an atomic 128-bit quad load available so that we fetch a matching TAG
      and DATA field at the same time.
      
      On cpus prior to UltraSPARC-III only virtual address based quad loads
      are available.  UltraSPARC-III and later provide physical address
      based variants which are easier to use.
      
      When we only have virtual address based quad loads available this
      means that we have to lock the TSB into the TLB at a fixed virtual
      address on each cpu when it runs that process.  We can't just access
      the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
      take a recursive TLB miss inside of the TLB miss handler without
      risking running out of hardware trap levels (some trap combinations
      can be deep, such as those generated by register window spill and fill
      traps).
      
      Without huge pages it's working perfectly fine, but when the huge TSB
      got added another chunk of fixed virtual address space was not
      allocated for this second TSB mapping.
      
      So we were mapping both the 8K and 4MB TSBs to the same exact virtual
      address, causing multiple TLB matches which gives undefined behavior.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c184f95b
    • David S. Miller's avatar
      sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses. · 8f0a3c34
      David S. Miller authored
      [ Upstream commit e5c460f4 ]
      
      This was found using Dave Jone's trinity tool.
      
      When a user process which is 32-bit performs a load or a store, the
      cpu chops off the top 32-bits of the effective address before
      translating it.
      
      This is because we run 32-bit tasks with the PSTATE_AM (address
      masking) bit set.
      
      We can't run the kernel with that bit set, so when the kernel accesses
      userspace no address masking occurs.
      
      Since a 32-bit process will have no mappings in that region we will
      properly fault, so we don't try to handle this using access_ok(),
      which can safely just be a NOP on sparc64.
      
      Real faults from 32-bit processes should never generate such addresses
      so a bug check was added long ago, and it barks in the logs if this
      happens.
      
      But it also barks when a kernel user access causes this condition, and
      that _can_ happen.  For example, if a pointer passed into a system call
      is "0xfffffffc" and the kernel access 4 bytes offset from that pointer.
      
      Just handle such faults normally via the exception entries.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f0a3c34
    • David S. Miller's avatar
      sparc64: Give more detailed information in {pgd,pmd}_ERROR() and kill pte_ERROR(). · 89e69c2e
      David S. Miller authored
      [ Upstream commit fe866433 ]
      
      pte_ERROR() is not used anywhere, delete it.
      
      For pgd_ERROR() and pmd_ERROR(), output something similar to x86, giving the address
      of the pgd/pmd as well as it's value.
      
      Also provide the caller, since these macros are invoked from pgd_clear_bad() and
      pmd_clear_bad() which provides little context as to what high level operation was
      occuring when the BAD state was detected.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      89e69c2e
    • David S. Miller's avatar
      sparc64: Add basic validations to {pud,pmd}_bad(). · e4f2eef2
      David S. Miller authored
      [ Upstream commit 26cf4325 ]
      
      Instead of returning false we should at least check the most basic
      things, otherwise page table corruptions will be very difficult to
      debug.
      
      PMD and PTE tables are of size PAGE_SIZE, so none of the sub-PAGE_SIZE
      bits should be set.
      
      We also complement this with a check that the physical address the
      pud/pmd points to is valid memory.
      
      PowerPC was used as a guide while implementating this.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e4f2eef2
    • David S. Miller's avatar
      eb8ab873
    • David S. Miller's avatar
      sparc64: Fix range check in kern_addr_valid(). · 97f73178
      David S. Miller authored
      [ Upstream commit ee73887e ]
      
      In commit b2d43834 ("sparc64: Make
      PAGE_OFFSET variable."), the MAX_PHYS_ADDRESS_BITS value was increased
      (to 47).
      
      This constant reference to '41UL' was missed.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      97f73178
    • David S. Miller's avatar
      sparc64: Fix top-level fault handling bugs. · b97e1868
      David S. Miller authored
      [ Upstream commit 70ffc6eb ]
      
      Make get_user_insn() able to cope with huge PMDs.
      
      Next, make do_fault_siginfo() more robust when get_user_insn() can't
      actually fetch the instruction.  In particular, use the MMU announced
      fault address when that happens, instead of calling
      compute_effective_address() and computing garbage.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b97e1868
    • David S. Miller's avatar
      sparc64: Handle 32-bit tasks properly in compute_effective_address(). · d5ad2c01
      David S. Miller authored
      [ Upstream commit d037d163 ]
      
      If we have a 32-bit task we must chop off the top 32-bits of the
      64-bit value just as the cpu would.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5ad2c01
    • David S. Miller's avatar
      83ad186c
    • David S. Miller's avatar
      sparc64: Fix hex values in comment above pte_modify(). · 4f5e9cf3
      David S. Miller authored
      [ Upstream commit c2e4e676 ]
      
      When _PAGE_SPECIAL and _PAGE_PMD_HUGE were added to the mask, the
      comment was not updated.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f5e9cf3
    • David S. Miller's avatar
      sparc64: Fix bugs in get_user_pages_fast() wrt. THP. · b38b6f28
      David S. Miller authored
      [ Upstream commit 04df419d ]
      
      The large PMD path needs to check _PAGE_VALID not _PAGE_PRESENT, to
      decide if it needs to bail and return 0.
      
      pmd_large() should therefore just check _PAGE_PMD_HUGE.
      
      Calls to gup_huge_pmd() are guarded with a check of pmd_large(), so we
      just need to add a valid bit check.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b38b6f28
    • David S. Miller's avatar
      sparc64: Fix huge PMD invalidation. · a73d6bcc
      David S. Miller authored
      [ Upstream commit 51e5ef1b ]
      
      On sparc64 "present" and "valid" are seperate PTE bits, this allows us to
      naturally distinguish between the user explicitly asking for PROT_NONE
      with mprotect() and other situations.
      
      However we weren't handling this properly in the huge PMD paths.
      
      First of all, the page table walker in the TSB miss path only checks
      for _PAGE_PMD_HUGE.  So the generic pmdp_invalidate() would clear
      _PAGE_PRESENT but the TLB miss paths would still load it into the TLB
      as a valid huge PMD.
      
      Fix this by clearing the valid bit in pmdp_invalidate(), and also
      checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez"
      since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a73d6bcc
    • David S. Miller's avatar
      sparc64: Fix executable bit testing in set_pmd_at() paths. · 24924bc7
      David S. Miller authored
      [ Upstream commit 5b1e94fa ]
      
      This code was mistakenly using the exec bit from the PMD in all
      cases, even when the PMD isn't a huge PMD.
      
      If it's not a huge PMD, test the exec bit in the individual ptes down
      in tlb_batch_pmd_scan().
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24924bc7
    • Kirill Tkhai's avatar
      sparc64: Make itc_sync_lock raw · 7f6073dc
      Kirill Tkhai authored
      [ Upstream commit 49b6c01f ]
      
      One more place where we must not be able
      to be preempted or to be interrupted in RT.
      
      Always actually disable interrupts during
      synchronization cycle.
      Signed-off-by: default avatarKirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f6073dc
    • David S. Miller's avatar
      sparc64: Fix argument sign extension for compat_sys_futex(). · 4e81df2a
      David S. Miller authored
      [ Upstream commit aa3449ee ]
      
      Only the second argument, 'op', is signed.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e81df2a
    • Eric Dumazet's avatar
      sctp: fix possible seqlock seadlock in sctp_packet_transmit() · 662a0e38
      Eric Dumazet authored
      [ Upstream commit 757efd32 ]
      
      Dave reported following splat, caused by improper use of
      IP_INC_STATS_BH() in process context.
      
      BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
      caller is __this_cpu_preempt_check+0x13/0x20
      CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
       ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
       0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
       ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
      Call Trace:
       [<ffffffff9e7ee207>] dump_stack+0x4e/0x7a
       [<ffffffff9e397eaa>] check_preemption_disabled+0xfa/0x100
       [<ffffffff9e397ee3>] __this_cpu_preempt_check+0x13/0x20
       [<ffffffffc0839872>] sctp_packet_transmit+0x692/0x710 [sctp]
       [<ffffffffc082a7f2>] sctp_outq_flush+0x2a2/0xc30 [sctp]
       [<ffffffff9e0d985c>] ? mark_held_locks+0x7c/0xb0
       [<ffffffff9e7f8c6d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
       [<ffffffffc082b99a>] sctp_outq_uncork+0x1a/0x20 [sctp]
       [<ffffffffc081e112>] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
       [<ffffffffc081c86b>] sctp_do_sm+0xdb/0x330 [sctp]
       [<ffffffff9e0b8f1b>] ? preempt_count_sub+0xab/0x100
       [<ffffffffc083b350>] ? sctp_cname+0x70/0x70 [sctp]
       [<ffffffffc08389ca>] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
       [<ffffffffc083358f>] sctp_sendmsg+0x88f/0xe30 [sctp]
       [<ffffffff9e0d673a>] ? lock_release_holdtime.part.28+0x9a/0x160
       [<ffffffff9e0d62ce>] ? put_lock_stats.isra.27+0xe/0x30
       [<ffffffff9e73b624>] inet_sendmsg+0x104/0x220
       [<ffffffff9e73b525>] ? inet_sendmsg+0x5/0x220
       [<ffffffff9e68ac4e>] sock_sendmsg+0x9e/0xe0
       [<ffffffff9e1c0c09>] ? might_fault+0xb9/0xc0
       [<ffffffff9e1c0bae>] ? might_fault+0x5e/0xc0
       [<ffffffff9e68b234>] SYSC_sendto+0x124/0x1c0
       [<ffffffff9e0136b0>] ? syscall_trace_enter+0x250/0x330
       [<ffffffff9e68c3ce>] SyS_sendto+0xe/0x10
       [<ffffffff9e7f9be4>] tracesys+0xdd/0xe2
      
      This is a followup of commits f1d8cba6 ("inet: fix possible
      seqlock deadlocks") and 7f88c6b2 ("ipv6: fix possible seqlock
      deadlock in ip6_finish_output2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      662a0e38
    • Sven Eckelmann's avatar
      batman-adv: Fix out-of-order fragmentation support · 1ae7e76c
      Sven Eckelmann authored
      [ Upstream commit d9124268 ]
      
      batadv_frag_insert_packet was unable to handle out-of-order packets because it
      dropped them directly. This is caused by the way the fragmentation lists is
      checked for the correct place to insert a fragmentation entry.
      
      The fragmentation code keeps the fragments in lists. The fragmentation entries
      are kept in descending order of sequence number. The list is traversed and each
      entry is compared with the new fragment. If the current entry has a smaller
      sequence number than the new fragment then the new one has to be inserted
      before the current entry. This ensures that the list is still in descending
      order.
      
      An out-of-order packet with a smaller sequence number than all entries in the
      list still has to be added to the end of the list. The used hlist has no
      information about the last entry in the list inside hlist_head and thus the
      last entry has to be calculated differently. Currently the code assumes that
      the iterator variable of hlist_for_each_entry can be used for this purpose
      after the hlist_for_each_entry finished. This is obviously wrong because the
      iterator variable is always NULL when the list was completely traversed.
      
      Instead the information about the last entry has to be stored in a different
      variable.
      
      This problem was introduced in 610bfc6b
      ("batman-adv: Receive fragmented packets and merge").
      Signed-off-by: default avatarSven Eckelmann <sven@narfation.org>
      Signed-off-by: default avatarMarek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: default avatarAntonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ae7e76c
    • Sasha Levin's avatar
      iovec: make sure the caller actually wants anything in memcpy_fromiovecend · b79cc879
      Sasha Levin authored
      [ Upstream commit 06ebb06d ]
      
      Check for cases when the caller requests 0 bytes instead of running off
      and dereferencing potentially invalid iovecs.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b79cc879
    • Vlad Yasevich's avatar
      net: Correctly set segment mac_len in skb_segment(). · aa85c9d4
      Vlad Yasevich authored
      [ Upstream commit fcdfe3a7 ]
      
      When performing segmentation, the mac_len value is copied right
      out of the original skb.  However, this value is not always set correctly
      (like when the packet is VLAN-tagged) and we'll end up copying a bad
      value.
      
      One way to demonstrate this is to configure a VM which tags
      packets internally and turn off VLAN acceleration on the forwarding
      bridge port.  The packets show up corrupt like this:
      16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
      (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
              0x0000:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
              0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
              0x0020:  29e3 01c9 f871 0000 0101 080a 000a e833)....q.........3
              0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
              0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              ...
      
      This also leads to awful throughput as GSO packets are dropped and
      cause retransmissions.
      
      The solution is to set the mac_len using the values already available
      in then new skb.  We've already adjusted all of the header offset, so we
      might as well correctly figure out the mac_len using skb_reset_mac_len().
      After this change, packets are segmented correctly and performance
      is restored.
      
      CC: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aa85c9d4
    • Vlad Yasevich's avatar
      macvlan: Initialize vlan_features to turn on offload support. · 6d7a7d36
      Vlad Yasevich authored
      [ Upstream commit 081e83a7 ]
      
      Macvlan devices do not initialize vlan_features.  As a result,
      any vlan devices configured on top of macvlans perform very poorly.
      Initialize vlan_features based on the vlan features of the lower-level
      device.
      Signed-off-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d7a7d36
    • Daniel Borkmann's avatar
      net: sctp: inherit auth_capable on INIT collisions · 672fcd4d
      Daniel Borkmann authored
      [ Upstream commit 1be9a950 ]
      
      Jason reported an oops caused by SCTP on his ARM machine with
      SCTP authentication enabled:
      
      Internal error: Oops: 17 [#1] ARM
      CPU: 0 PID: 104 Comm: sctp-test Not tainted 3.13.0-68744-g3632f30c9b20-dirty #1
      task: c6eefa40 ti: c6f52000 task.ti: c6f52000
      PC is at sctp_auth_calculate_hmac+0xc4/0x10c
      LR is at sg_init_table+0x20/0x38
      pc : [<c024bb80>]    lr : [<c00f32dc>]    psr: 40000013
      sp : c6f538e8  ip : 00000000  fp : c6f53924
      r10: c6f50d80  r9 : 00000000  r8 : 00010000
      r7 : 00000000  r6 : c7be4000  r5 : 00000000  r4 : c6f56254
      r3 : c00c8170  r2 : 00000001  r1 : 00000008  r0 : c6f1e660
      Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
      Control: 0005397f  Table: 06f28000  DAC: 00000015
      Process sctp-test (pid: 104, stack limit = 0xc6f521c0)
      Stack: (0xc6f538e8 to 0xc6f54000)
      [...]
      Backtrace:
      [<c024babc>] (sctp_auth_calculate_hmac+0x0/0x10c) from [<c0249af8>] (sctp_packet_transmit+0x33c/0x5c8)
      [<c02497bc>] (sctp_packet_transmit+0x0/0x5c8) from [<c023e96c>] (sctp_outq_flush+0x7fc/0x844)
      [<c023e170>] (sctp_outq_flush+0x0/0x844) from [<c023ef78>] (sctp_outq_uncork+0x24/0x28)
      [<c023ef54>] (sctp_outq_uncork+0x0/0x28) from [<c0234364>] (sctp_side_effects+0x1134/0x1220)
      [<c0233230>] (sctp_side_effects+0x0/0x1220) from [<c02330b0>] (sctp_do_sm+0xac/0xd4)
      [<c0233004>] (sctp_do_sm+0x0/0xd4) from [<c023675c>] (sctp_assoc_bh_rcv+0x118/0x160)
      [<c0236644>] (sctp_assoc_bh_rcv+0x0/0x160) from [<c023d5bc>] (sctp_inq_push+0x6c/0x74)
      [<c023d550>] (sctp_inq_push+0x0/0x74) from [<c024a6b0>] (sctp_rcv+0x7d8/0x888)
      
      While we already had various kind of bugs in that area
      ec0223ec ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if
      we/peer is AUTH capable") and b14878cc ("net: sctp: cache
      auth_enable per endpoint"), this one is a bit of a different
      kind.
      
      Giving a bit more background on why SCTP authentication is
      needed can be found in RFC4895:
      
        SCTP uses 32-bit verification tags to protect itself against
        blind attackers. These values are not changed during the
        lifetime of an SCTP association.
      
        Looking at new SCTP extensions, there is the need to have a
        method of proving that an SCTP chunk(s) was really sent by
        the original peer that started the association and not by a
        malicious attacker.
      
      To cause this bug, we're triggering an INIT collision between
      peers; normal SCTP handshake where both sides intent to
      authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO
      parameters that are being negotiated among peers:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
      
      RFC4895 says that each endpoint therefore knows its own random
      number and the peer's random number *after* the association
      has been established. The local and peer's random number along
      with the shared key are then part of the secret used for
      calculating the HMAC in the AUTH chunk.
      
      Now, in our scenario, we have 2 threads with 1 non-blocking
      SEQ_PACKET socket each, setting up common shared SCTP_AUTH_KEY
      and SCTP_AUTH_ACTIVE_KEY properly, and each of them calling
      sctp_bindx(3), listen(2) and connect(2) against each other,
      thus the handshake looks similar to this, e.g.:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        <--------- INIT[RANDOM; CHUNKS; HMAC-ALGO] -----------
        -------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] -------->
        ...
      
      Since such collisions can also happen with verification tags,
      the RFC4895 for AUTH rather vaguely says under section 6.1:
      
        In case of INIT collision, the rules governing the handling
        of this Random Number follow the same pattern as those for
        the Verification Tag, as explained in Section 5.2.4 of
        RFC 2960 [5]. Therefore, each endpoint knows its own Random
        Number and the peer's Random Number after the association
        has been established.
      
      In RFC2960, section 5.2.4, we're eventually hitting Action B:
      
        B) In this case, both sides may be attempting to start an
           association at about the same time but the peer endpoint
           started its INIT after responding to the local endpoint's
           INIT. Thus it may have picked a new Verification Tag not
           being aware of the previous Tag it had sent this endpoint.
           The endpoint should stay in or enter the ESTABLISHED
           state but it MUST update its peer's Verification Tag from
           the State Cookie, stop any init or cookie timers that may
           running and send a COOKIE ACK.
      
      In other words, the handling of the Random parameter is the
      same as behavior for the Verification Tag as described in
      Action B of section 5.2.4.
      
      Looking at the code, we exactly hit the sctp_sf_do_dupcook_b()
      case which triggers an SCTP_CMD_UPDATE_ASSOC command to the
      side effect interpreter, and in fact it properly copies over
      peer_{random, hmacs, chunks} parameters from the newly created
      association to update the existing one.
      
      Also, the old asoc_shared_key is being released and based on
      the new params, sctp_auth_asoc_init_active_key() updated.
      However, the issue observed in this case is that the previous
      asoc->peer.auth_capable was 0, and has *not* been updated, so
      that instead of creating a new secret, we're doing an early
      return from the function sctp_auth_asoc_init_active_key()
      leaving asoc->asoc_shared_key as NULL. However, we now have to
      authenticate chunks from the updated chunk list (e.g. COOKIE-ACK).
      
      That in fact causes the server side when responding with ...
      
        <------------------ AUTH; COOKIE-ACK -----------------
      
      ... to trigger a NULL pointer dereference, since in
      sctp_packet_transmit(), it discovers that an AUTH chunk is
      being queued for xmit, and thus it calls sctp_auth_calculate_hmac().
      
      Since the asoc->active_key_id is still inherited from the
      endpoint, and the same as encoded into the chunk, it uses
      asoc->asoc_shared_key, which is still NULL, as an asoc_key
      and dereferences it in ...
      
        crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len)
      
      ... causing an oops. All this happens because sctp_make_cookie_ack()
      called with the *new* association has the peer.auth_capable=1
      and therefore marks the chunk with auth=1 after checking
      sctp_auth_send_cid(), but it is *actually* sent later on over
      the then *updated* association's transport that didn't initialize
      its shared key due to peer.auth_capable=0. Since control chunks
      in that case are not sent by the temporary association which
      are scheduled for deletion, they are issued for xmit via
      SCTP_CMD_REPLY in the interpreter with the context of the
      *updated* association. peer.auth_capable was 0 in the updated
      association (which went from COOKIE_WAIT into ESTABLISHED state),
      since all previous processing that performed sctp_process_init()
      was being done on temporary associations, that we eventually
      throw away each time.
      
      The correct fix is to update to the new peer.auth_capable
      value as well in the collision case via sctp_assoc_update(),
      so that in case the collision migrated from 0 -> 1,
      sctp_auth_asoc_init_active_key() can properly recalculate
      the secret. This therefore fixes the observed server panic.
      
      Fixes: 730fc3d0 ("[SCTP]: Implete SCTP-AUTH parameter processing")
      Reported-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Tested-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      672fcd4d
    • Ivan Vecera's avatar
      bna: fix performance regression · e4a6605a
      Ivan Vecera authored
      [ Upstream commit c36c9d50 ]
      
      The recent commit "e29aa339 bna: Enable Multi Buffer RX" is causing
      a performance regression. It does not properly update 'cmpl' pointer
      at the end of the loop in NAPI handler bnad_cq_process(). The result is
      only one packet / per NAPI-schedule is processed.
      Signed-off-by: default avatarIvan Vecera <ivecera@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e4a6605a
    • Christoph Paasch's avatar
      tcp: Fix integer-overflow in TCP vegas · c796399c
      Christoph Paasch authored
      [ Upstream commit 1f74e613 ]
      
      In vegas we do a multiplication of the cwnd and the rtt. This
      may overflow and thus their result is stored in a u64. However, we first
      need to cast the cwnd so that actually 64-bit arithmetic is done.
      
      Then, we need to do do_div to allow this to be used on 32-bit arches.
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Doug Leith <doug.leith@nuim.ie>
      Fixes: 8d3a564d (tcp: tcp_vegas cong avoid fix)
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c796399c
    • Christoph Paasch's avatar
      tcp: Fix integer-overflows in TCP veno · 11eea601
      Christoph Paasch authored
      [ Upstream commit 45a07695 ]
      
      In veno we do a multiplication of the cwnd and the rtt. This
      may overflow and thus their result is stored in a u64. However, we first
      need to cast the cwnd so that actually 64-bit arithmetic is done.
      
      A first attempt at fixing 76f10177 ([TCP]: TCP Veno congestion
      control) was made by 15913114 (tcp: Overflow bug in Vegas), but it
      failed to add the required cast in tcp_veno_cong_avoid().
      
      Fixes: 76f10177 ([TCP]: TCP Veno congestion control)
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      11eea601
    • Dmitry Popov's avatar
      ip_tunnel(ipv4): fix tunnels with "local any remote $remote_ip" · ec7e7275
      Dmitry Popov authored
      [ Upstream commit 95cb5745 ]
      
      Ipv4 tunnels created with "local any remote $ip" didn't work properly since
      7d442fab (ipv4: Cache dst in tunnels). 99% of packets sent via those tunnels
      had src addr = 0.0.0.0. That was because only dst_entry was cached, although
      fl4.saddr has to be cached too. Every time ip_tunnel_xmit used cached dst_entry
      (tunnel_rtable_get returned non-NULL), fl4.saddr was initialized with
      tnl_params->saddr (= 0 in our case), and wasn't changed until iptunnel_xmit().
      
      This patch adds saddr to ip_tunnel->dst_cache, fixing this issue.
      Reported-by: default avatarSergey Popov <pinkbyte@gentoo.org>
      Signed-off-by: default avatarDmitry Popov <ixaphire@qrator.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec7e7275
    • Florian Fainelli's avatar
      net: phy: re-apply PHY fixups during phy_register_device · 3c9d3600
      Florian Fainelli authored
      [ Upstream commit d92f5dec ]
      
      Commit 87aa9f9c ("net: phy: consolidate PHY reset in phy_init_hw()")
      moved the call to phy_scan_fixups() in phy_init_hw() after a software
      reset is performed.
      
      By the time phy_init_hw() is called in phy_device_register(), no driver
      has been bound to this PHY yet, so all the checks in phy_init_hw()
      against the PHY driver and the PHY driver's config_init function will
      return 0. We will therefore never call phy_scan_fixups() as we should.
      
      Fix this by calling phy_scan_fixups() and check for its return value to
      restore the intended functionality.
      
      This broke PHY drivers which do register an early PHY fixup callback to
      intercept the PHY probing and do things like changing the 32-bits unique
      PHY identifier when a pseudo-PHY address has been used, as well as
      board-specific PHY fixups that need to be applied during driver probe
      time.
      Reported-by: default avatarHauke Merthens <hauke-m@hauke-m.de>
      Reported-by: default avatarJonas Gorski <jogo@openwrt.org>
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c9d3600
    • Andrey Ryabinin's avatar
      net: sendmsg: fix NULL pointer dereference · 4ae82122
      Andrey Ryabinin authored
      [ Upstream commit 40eea803 ]
      
      Sasha's report:
      	> While fuzzing with trinity inside a KVM tools guest running the latest -next
      	> kernel with the KASAN patchset, I've stumbled on the following spew:
      	>
      	> [ 4448.949424] ==================================================================
      	> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
      	> [ 4448.952988] Read of size 2 by thread T19638:
      	> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
      	> [ 4448.956823]  ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
      	> [ 4448.958233]  ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
      	> [ 4448.959552]  0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
      	> [ 4448.961266] Call Trace:
      	> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
      	> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
      	> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
      	> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
      	> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
      	> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
      	> [ 4448.970103] sock_sendmsg (net/socket.c:654)
      	> [ 4448.971584] ? might_fault (mm/memory.c:3741)
      	> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
      	> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
      	> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
      	> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
      	> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
      	> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
      	> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
      	> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
      	> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
      	> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
      	> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
      	> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
      	> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
      	> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
      	> [ 4448.988929] ==================================================================
      
      This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.
      
      After this report there was no usual "Unable to handle kernel NULL pointer dereference"
      and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.
      
      This bug was introduced in f3d33426
      (net: rework recvmsg handler msg_name and msg_namelen logic).
      Commit message states that:
      	"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      	 non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      	 affect sendto as it would bail out earlier while trying to copy-in the
      	 address."
      But in fact this affects sendto when address 0 is mapped and contains
      socket address structure in it. In such case copy-in address will succeed,
      verify_iovec() function will successfully exit with msg->msg_namelen > 0
      and msg->msg_name == NULL.
      
      This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ae82122
    • Eric Dumazet's avatar
      ip: make IP identifiers less predictable · b733ea82
      Eric Dumazet authored
      [ Upstream commit 04ca6973 ]
      
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways exploiting linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if generator is used at most
      once per jiffy, this patch inserts no hole in the ID suite and do not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, adds saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see ID are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions uses a per flow ID generator not changed by this patch.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarJeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: default avatarJedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b733ea82
    • Eric Dumazet's avatar
      inetpeer: get rid of ip_id_count · 265459c3
      Eric Dumazet authored
      [ Upstream commit 73f156a6 ]
      
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      265459c3
    • Dmitry Kravkov's avatar
      bnx2x: fix crash during TSO tunneling · 910c396e
      Dmitry Kravkov authored
      [ Upstream commit fe26566d ]
      
      When TSO packet is transmitted additional BD w/o mapping is used
      to describe the packed. The BD needs special handling in tx
      completion.
      
      kernel: Call Trace:
      kernel: <IRQ>  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
      kernel: [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
      kernel: [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
      kernel: [<ffffffff814a8c0d>] ? find_iova+0x4d/0x90
      kernel: [<ffffffff814ab0e2>] intel_unmap_page.part.36+0x142/0x160
      kernel: [<ffffffff814ad0e6>] intel_unmap_page+0x26/0x30
      kernel: [<ffffffffa01f55d7>] bnx2x_free_tx_pkt+0x157/0x2b0 [bnx2x]
      kernel: [<ffffffffa01f8dac>] bnx2x_tx_int+0xac/0x220 [bnx2x]
      kernel: [<ffffffff8101a0d9>] ? read_tsc+0x9/0x20
      kernel: [<ffffffffa01f8fdb>] bnx2x_poll+0xbb/0x3c0 [bnx2x]
      kernel: [<ffffffff814d041a>] net_rx_action+0x15a/0x250
      kernel: [<ffffffff81067047>] __do_softirq+0xf7/0x290
      kernel: [<ffffffff815f3a5c>] call_softirq+0x1c/0x30
      kernel: [<ffffffff81014d25>] do_softirq+0x55/0x90
      kernel: [<ffffffff810673e5>] irq_exit+0x115/0x120
      kernel: [<ffffffff815f4358>] do_IRQ+0x58/0xf0
      kernel: [<ffffffff815e94ad>] common_interrupt+0x6d/0x6d
      kernel: <EOI>  [<ffffffff810bbff7>] ? clockevents_notify+0x127/0x140
      kernel: [<ffffffff814834df>] ? cpuidle_enter_state+0x4f/0xc0
      kernel: [<ffffffff81483615>] cpuidle_idle_call+0xc5/0x200
      kernel: [<ffffffff8101bc7e>] arch_cpu_idle+0xe/0x30
      kernel: [<ffffffff810b4725>] cpu_startup_entry+0xf5/0x290
      kernel: [<ffffffff815cfee1>] start_secondary+0x265/0x27b
      kernel: ---[ end trace 11aa7726f18d7e80 ]---
      
      Fixes: a848ade4 ("bnx2x: add CSUM and TSO support for encapsulation protocols")
      Reported-by: default avatarYulong Pei <ypei@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Signed-off-by: default avatarDmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      910c396e
    • Tobias Brunner's avatar
      xfrm: Fix installation of AH IPsec SAs · 2a363681
      Tobias Brunner authored
      [ Upstream commit a0e5ef53 ]
      
      The SPI check introduced in ea9884b3
      was intended for IPComp SAs but actually prevented AH SAs from getting
      installed (depending on the SPI).
      
      Fixes: ea9884b3 ("xfrm: check user specified spi for IPComp")
      Cc: Fan Du <fan.du@windriver.com>
      Signed-off-by: default avatarTobias Brunner <tobias@strongswan.org>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a363681