1. 27 Aug, 2014 40 commits
    • net: mvneta: do not schedule in mvneta_tx_timeout · f1b0e3b7
      Willy Tarreau authored
      If a queue timeout is reported, we can oops because of some
      schedules while the caller is atomic, as shown below :
      
        mvneta d0070000.ethernet eth0: tx timeout
        BUG: scheduling while atomic: bash/1528/0x00000100
        Modules linked in: slhttp_ethdiv(C) [last unloaded: slhttp_ethdiv]
        CPU: 2 PID: 1528 Comm: bash Tainted: G        WC   3.13.0-rc4-mvebu-nf #180
        [<c0011bd9>] (unwind_backtrace+0x1/0x98) from [<c000f1ab>] (show_stack+0xb/0xc)
        [<c000f1ab>] (show_stack+0xb/0xc) from [<c02ad323>] (dump_stack+0x4f/0x64)
        [<c02ad323>] (dump_stack+0x4f/0x64) from [<c02abe67>] (__schedule_bug+0x37/0x4c)
        [<c02abe67>] (__schedule_bug+0x37/0x4c) from [<c02ae261>] (__schedule+0x325/0x3ec)
        [<c02ae261>] (__schedule+0x325/0x3ec) from [<c02adb97>] (schedule_timeout+0xb7/0x118)
        [<c02adb97>] (schedule_timeout+0xb7/0x118) from [<c0020a67>] (msleep+0xf/0x14)
        [<c0020a67>] (msleep+0xf/0x14) from [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194)
        [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194) from [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24)
        [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24) from [<c024afc7>] (dev_watchdog+0x18b/0x1c4)
        [<c024afc7>] (dev_watchdog+0x18b/0x1c4) from [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c)
        [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c) from [<c0020cad>] (run_timer_softirq+0x115/0x170)
        [<c0020cad>] (run_timer_softirq+0x115/0x170) from [<c001ccb9>] (__do_softirq+0xbd/0x1a8)
        [<c001ccb9>] (__do_softirq+0xbd/0x1a8) from [<c001cfad>] (irq_exit+0x61/0x98)
        [<c001cfad>] (irq_exit+0x61/0x98) from [<c000d4bf>] (handle_IRQ+0x27/0x60)
        [<c000d4bf>] (handle_IRQ+0x27/0x60) from [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8)
        [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8) from [<c000fba9>] (__irq_usr+0x49/0x60)
      
      Ben Hutchings attempted to propose a better fix consisting in using a
      scheduled work for this, but while it fixed this panic, it caused other
      random freezes and panics proving that the reset sequence in the driver
      is unreliable and that additional fixes should be investigated.
      
      When sending multiple streams over a link limited to 100 Mbps, Tx timeouts
      happen from time to time, and the driver correctly recovers only when the
      function is disabled.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 29021366)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • net: mvneta: use per_cpu stats to fix an SMP lock up · 6a026d8f
      Willy Tarreau authored
      Stats writers are mvneta_rx() and mvneta_tx(). They don't lock anything
      when they update the stats, and as a result, it randomly happens that
      the stats freeze on SMP if two updates happen during stats retrieval.
      This is very easily reproducible by starting two HTTP servers and binding
      each of them to a different CPU, then consulting /proc/net/dev in loops
      during transfers, the interface should immediately lock up. This issue
      also randomly happens upon link state changes during transfers, because
      the stats are collected in this situation, but it takes more attempts to
      reproduce it.
      
      The comments in netdevice.h suggest using per_cpu stats instead to get
      rid of this issue.
      
      This patch implements this. It merges both rx_stats and tx_stats into
      a single "stats" member with a single syncp. Both mvneta_rx() and
      mvneta_tx() now only update a single CPU's counters.
      
      In turn, mvneta_get_stats64() does the summing by iterating over all CPUs
      to get their respective stats.
      
      With this change, stats are still correct and no more lockup is encountered.
      
      Note that this bug was present since the first import of the mvneta
      driver.  It might make sense to backport it to some stable trees. If
      so, it depends on "d33dc73 net: mvneta: increase the 64-bit rx/tx stats
      out of the hot path".
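      
      As a rough sketch of the scheme (the structure and function names below
      are simplified stand-ins, not the actual mvneta code), the writer only
      touches its local CPU's counters inside a u64_stats_sync section, and
      the 64-bit reader sums all CPUs with the fetch/retry loop:
      
        #include <linux/percpu.h>
        #include <linux/netdevice.h>
        #include <linux/u64_stats_sync.h>
        
        struct pcpu_stats {
                u64                     rx_packets;
                u64                     rx_bytes;
                struct u64_stats_sync   syncp;
        };
        
        /* writer (rx/tx path): touches only the local CPU's counters */
        static void stats_add_rx(struct pcpu_stats __percpu *stats, u32 len)
        {
                struct pcpu_stats *s = this_cpu_ptr(stats);
        
                u64_stats_update_begin(&s->syncp);
                s->rx_packets++;
                s->rx_bytes += len;
                u64_stats_update_end(&s->syncp);
        }
        
        /* reader (ndo_get_stats64): sums a consistent snapshot from every CPU */
        static void stats_sum(struct pcpu_stats __percpu *stats,
                              struct rtnl_link_stats64 *out)
        {
                int cpu;
        
                for_each_possible_cpu(cpu) {
                        struct pcpu_stats *s = per_cpu_ptr(stats, cpu);
                        unsigned int start;
                        u64 packets, bytes;
        
                        do {
                                start = u64_stats_fetch_begin(&s->syncp);
                                packets = s->rx_packets;
                                bytes = s->rx_bytes;
                        } while (u64_stats_fetch_retry(&s->syncp, start));
        
                        out->rx_packets += packets;
                        out->rx_bytes += bytes;
                }
        }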
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 74c41b04)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • net: mvneta: increase the 64-bit rx/tx stats out of the hot path · 7d798913
      Willy Tarreau authored
      It is better to count packets and bytes on the stack in 32-bit variables,
      then accumulate them into the 64-bit stats only once at the end. This saves
      two memory writes and two memory barriers per packet. The incoming packet
      rate was increased by 4.7% on the OpenBlocks AX3 thanks to this.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      
      (cherry picked from commit dc4277dd)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan" · 9e1bbd3f
      Johannes Berg authored
      This reverts commit 277d916f as it was
      at least breaking iwlwifi by setting the IEEE80211_TX_CTL_NO_PS_BUFFER
      flag in all kinds of interface modes, not only for AP mode where it is
      appropriate.
      
      To avoid reintroducing the original problem, explicitly check for probe
      request frames in the multicast buffering code.
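      
      The added check is essentially a frame-type test; a minimal sketch of the
      idea (the placement in the mac80211 tx handlers is an assumption and is
      not shown here):
      
        #include <linux/ieee80211.h>
        
        /* sketch: probe requests must not be PS-buffered, or scanning stalls */
        static bool frame_may_be_ps_buffered(struct ieee80211_hdr *hdr)
        {
                return !ieee80211_is_probe_req(hdr->frame_control);
        }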
      
      Cc: stable@vger.kernel.org
      Fixes: 277d916f ("mac80211: move "bufferable MMPDU" check to fix AP mode scan")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      
      (cherry picked from commit 08b99399)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86_64/entry/xen: Do not invoke espfix64 on Xen · 350cc3fc
      Andy Lutomirski authored
      This moves the espfix64 logic into native_iret.  To make this work,
      it gets rid of the native patch for INTERRUPT_RETURN:
      INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.
      
      This changes the 16-bit SS behavior on Xen from OOPSing to leaking
      some bits of the Xen hypervisor's RSP (I think).
      
      [ hpa: this is a nonzero cost on native, but probably not enough to
        measure. Xen needs to fix this in their own code, probably doing
        something equivalent to espfix64. ]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      
      (cherry picked from commit 7209a75d)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86, espfix: Make it possible to disable 16-bit support · 5190bdb7
      H. Peter Anvin authored
      Embedded systems, which may be very memory-size-sensitive, are
      extremely unlikely to ever encounter any 16-bit software, so make it
      a CONFIG_EXPERT option to turn off support for any 16-bit software
      whatsoever.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      
      (cherry picked from commit 34273f41)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86, espfix: Make espfix64 a Kconfig option, fix UML · 54d052e0
      H. Peter Anvin authored
      Make espfix64 a hidden Kconfig option.  This fixes the x86-64 UML
      build which had broken due to the non-existence of init_espfix_bsp()
      in UML: since UML uses its own Kconfig, this option does not appear in
      the UML build.
      
      This also makes it possible to make support for 16-bit segments a
      configuration option, for the people who want to minimize the size of
      the kernel.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Richard Weinberger <richard@nod.at>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      
      (cherry picked from commit 197725de)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86, espfix: Fix broken header guard · 04d5b66d
      H. Peter Anvin authored
      Header guard is #ifndef, not #ifdef...
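      
      For reference, the correct guard shape (the macro name here is only an
      example):
      
        #ifndef _ASM_X86_ESPFIX_H      /* must be #ifndef, not #ifdef */
        #define _ASM_X86_ESPFIX_H
        
        /* declarations ... */
        
        #endif /* _ASM_X86_ESPFIX_H */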
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      
      (cherry picked from commit 20b68535)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86, espfix: Move espfix definitions into a separate header file · 90bfa721
      H. Peter Anvin authored
      Sparse warns that the percpu variables aren't declared before they are
      defined.  Rather than hacking around it, move espfix definitions into
      a proper header file.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      
      (cherry picked from commit e1fe9ed8)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack · 7f76a341
      H. Peter Anvin authored
      The IRET instruction, when returning to a 16-bit segment, only
      restores the bottom 16 bits of the user space stack pointer.  This
      causes some 16-bit software to break, but it also leaks kernel state
      to user space.  We have a software workaround for that ("espfix") for
      the 32-bit kernel, but it relies on a nonzero stack segment base which
      is not available in 64-bit mode.
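      
      As a plain user-space illustration of the leak (not kernel code): after
      such an IRET only the low 16 bits of the stack pointer come from user
      space, while bits 31:16 still hold kernel stack bits.
      
        #include <stdio.h>
        
        int main(void)
        {
                unsigned int kernel_esp_image = 0x8765fe10u; /* low 32 bits of the kernel stack pointer */
                unsigned short user_sp = 0x0ffe;             /* what the 16-bit program expects in %sp */
        
                /* IRET to a 16-bit SS restores only %sp; the rest is left behind */
                unsigned int esp_seen_by_user =
                        (kernel_esp_image & 0xffff0000u) | user_sp;
        
                printf("esp after iret: %#x (bits 31:16 leak kernel data)\n",
                       esp_seen_by_user);
                return 0;
        }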
      
      In checkin:
      
          b3b42ac2 x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
      
      we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
      the logic that 16-bit support is crippled on 64-bit kernels anyway (no
      V86 support), but it turns out that people are doing stuff like
      running old Win16 binaries under Wine and expect it to work.
      
      This works around this by creating percpu "ministacks", each of which
      is mapped 2^16 times 64K apart.  When we detect that the return SS is
      on the LDT, we copy the IRET frame to the ministack and use the
      relevant alias to return to userspace.  The ministacks are mapped
      readonly, so if IRET faults we promote #GP to #DF which is an IST
      vector and thus has its own stack; we then do the fixup in the #DF
      handler.
      
      (Making #GP an IST exception would make the msr_safe functions unsafe
      in NMI/MC context, and quite possibly have other effects.)
      
      Special thanks to:
      
      - Andy Lutomirski, for the suggestion of using very small stack slots
        and copy (as opposed to map) the IRET frame there, and for the
        suggestion to mark them readonly and let the fault promote to #DF.
      - Konrad Wilk for paravirt fixup and testing.
      - Borislav Petkov for testing help and useful comments.
      Reported-by: Brian Gerst <brgerst@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Lutomriski <amluto@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dirk Hohndel <dirk@hohndel.org>
      Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
      Cc: comex <comexk@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: <stable@vger.kernel.org> # consider after upstream merge
      
      (cherry picked from commit 3891a04a)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • iio: buffer: Fix demux table creation · 86e6d017
      Lars-Peter Clausen authored
      When creating the demux table we need to iterate over the selected scan mask for
      the buffer to get the samples which should be copied to destination buffer.
      Right now the code uses the mask which contains all active channels, which means
      the demux table contains entries which causes it to copy all the samples from
      source to destination buffer one by one without doing any demuxing.
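      
      A minimal sketch of the intended iteration, assuming the 3.x layout where
      struct iio_buffer carries its own scan_mask (names of the helper are
      illustrative only):
      
        #include <linux/bitops.h>
        #include <linux/printk.h>
        #include <linux/iio/iio.h>
        #include <linux/iio/buffer.h>
        
        /* sketch: walk only the channels selected for *this* buffer */
        static void walk_buffer_channels(struct iio_dev *indio_dev,
                                         struct iio_buffer *buffer)
        {
                const unsigned long *mask = (const unsigned long *)buffer->scan_mask;
                int bit;
        
                for_each_set_bit(bit, mask, indio_dev->masklength)
                        pr_debug("demux consumes channel %d\n", bit);
        }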
      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Signed-off-by: Jonathan Cameron <jic23@kernel.org>
      Cc: Stable@vger.kernel.org
      
      (cherry picked from commit 61bd55ce)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • rapidio/tsi721_dma: fix failure to obtain transaction descriptor · 7760818d
      Alexandre Bounine authored
      This is a bug fix for the situation when function tsi721_desc_get() fails
      to obtain a free transaction descriptor.
      
      The bug usually results in a memory access crash dump when data transfer
      scatter-gather list has more entries than size of hardware buffer
      descriptors ring.  This fix ensures that error is properly returned to a
      caller instead of an invalid entry.
      
      This patch is applicable to kernel versions starting from v3.5.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      (cherry picked from commit 0193ed82)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • cfg80211: fix mic_failure tracing · e17b018d
      Eliad Peller authored
      tsc can be NULL (mac80211 currently always passes NULL),
      resulting in a NULL dereference. Check before copying it.
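      
      A sketch of the guarded copy (the 6-byte length mirrors the TKIP/CCMP
      sequence counter size and is illustrative, not taken from the patch):
      
        #include <linux/string.h>
        #include <linux/types.h>
        
        /* sketch: tsc may legitimately be NULL, so never memcpy() from it blindly */
        static void record_tsc(u8 *dst, const u8 *tsc)
        {
                if (tsc)
                        memcpy(dst, tsc, 6);
                else
                        memset(dst, 0, 6);
        }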
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      
      (cherry picked from commit 8c26d458)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86/efi: Include a .bss section within the PE/COFF headers · cb1f6235
      Michael Brown authored
      The PE/COFF headers currently describe only the initialised-data
      portions of the image, and result in no space being allocated for the
      uninitialised-data portions.  Consequently, the EFI boot stub will end
      up overwriting unexpected areas of memory, with unpredictable results.
      
      Fix by including a .bss section in the PE/COFF headers (functionally
      equivalent to the init_size field in the bzImage header).
      Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>
      Cc: Thomas Bächler <thomas@archlinux.org>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      
      (cherry picked from commit c7fb93ec)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • parisc: Remove SA_RESTORER define · 1ef56f20
      John David Anglin authored
      The sa_restorer field in struct sigaction is obsolete and no longer in
      the parisc implementation.  However, the core code assumes the field is
      present if SA_RESTORER is defined. So, the define needs to be removed.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Helge Deller <deller@gmx.de>
      
      (cherry picked from commit 20dbea49)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • Input: fix defuzzing logic · 0880d08e
      Dmitry Torokhov authored
      We attempt to remove noise from coordinates reported by devices in
      input_handle_abs_event(), unfortunately, unless we were dropping the
      event altogether, we were ignoring the adjusted value and were passing
      on the original value instead.
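      
      A simplified sketch of the filter and of the one-line nature of the fix
      (the real helper is input_defuzz_abs_event() in drivers/input/input.c;
      this is a condensed stand-in):
      
        /* sketch of the hysteresis filter applied to ABS events */
        static int defuzz(int value, int old_val, int fuzz)
        {
                if (fuzz) {
                        if (value > old_val - fuzz / 2 && value < old_val + fuzz / 2)
                                return old_val;
                        if (value > old_val - fuzz && value < old_val + fuzz)
                                return (old_val * 3 + value) / 4;
                        if (value > old_val - fuzz * 2 && value < old_val + fuzz * 2)
                                return (old_val + value) / 2;
                }
                return value;
        }
        
        /* buggy:  defuzz(value, old, fuzz);          result discarded        */
        /* fixed:  value = defuzz(value, old, fuzz);  adjusted value passed on */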
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Andrew de los Reyes <adlr@chromium.org>
      Reviewed-by: Benson Leung <bleung@chromium.org>
      Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
      Reviewed-by: Henrik Rydberg <rydberg@euromail.se>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      
      (cherry picked from commit 50c5d36d)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • slab_common: fix the check for duplicate slab names · cde50abc
      Mikulas Patocka authored
      The patch 3e374919 is supposed to fix the
      problem where kmem_cache_create incorrectly reports duplicate cache name
      and fails. The problem is described in the header of that patch.
      
      However, the patch doesn't really fix the problem because of these
      reasons:
      
      * the logic to test for debugging is reversed. It was intended to perform
        the check only if slub debugging is enabled (which implies that caches
        with the same parameters are not merged). Therefore, there should be
        #if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON)
        The current code has the condition reversed and performs the test if
        debugging is disabled.
      
      * slub debugging may be enabled or disabled based on kernel command line,
        CONFIG_SLUB_DEBUG_ON is just the default settings. Therefore the test
        based on definition of CONFIG_SLUB_DEBUG_ON is unreliable.
      
      This patch fixes the problem by removing the test
      "!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never
      checked if the SLUB allocator is used.
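      
      A sketch of the resulting guard inside the sanity check (a fragment,
      heavily simplified from mm/slab_common.c; the surrounding function and
      the exact error handling are not reproduced here):
      
        /*
         * sketch: with SLUB, caches with compatible parameters may be merged
         * and aliased, so an existing cache with the same name is legitimate
         * and must not be treated as an error.
         */
        #if !defined(CONFIG_SLUB)
                list_for_each_entry(s, &slab_caches, list) {
                        if (!strcmp(s->name, name))
                                return -EEXIST;   /* genuine duplicate */
                }
        #endif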
      
      Note to stable kernel maintainers: when backporting this patch, please
      backport also the patch 3e374919.
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org	# 3.6+
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      
      (cherry picked from commit 69461747)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • slab_common: Do not check for duplicate slab names · 890c7afd
      Christoph Lameter authored
      SLUB can alias multiple slab kmem_create_requests to one slab cache to save
      memory and increase the cache hotness. As a result the name of the slab can be
      stale. Only check the name for duplicates if we are in debug mode where we do
      not merge multiple caches.
      
      This fixes the following problem reported by Jonathan Brassow:
      
        The problem with kmem_cache* is this:
      
        *) Assume CONFIG_SLUB is set
        1) kmem_cache_create(name="foo-a")
        - creates new kmem_cache structure
        2) kmem_cache_create(name="foo-b")
        - If identical cache characteristics, it will be merged with the previously
          created cache associated with "foo-a".  The cache's refcount will be
          incremented and an alias will be created via sysfs_slab_alias().
        3) kmem_cache_destroy(<ptr>)
        - Attempting to destroy cache associated with "foo-a", but instead the
          refcount is simply decremented.  I don't even think the sysfs aliases are
          ever removed...
        4) kmem_cache_create(name="foo-a")
        - This FAILS because kmem_cache_sanity_check collides with the existing
          name ("foo-a") associated with the non-removed cache.
      
        This is a problem for RAID (specifically dm-raid) because the name used
        for the kmem_cache_create is ("raid%d-%p", level, mddev).  If the cache
        persists for long enough, the memory address of an old mddev will be
        reused for a new mddev - causing an identical formulation of the cache
        name.  Even though kmem_cache_destroy had long ago been used to delete
        the old cache, the merging of caches has caused the name and cache of that
        old instance to be preserved and causes a collision (and thus failure) in
        kmem_cache_create().  I see this regularly in my testing.
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      
      (cherry picked from commit 3e374919)
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • blkcg: don't call into policy draining if root_blkg is already gone · 5558b256
      Tejun Heo authored
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skipping calling into policy draining if all the blkgs are
      already gone.
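      
      The fix boils down to an early return when the blkgs are gone; in sketch
      form (simplified, not the full blkcg_drain_queue()):
      
        void blkcg_drain_queue(struct request_queue *q)
        {
                lockdep_assert_held(q->queue_lock);
        
                /*
                 * @q could already be exiting and have destroyed all blkgs,
                 * as indicated by a NULL ->root_blkg.  Nothing left to drain.
                 */
                if (!q->root_blkg)
                        return;
        
                blk_throtl_drain(q);
        }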
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Jet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
      
      This reverts commit 254c4407.
      
      It causes crashes with cryptsetup, even after a few iterations and
      updates. Drop it for now.
      
      bio: modify __bio_add_page() to accept pages that don't start a new segment
      
      The original behaviour is to refuse to add a new page if the maximum
      number of segments has been reached, regardless of the fact the page we
      are going to add can be merged into the last segment or not.
      
      Unfortunately, when the system runs under heavy memory fragmentation
      conditions, a driver may try to add multiple pages to the last segment.
      The original code won't accept them and EBUSY will be reported to
      userspace.
      
      This patch modifies the function so it refuses to add a page only in case
      the latter starts a new segment and the maximum number of segments has
      already been reached.
      
      The bug can be easily reproduced with the st driver:
      
      1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE  to 16
      2) modprobe st buffer_kbs=1024
      3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
         dd: error writing `/dev/st0': Device or resource busy
      
      [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      
      block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
      
      SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved
      buffer in bytes as int type.  The value needs to be capped at the request
      queue's max_sectors.  But integer overflow is not correctly handled in
      the calculation when converting max_sectors from sectors to bytes.
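      
      A sketch of an overflow-safe sectors-to-bytes helper for the cap (the
      function name is illustrative):
      
        #include <linux/blkdev.h>
        #include <linux/kernel.h>
        
        /* sketch: clamp before shifting so the byte count always fits in an int */
        static int max_sectors_bytes(struct request_queue *q)
        {
                unsigned int max_sectors = queue_max_sectors(q);
        
                max_sectors = min_t(unsigned int, max_sectors, INT_MAX >> 9);
                return max_sectors << 9;
        }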
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Douglas Gilbert <dgilbert@interlog.com>
      Cc: linux-scsi@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
      
      BLKSECTGET ioctl loads the request queue's max_sectors as unsigned
      short value to the argument pointer.  So if the max_sector is greater
      than USHRT_MAX, the upper 16 bits of that is just discarded.
      
      In such case, USHRT_MAX is more preferable than the lower 16 bits of
      max_sectors.
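      
      In sketch form, the value handed back to user space becomes (helper name
      is illustrative, not the actual ioctl code):
      
        #include <linux/blkdev.h>
        #include <linux/kernel.h>
        
        /* sketch: BLKSECTGET reports an unsigned short, so clamp instead of truncating */
        static unsigned short blksectget_value(struct block_device *bdev)
        {
                return min_t(unsigned int, USHRT_MAX,
                             queue_max_sectors(bdev_get_queue(bdev)));
        }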
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Douglas Gilbert <dgilbert@interlog.com>
      Cc: linux-scsi@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/efi.c: kerneldoc fixing
      
      Adding function documentation and fixing kerneldoc warnings
      ('field: description' uniformization).
      
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/msdos.c: code clean-up
      
      checkpatch fixing:
      WARNING: Missing a blank line after declarations
      WARNING: space prohibited between function name and open parenthesis '('
      ERROR: spaces required around that '<' (ctx:VxV)
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/amiga.c: replace nolevel printk by pr_err
      
      Also add no prefix pr_fmt to avoid any future default format update
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block/partitions/aix.c: replace count*size kzalloc by kcalloc
      
      kcalloc manages count*sizeof overflow.
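      
      The general pattern, as a generic sketch rather than the aix.c hunk itself:
      
        #include <linux/slab.h>
        
        static u32 *alloc_entries(size_t count)
        {
                /* kzalloc(count * sizeof(u32), GFP_KERNEL) can overflow silently;
                 * kcalloc() checks the multiplication and returns NULL instead. */
                return kcalloc(count, sizeof(u32), GFP_KERNEL);
        }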
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
      
      Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
      Martin introduces the function bip_integrity_vecs(get the useful vectors)
      to fix the issue about nr_vecs for inline integrity vectors that reported
      by David Milburn.
      
      But it seems that bip_integrity_vecs() will return the wrong number if the
      bio is not based on any bio_set for some reason (bio->bi_pool == NULL),
      because in that case, the bip_inline_vecs[0] is malloced directly.  So
      here we add the bip_max_vcnt to record the count of vector slots, and
      cleanup the function bip_integrity_vecs().
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: use percpu_ref for mq usage count
      
      Currently, blk-mq uses a percpu_counter to keep track of how many
      usages are in flight.  The percpu_counter is drained while freezing to
      ensure that no usage is left in-flight after freezing is complete.
      blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
      per-cpu gating mechanism.
      
      This type of code has relatively high chance of subtle bugs which are
      extremely difficult to trigger and it's way too hairy to be open coded
      in blk-mq.  percpu_ref can serve the same purpose after the recent
      changes.  This patch replaces the open-coded per-cpu usage counting
      and draining mechanism with percpu_ref.
      
      blk_mq_queue_enter() performs tryget_live on the ref and exit()
      performs put.  blk_mq_freeze_queue() kills the ref and waits until the
      reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
      and wakes up the waiters.
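      
      A condensed sketch of that gating pattern (generic names, not the actual
      blk-mq structures; the wait queue and its initialization are stand-ins):
      
        #include <linux/errno.h>
        #include <linux/kernel.h>
        #include <linux/percpu-refcount.h>
        #include <linux/wait.h>
        
        struct gate {
                struct percpu_ref       ref;
                wait_queue_head_t       wq;     /* woken when the ref drains */
        };
        
        /* meant to be the release callback handed to percpu_ref_init() */
        static void gate_release(struct percpu_ref *ref)
        {
                struct gate *g = container_of(ref, struct gate, ref);
        
                wake_up_all(&g->wq);
        }
        
        static int gate_enter(struct gate *g)           /* cf. queue_enter */
        {
                return percpu_ref_tryget_live(&g->ref) ? 0 : -EBUSY;
        }
        
        static void gate_exit(struct gate *g)           /* cf. queue_exit */
        {
                percpu_ref_put(&g->ref);
        }
        
        static void gate_freeze(struct gate *g)         /* cf. freeze_queue */
        {
                percpu_ref_kill(&g->ref);
                wait_event(g->wq, percpu_ref_is_zero(&g->ref));
        }
        
        static void gate_unfreeze(struct gate *g)       /* cf. unfreeze_queue */
        {
                percpu_ref_reinit(&g->ref);
                wake_up_all(&g->wq);
        }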
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
      
      Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
      anything and it's gonna be further simplified.  Let's flatten it into
      its caller.
      
      This patch doesn't make any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: decouble blk-mq freezing from generic bypassing
      
      blk_mq freezing is entangled with generic bypassing which bypasses
      blkcg and io scheduler and lets IO requests fall through the block
      layer to the drivers in FIFO order.  This allows forward progress on
      IOs with the advanced features disabled so that those features can be
      configured or altered without worrying about stalling IO which may
      lead to deadlock through memory allocation.
      
      However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
      doesn't make use of blkcg or ioscheds and it maps bypassing to
      freezing, which blocks request processing and drains all the in-flight
      ones.  This causes problems as bypassing assumes that request
      processing is online.  blk-mq works around this by conditionally
      allowing request processing for the problem case - during queue
      initialization.
      
      Another weirdity is that except for during queue cleanup, bypassing
      started on the generic side prevents blk-mq from processing new
      requests but doesn't drain the in-flight ones.  This shouldn't break
      anything but again highlights that something isn't quite right here.
      
      The root cause is conflating blk-mq freezing and generic bypassing
      which are two different mechanisms.  The only intersecting purpose
      that they serve is during queue cleanup.  Let's properly separate
      blk-mq freezing from generic bypassing and simply use it where
      necessary.
      
      * request_queue->mq_freeze_depth is added and
        blk_mq_[un]freeze_queue() now operate on this counter instead of
        ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
        but the counter is tested directly.  This will be further updated by
        later changes.
      
      * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
        blk_mq_freeze_queue().  Queue cleanup path now calls
        blk_mq_freeze_queue() directly.
      
      * blk_queue_enter()'s fast path condition is simplified to simply
        check @q->mq_freeze_depth.  Previously, the condition was
      
      	!blk_queue_dying(q) &&
      	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))
      
        mq_freeze_depth is incremented right after dying is set and
        blk_queue_init_done() exception isn't necessary as blk-mq doesn't
        start frozen, which only leaves the blk_queue_bypass() test which
        can be replaced by @q->mq_freeze_depth test.
      
      This change simplifies the code and reduces confusion in the area.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
      
      Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
      skip queue draining if bypass_depth was already above zero.  The
      assumption is that the one which bumped the bypass_depth should have
      performed draining already; however, there's nothing which prevents a
      new instance of bypassing/freezing from starting before the previous
      one finishes draining.  The current code may allow the later
      bypassing/freezing instances to complete while there still are
      in-flight requests which haven't finished draining.
      
      Fix it by draining regardless of bypass_depth.  We still skip draining
      from blk_queue_bypass_start() while the queue is initializing to avoid
      introducing excessive delays during boot.  INIT_DONE setting is moved
      above the initial blk_queue_bypass_end() so that bypassing attempts
      can't slip inbetween.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
      
      blk-mq uses a percpu_counter to keep track of how many usages are in
      flight.  The percpu_counter is drained while freezing to ensure that
      no usage is left in-flight after freezing is complete.
      
      blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
      per-cpu gating mechanism; unfortunately, it contains a subtle bug -
      smp_wmb() in blk_mq_queue_enter() doesn't prevent the CPU from
      fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
      if freezing happens inbetween the caller can slip through and freezing
      can be complete while there are active users.
      
      Use smp_mb() instead so that bypass_depth and mq_usage_counter
      modifications and tests are properly interlocked.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.17/core
      
      Merge the percpu_ref changes from Tejun, he says they are stable now.
      
      percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
      
      Now that explicit invocation of percpu_ref_exit() is necessary to free
      the percpu counter, we can implement percpu_ref_reinit() which
      reinitializes a released percpu_ref.  This can be used to implement a
      scalable gating switch which can be drained and then re-opened without
      worrying about memory allocation failures.
      
      percpu_ref_is_zero() is added to be used in a sanity check in
      percpu_ref_exit().  As this function will be useful for other purposes
      too, make it a public interface.
      
      v2: Use smp_read_barrier_depends() instead of smp_load_acquire().  We
          only need data dep barrier and smp_load_acquire() is stronger and
          heavier on some archs.  Spotted by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      
      percpu-refcount: require percpu_ref to be exited explicitly
      
      Currently, a percpu_ref undoes percpu_ref_init() automatically by
      freeing the allocated percpu area when the percpu_ref is killed.
      While seemingly convenient, this has the following niggles.
      
      * It's impossible to re-init a released reference counter without
        going through re-allocation.
      
      * In the similar vein, it's impossible to initialize a percpu_ref
        count with static percpu variables.
      
      * We need and have an explicit destructor anyway for failure paths -
        percpu_ref_cancel_init().
      
      This patch removes the automatic percpu counter freeing in
      percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
      generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
      is considered but it gets confusing with percpu_ref_kill() while
      "exit" clearly indicates that it's the counterpart of
      percpu_ref_init().
      
      All percpu_ref_cancel_init() users are updated to invoke
      percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
      added to the destruction path of all percpu_ref users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Li Zefan <lizefan@huawei.com>
      
      percpu-refcount: use unsigned long for pcpu_count pointer
      
      percpu_ref->pcpu_count is a percpu pointer with a status flag in its
      lowest bit.  As such, it always goes through arithmetic operations
      which is very cumbersome to do on a pointer.  It has to be first
      cast to unsigned long and then back.
      
      Let's just make the field unsigned long so that we can skip the first
      casts.  While at it, rename it to pcpu_counter_ptr to clarify that
      it's a pointer value.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu-refcount: add helpers for ->percpu_count accesses
      
      * All four percpu_ref_*() operations implemented in the header file
        perform the same operation to determine whether the percpu_ref is
        alive and extract the percpu pointer.  Factor out the common logic
        into __pcpu_ref_alive().  This doesn't change the generated code.
      
      * There are a couple places in percpu-refcount.c which masks out
        PCPU_REF_DEAD to obtain the percpu pointer.  Factor it out into
        pcpu_count_ptr().
      
      * The above changes make the WARN_ON_ONCE() conditional at the top of
        percpu_ref_kill_and_confirm() the only user of REF_STATUS().  Test
        PCPU_REF_DEAD directly and remove REF_STATUS().
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu-refcount: one bit is enough for REF_STATUS
      
      percpu-refcount currently reserves two lowest bits of its percpu
      pointer to indicate its state; however, only one bit is used for
      PCPU_REF_DEAD.
      
      Simplify it by removing PCPU_STATUS_BITS/MASK and testing
      PCPU_REF_DEAD directly.  This also allows the compiler to choose a
      more efficient instruction depending on the architecture.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
      
      ioctx_alloc() reaches inside percpu_ref and directly frees
      ->pcpu_count in its failure path, which is quite gross.  percpu_ref
      has been providing a proper interface to do this,
      percpu_ref_cancel_init(), for quite some time now.  Let's use that
      instead.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      
      workqueue: stronger test in process_one_work()
      
      After the recent changes, when POOL_DISASSOCIATED is cleared, the
      running worker's local CPU should be the same as pool->cpu without any
      exception even during cpu-hotplug.  Update the sanity check in
      process_one_work() accordingly.
      
      This patch changes "(proposition_A && proposition_B && proposition_C)"
      to "(proposition_B && proposition_C)", so if the old compound
      proposition is true, the new one must be true too. so this will not
      hide any possible bug which can be caught by the old test.
      
      tj: Minor updates to the description.
      
      CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      
      workqueue: clear POOL_DISASSOCIATED in rebind_workers()
      
      The commit a9ab775b ("workqueue: directly restore CPU affinity of
      workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
      without also moving "pool->flags &= ~POOL_DISASSOCIATED".
      
      There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
      being moved together, but there isn't any benefit either. We move it
      into rebind_workers() and achieve these benefits:
      
      1) Better readability.  POOL_DISASSOCIATED is cleared in
         rebind_workers() as expected.
      
      2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
         running workers of the pool are on the local CPU (pool->cpu).
      
      tj: Cosmetic updates to the code and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      
      percpu: Use ALIGN macro instead of hand coding alignment calculation
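      
      For reference, the two equivalent forms (a generic example, not the
      actual hunk):
      
        #include <linux/kernel.h>
        
        /* hand-coded round-up to a power-of-two boundary */
        static size_t align_by_hand(size_t off, size_t align)
        {
                return (off + align - 1) & ~(align - 1);
        }
        
        /* the same thing via the ALIGN() macro */
        static size_t align_with_macro(size_t off, size_t align)
        {
                return ALIGN(off, align);
        }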
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      
      percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
      
      __verify_pcpu_ptr() is used to verify that a specified parameter is
      actually an percpu pointer by percpu accessor and operation
      implementations.  Currently, where it's called isn't clearly defined
      and we just ensure that it's invoked at least once for all accessors
      and operations.
      
      The lack of clarity on when it should be called isn't nice and given
      that this is a completely generic issue, there's no reason to make
      archs worry about it.
      
      This patch updates __verify_pcpu_ptr() invocations such that it's
      always invoked from the final generic wrapper once per access or
      operation.  As this is already the case for {raw|this}_cpu_*()
      definitions through __pcpu_size_*(), only the {raw|per|this}_cpu_ptr()
      accessors need to be updated.
      
      This change makes it unnecessary for archs to worry about
      __verify_pcpu_ptr().  x86's arch_raw_cpu_ptr() is updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      
      percpu: prettify percpu header files
      
      percpu macros are difficult to read.  It's partly because they're
      fairly complex but also because they simply lack visual and
      conventional consistency to an unusual degree.  The preceding patches
      tried to organize macro definitions consistently by their roles.  This
      patch makes the following cosmetic changes to improve overall
      readability.
      
      * Use consistent convention for multi-line macro definitions - "do {"
        or "({" are now put on their own lines and the line continuing '\'
        are all put on the same column.
      
      * Temp variables used inside macro are consistently given "__" prefix.
      
      * When a macro argument is passed to another macro or a function,
      putting extra parentheses around it doesn't help anything.  Don't put
        them.
      
      * _this_cpu_generic_*() are renamed to this_cpu_generic_*() so that
        they're consistent with raw_cpu_generic_*().
      
      * Reorganize raw_cpu_*() and this_cpu_*() definitions so that trivial
        wrappers are collected in one place after actual operation
        definitions.
      
      * Other misc cleanups including reorganizing comments.
      
      All changes in this patch are cosmetic and cause no functional
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: use raw_cpu_*() to define __this_cpu_*()
      
      __this_cpu_*() operations are the same as raw_cpu_*() operations
      except for the added __this_cpu_preempt_check().  Curiously, these
      were defined using __pcpu_size_call_*() instead of being layered on top
      of raw_cpu_*().
      
      Let's layer them so that __this_cpu_*() are defined in terms of
      raw_cpu_*().  It's simpler and less error-prone this way.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: reorder macros in percpu header files
      
      * In include/asm-generic/percpu.h, collect {raw|_this}_cpu_generic*()
        macros into one place.  They were dispersed through
      {raw|this}_cpu_*_N() definitions and the visual inconsistency was
        making following the code unnecessarily difficult.
      
      * In include/linux/percpu-defs.h, move __verify_pcpu_ptr() later in
        the file so that it's right above accessor definitions where it's
        actually used.
      
      This is pure reorganization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      
      percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
      
      We're in the process of moving all percpu accessors and operations to
      include/linux/percpu-defs.h so that they're available to arch headers
      without having to include full include/linux/percpu.h which may cause
      cyclic inclusion dependency.
      
      This patch moves {raw|this}_cpu_*() definitions from
      include/linux/percpu.h to include/linux/percpu-defs.h.  The code is
      moved mostly verbatim; however, raw_cpu_*() are placed above
      this_cpu_*() which is more conventional as the raw operations may be
      used to define other variants.
      
      This is pure reorganization.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      
      percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
      
      {raw|this}_cpu_*_N() operations are expected to be provided by archs
      and the generic definitions are provided as fallbacks.  As such, these
      firmly belong to include/asm-generic/percpu.h.
      
      Move the generic definitions to include/asm-generic/percpu.h.  The
      code is moved mostly verbatim; however, raw_cpu_*_N() are placed above
      this_cpu_*_N() which is more conventional as the raw operations may be
      used to define other variants.
      
      This is pure reorganization.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      
      percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
      
      Currently, percpu allows two separate methods for overriding
      {raw|this}_cpu_*() ops - for a given operation, an arch can provide
      whole replacement or sized sub operations to override specific parts
      of it.  e.g. arch either can provide this_cpu_add() or
      this_cpu_add_4() to override only the 4 byte operation.
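      
      For illustration, with only sized overrides an arch supplies e.g.
      this_cpu_add_4() and the generic layer picks the variant by operand
      size.  A hedged sketch of that dispatch, heavily simplified from the
      real __pcpu_size_call() machinery:
      
        /* Size-based dispatch: archs may override any of the sized stems
         * (this_cpu_add_1/2/4/8); the top-level macro itself never changes. */
        #define sketch_this_cpu_add(pcp, val)                           \
        do {                                                            \
                switch (sizeof(pcp)) {                                  \
                case 1: this_cpu_add_1(pcp, val); break;                \
                case 2: this_cpu_add_2(pcp, val); break;                \
                case 4: this_cpu_add_4(pcp, val); break;                \
                case 8: this_cpu_add_8(pcp, val); break;                \
                }                                                       \
        } while (0)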
      
      While quite flexible at a glance, the dual-overriding scheme
      complicates the code path for no actual gain.  It complicates the
      already complex operation definitions and if an arch wants to override
      all sizes, it can easily provide all variants anyway.  In fact, no
      arch is actually making use of whole operation override.
      
      Another oddity is that __this_cpu_*() operations are defined in the
      same way as raw_cpu_*() but ignore full overrides of raw_cpu_*()
      and don't allow full operation overrides of their own, so if an arch
      provides whole overrides for raw_cpu_*() operations, __this_cpu_*()
      ends up using the generic implementations.
      
      More importantly, it takes away the layering between arch-specific and
      generic parts making it impossible for the generic part to implement
      arch-independent features on top of arch-specific overrides.
      
      This patch removes the support for whole operation overrides.  As no
      arch is using it, this doesn't cause any actual difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      
      percpu: reorganize include/linux/percpu-defs.h
      
      Reorganize for better readability.
      
      * Accessor definitions are collected into one place and SMP and UP now
        define them in the same order.
      
      * Definitions are layered when possible - e.g. per_cpu() is now
        defined in terms of this_cpu_ptr().
      
      * Rather pointless comment dropped.
      
      * per_cpu(), __raw_get_cpu_var() and __get_cpu_var() are defined in a
        way which can be shared between SMP and UP and moved out of
        CONFIG_SMP blocks.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      
      percpu: move accessors from include/linux/percpu.h to percpu-defs.h
      
      include/linux/percpu-defs.h is gonna host all accessors and operations
      so that arch headers can make use of them too without worrying about
      circular dependency through include/linux/percpu.h.
      
      This patch moves the following accessors from include/linux/percpu.h
      to include/linux/percpu-defs.h.
      
      * get/put_cpu_var()
      * get/put_cpu_ptr()
      * per_cpu_ptr()
      
      This is pure reorganization.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      
      percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
      
      The roles of the various percpu header files has become unclear.
      There are four header files involved.
      
       include/linux/percpu-defs.h
       include/linux/percpu.h
       include/asm-generic/percpu.h
       arch/*/include/asm/percpu.h
      
      The original intention for include/asm-generic/percpu.h was to provide
      generic definitions for arch-overridable parts; however, it now hosts
      various things which can't be overridden by archs.
      
      Also, include/linux/percpu-defs.h was initially added to contain
      section and percpu variable definition macros so that arch header
      files can make use of them without worrying about introducing cyclic
      inclusion dependency by including include/linux/percpu.h; however,
      arch headers sometimes need to access percpu variables too and this is
      one of the reasons why some accessors were implemented in
      include/asm-generic/percpu.h.
      
      Let's clear up the situation by making include/asm-generic/percpu.h
      contain only arch-overridable parts and moving accessors and
      operations into include/linux/percpu-defs.h.  Note that this patch only
      moves things from include/asm-generic/percpu.h.
      include/linux/percpu.h will be taken care of by later patches.
      
      This patch moves the following.
      
      * SHIFT_PERCPU_PTR() / VERIFY_PERCPU_PTR()
      * per_cpu()
      * raw_cpu_ptr()
      * this_cpu_ptr()
      * __get_cpu_var()
      * __raw_get_cpu_var()
      * __this_cpu_ptr()
      * PER_CPU_[SHARED_]ALIGNED_SECTION
      * PER_CPU_FIRST_SECTION
      
      This patch is pure reorganization.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      
      percpu: introduce arch_raw_cpu_ptr()
      
      Currently, archs can override raw_cpu_ptr() directly; however, we
      wanna build a layer of indirection in the generic part of percpu so
      that we can implement generic features there without affecting archs.
      
      Introduce arch_raw_cpu_ptr() which is used to define raw_cpu_ptr() by
      generic percpu code.  The two are identical for now.  x86 is currently
      the only arch which overrides raw_cpu_ptr() and is converted to
      define arch_raw_cpu_ptr() instead.
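      
      A hedged sketch of the resulting indirection (close to, but not
      necessarily identical with, the final code):
      
        /* Arch hook with a generic fallback ... */
        #ifndef arch_raw_cpu_ptr
        #define arch_raw_cpu_ptr(ptr)   SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
        #endif

        /* ... consumed by the generic accessor, which keeps the verification. */
        #define raw_cpu_ptr(ptr)                                        \
        ({                                                              \
                __verify_pcpu_ptr(ptr);                                 \
                arch_raw_cpu_ptr(ptr);                                  \
        })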
      
      This doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      
      percpu: disallow archs from overriding SHIFT_PERCPU_PTR()
      
      It has been about half a decade since all archs started using the
      dynamic percpu allocator and thus the same SHIFT_PERCPU_PTR()
      implementation.  There's no benefit in overriding SHIFT_PERCPU_PTR()
      anymore.
      
      Remove #ifndef around it to clarify that this is identical regardless
      of the arch.
      
      This patch doesn't cause any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      
      (cherry picked from commits 2a1b4cf2 and 0b462c89)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5558b256
    • Mikulas Patocka's avatar
      block: provide compat ioctl for BLKZEROOUT · 92b512de
      Mikulas Patocka authored
      This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
      to two uint64_t values, so there is no need to translate it.
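      
      For reference, a minimal user-space sketch of invoking the ioctl; the
      device path and length are purely illustrative:
      
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/fs.h>           /* BLKZEROOUT */

        int main(void)
        {
                int fd = open("/dev/sdX", O_WRONLY);    /* illustrative device */
                /* Two uint64_t values: byte offset and byte length to zero. */
                uint64_t range[2] = { 0, 1 << 20 };

                if (fd < 0 || ioctl(fd, BLKZEROOUT, range) < 0)
                        perror("BLKZEROOUT");
                if (fd >= 0)
                        close(fd);
                return 0;
        }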
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org	# 3.7+
      Acked-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      
      (cherry picked from commit 3b3a1814)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      92b512de
    • Antti Palosaari's avatar
      (cherry picked from commit ) · f14a678d
      Antti Palosaari authored
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      f14a678d
    • Andrey Utkin's avatar
      arch/sparc/math-emu/math_32.c: drop stray break operator · 543121a4
      Andrey Utkin authored
      This commit is guesswork, but it seems to make sense to drop this
      break, as otherwise the following line is never executed and becomes
      dead code.  And that following line actually saves the result of the
      local calculation via the pointer given as a function argument.  So the
      proposed change makes sense if this code as a whole makes sense (but I
      am unable to analyze it as a whole).
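      
      To illustrate the pattern being fixed (generic C for clarity, not the
      math-emu source itself):
      
        static void emulated_op(int op, int a, int b, int *rd)
        {
                int res = 0;

                switch (op) {
                case 0:
                        res = a + b;
                        /* An extra "break;" here would make the store below
                         * unreachable dead code. */
                        *rd = res;      /* save the result via the pointer argument */
                        break;
                }
        }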
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641
      Reported-by: default avatarDavid Binderman <dcb314@hotmail.com>
      Signed-off-by: default avatarAndrey Utkin <andrey.krieger.utkin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 093758e3)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      543121a4
    • Sowmini Varadhan's avatar
      sparc64: ldc_connect() should not return EINVAL when handshake is in progress. · 2cec4c2e
      Sowmini Varadhan authored
      The LDC handshake could have been asynchronously triggered
      after ldc_bind() enables the ldc_rx() receive interrupt-handler
      (and thus intercepts incoming control packets)
      and before vio_port_up() calls ldc_connect(). If that is the case,
      ldc_connect() should return 0 and let the state-machine
      progress.
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: default avatarKarl Volz <karl.volz@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 4ec1b010)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      2cec4c2e
    • Christopher Alexander Tobias Schulze's avatar
      sunsab: Fix detection of BREAK on sunsab serial console · 49197f60
      Christopher Alexander Tobias Schulze authored
      Fix detection of BREAK on sunsab serial console: BREAK detection was only
      performed when there were also serial characters received simultaneously.
      To handle all BREAKs correctly, the check for BREAK and the corresponding
      call to uart_handle_break() must also be done if count == 0, therefore
      duplicate this code fragment and pull it out of the loop over the received
      characters.
      
      Patch applies to 3.16-rc6.
      Signed-off-by: default avatarChristopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit fe418231)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      49197f60
    • Christopher Alexander Tobias Schulze's avatar
      bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000 · d6acb84e
      Christopher Alexander Tobias Schulze authored
      Fix regression in bbc i2c temperature and fan control on some Sun systems
      that causes the driver to refuse to load due to the bbc_i2c_bussel resource not
      being present on the (second) i2c bus where the temperature sensors and fan
      control are located. (The check for the number of resources was removed when
      the driver was ported to a pure OF driver in mid 2008.)
      Signed-off-by: default avatarChristopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 5cdceab3)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d6acb84e
    • David S. Miller's avatar
      sparc64: Add membar to Niagara2 memcpy code. · 460e1398
      David S. Miller authored
      This is to prevent previous stores from overlapping the block stores
      done by the memcpy loop.
      
      Based upon a glibc patch by Jose E. Marchesi
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 5aa4ecfd)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      460e1398
    • David S. Miller's avatar
      sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus. · 3e603b66
      David S. Miller authored
      Access to the TSB hash tables during TLB misses requires that there be
      an atomic 128-bit quad load available so that we fetch a matching TAG
      and DATA field at the same time.
      
      On cpus prior to UltraSPARC-III only virtual address based quad loads
      are available.  UltraSPARC-III and later provide physical address
      based variants which are easier to use.
      
      When we only have virtual address based quad loads available this
      means that we have to lock the TSB into the TLB at a fixed virtual
      address on each cpu when it runs that process.  We can't just access
      the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
      take a recursive TLB miss inside of the TLB miss handler without
      risking running out of hardware trap levels (some trap combinations
      can be deep, such as those generated by register window spill and fill
      traps).
      
      Without huge pages it's working perfectly fine, but when the huge TSB
      got added another chunk of fixed virtual address space was not
      allocated for this second TSB mapping.
      
      So we were mapping both the 8K and 4MB TSBs to the same exact virtual
      address, causing multiple TLB matches which gives undefined behavior.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit b18eb2d7)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3e603b66
    • David S. Miller's avatar
      sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses. · 0138a096
      David S. Miller authored
      This was found using Dave Jones' trinity tool.
      
      When a user process which is 32-bit performs a load or a store, the
      cpu chops off the top 32-bits of the effective address before
      translating it.
      
      This is because we run 32-bit tasks with the PSTATE_AM (address
      masking) bit set.
      
      We can't run the kernel with that bit set, so when the kernel accesses
      userspace no address masking occurs.
      
      Since a 32-bit process will have no mappings in that region we will
      properly fault, so we don't try to handle this using access_ok(),
      which can safely just be a NOP on sparc64.
      
      Real faults from 32-bit processes should never generate such addresses
      so a bug check was added long ago, and it barks in the logs if this
      happens.
      
      But it also barks when a kernel user access causes this condition, and
      that _can_ happen.  For example, if a pointer passed into a system call
      is "0xfffffffc" and the kernel access 4 bytes offset from that pointer.
      
      Just handle such faults normally via the exception entries.
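      
      A quick arithmetic illustration of how such an address arises
      (stand-alone user-space C, purely for illustration):
      
        #include <stdio.h>

        int main(void)
        {
                unsigned long uptr = 0xfffffffcUL;      /* pointer from a 32-bit task */
                unsigned long kernel_ea = uptr + 4;     /* kernel reads 4 bytes past it */

                /* On a 64-bit kernel this prints 0x100000000, an address the
                 * 32-bit task itself could never fault on due to address
                 * masking. */
                printf("%#lx\n", kernel_ea);
                return 0;
        }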
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit e5c460f4)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      0138a096
    • David S. Miller's avatar
      sparc64: Fix top-level fault handling bugs. · e5c94847
      David S. Miller authored
      Make get_user_insn() able to cope with huge PMDs.
      
      Next, make do_fault_siginfo() more robust when get_user_insn() can't
      actually fetch the instruction.  In particular, use the MMU announced
      fault address when that happens, instead of calling
      compute_effective_address() and computing garbage.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 70ffc6eb)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e5c94847
    • David S. Miller's avatar
      sparc64: Handle 32-bit tasks properly in compute_effective_address(). · 8268afe4
      David S. Miller authored
      If we have a 32-bit task we must chop off the top 32-bits of the
      64-bit value just as the cpu would.
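      
      A hedged sketch of the masking in plain C (not the actual sparc
      helper):
      
        /* For a 32-bit (compat) task, mirror the cpu's PSTATE_AM behaviour
         * by keeping only the low 32 bits of the computed address. */
        static unsigned long effective_address(unsigned long addr, int task_is_32bit)
        {
                return task_is_32bit ? (addr & 0xffffffffUL) : addr;
        }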
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit d037d163)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      8268afe4
    • Kirill Tkhai's avatar
      sparc64: Make itc_sync_lock raw · d763429a
      Kirill Tkhai authored
      One more place where we must not be able
      to be preempted or to be interrupted in RT.
      
      Always actually disable interrupts during the
      synchronization cycle.
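      
      A hedged sketch of what the conversion amounts to, assuming the usual
      raw spinlock API (not the verbatim diff):
      
        /* raw_ variants stay real, IRQ-disabling spinlocks even on PREEMPT_RT. */
        static DEFINE_RAW_SPINLOCK(itc_sync_lock);

        static void itc_sync_cycle(void)
        {
                unsigned long flags;

                raw_spin_lock_irqsave(&itc_sync_lock, flags);
                /* ... tick/stick synchronization ... */
                raw_spin_unlock_irqrestore(&itc_sync_lock, flags);
        }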
      Signed-off-by: default avatarKirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 49b6c01f)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d763429a
    • David S. Miller's avatar
      sparc64: Fix argument sign extension for compat_sys_futex(). · 8c9cd188
      David S. Miller authored
      Only the second argument, 'op', is signed.
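      
      For illustration, the distinction in plain C (hypothetical helper
      names, not the actual compat wrappers):
      
        #include <stdint.h>

        /* Only a genuinely signed 32-bit argument (here "op") should be
         * sign-extended when widened to 64 bits ... */
        static long widen_signed_op(uint32_t compat_op)
        {
                return (long)(int32_t)compat_op;
        }

        /* ... everything else is zero-extended. */
        static unsigned long widen_unsigned_arg(uint32_t compat_arg)
        {
                return (unsigned long)compat_arg;
        }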
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit aa3449ee)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      8c9cd188
    • Sasha Levin's avatar
      iovec: make sure the caller actually wants anything in memcpy_fromiovecend · 6e37a311
      Sasha Levin authored
      Check for cases when the caller requests 0 bytes instead of running off
      and dereferencing potentially invalid iovecs.
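      
      A hedged user-space sketch of the guard (simplified types; not the
      kernel function itself):
      
        #include <stddef.h>
        #include <string.h>
        #include <sys/uio.h>

        static int copy_from_iovec_end(void *dst, const struct iovec *iov,
                                       size_t offset, size_t len)
        {
                char *out = dst;

                if (len == 0)           /* the fix: never touch iov for 0-byte requests */
                        return 0;

                /* Skip whole entries covered by the starting offset. */
                while (offset >= iov->iov_len) {
                        offset -= iov->iov_len;
                        iov++;
                }

                /* Copy the requested bytes across iovec entries. */
                while (len > 0) {
                        size_t chunk = iov->iov_len - offset;

                        if (chunk > len)
                                chunk = len;
                        memcpy(out, (char *)iov->iov_base + offset, chunk);
                        out += chunk;
                        len -= chunk;
                        offset = 0;
                        iov++;
                }
                return 0;
        }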
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 06ebb06d)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      6e37a311
    • Vlad Yasevich's avatar
      net: Correctly set segment mac_len in skb_segment(). · 750adcb8
      Vlad Yasevich authored
      When performing segmentation, the mac_len value is copied right
      out of the original skb.  However, this value is not always set correctly
      (like when the packet is VLAN-tagged) and we'll end up copying a bad
      value.
      
      One way to demonstrate this is to configure a VM which tags
      packets internally and turn off VLAN acceleration on the forwarding
      bridge port.  The packets show up corrupt like this:
      16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
      (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
              0x0000:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
              0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
              0x0020:  29e3 01c9 f871 0000 0101 080a 000a e833)....q.........3
              0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
              0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              ...
      
      This also leads to awful throughput as GSO packets are dropped and
      cause retransmissions.
      
      The solution is to set the mac_len using the values already available
      in the new skb.  We've already adjusted all of the header offsets, so we
      might as well correctly figure out the mac_len using skb_reset_mac_len().
      After this change, packets are segmented correctly and performance
      is restored.
      
      CC: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit fcdfe3a7)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      750adcb8
    • Vlad Yasevich's avatar
      macvlan: Initialize vlan_features to turn on offload support. · 3f673ad5
      Vlad Yasevich authored
      Macvlan devices do not initialize vlan_features.  As a result,
      any vlan devices configured on top of macvlans perform very poorly.
      Initialize vlan_features based on the vlan features of the lower-level
      device.
      Signed-off-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 081e83a7)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3f673ad5
    • Daniel Borkmann's avatar
      net: sctp: inherit auth_capable on INIT collisions · bf6a0ead
      Daniel Borkmann authored
      Jason reported an oops caused by SCTP on his ARM machine with
      SCTP authentication enabled:
      
      Internal error: Oops: 17 [#1] ARM
      CPU: 0 PID: 104 Comm: sctp-test Not tainted 3.13.0-68744-g3632f30c9b20-dirty #1
      task: c6eefa40 ti: c6f52000 task.ti: c6f52000
      PC is at sctp_auth_calculate_hmac+0xc4/0x10c
      LR is at sg_init_table+0x20/0x38
      pc : [<c024bb80>]    lr : [<c00f32dc>]    psr: 40000013
      sp : c6f538e8  ip : 00000000  fp : c6f53924
      r10: c6f50d80  r9 : 00000000  r8 : 00010000
      r7 : 00000000  r6 : c7be4000  r5 : 00000000  r4 : c6f56254
      r3 : c00c8170  r2 : 00000001  r1 : 00000008  r0 : c6f1e660
      Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
      Control: 0005397f  Table: 06f28000  DAC: 00000015
      Process sctp-test (pid: 104, stack limit = 0xc6f521c0)
      Stack: (0xc6f538e8 to 0xc6f54000)
      [...]
      Backtrace:
      [<c024babc>] (sctp_auth_calculate_hmac+0x0/0x10c) from [<c0249af8>] (sctp_packet_transmit+0x33c/0x5c8)
      [<c02497bc>] (sctp_packet_transmit+0x0/0x5c8) from [<c023e96c>] (sctp_outq_flush+0x7fc/0x844)
      [<c023e170>] (sctp_outq_flush+0x0/0x844) from [<c023ef78>] (sctp_outq_uncork+0x24/0x28)
      [<c023ef54>] (sctp_outq_uncork+0x0/0x28) from [<c0234364>] (sctp_side_effects+0x1134/0x1220)
      [<c0233230>] (sctp_side_effects+0x0/0x1220) from [<c02330b0>] (sctp_do_sm+0xac/0xd4)
      [<c0233004>] (sctp_do_sm+0x0/0xd4) from [<c023675c>] (sctp_assoc_bh_rcv+0x118/0x160)
      [<c0236644>] (sctp_assoc_bh_rcv+0x0/0x160) from [<c023d5bc>] (sctp_inq_push+0x6c/0x74)
      [<c023d550>] (sctp_inq_push+0x0/0x74) from [<c024a6b0>] (sctp_rcv+0x7d8/0x888)
      
      While we already had various kind of bugs in that area
      ec0223ec ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if
      we/peer is AUTH capable") and b14878cc ("net: sctp: cache
      auth_enable per endpoint"), this one is a bit of a different
      kind.
      
      Giving a bit more background on why SCTP authentication is
      needed can be found in RFC4895:
      
        SCTP uses 32-bit verification tags to protect itself against
        blind attackers. These values are not changed during the
        lifetime of an SCTP association.
      
        Looking at new SCTP extensions, there is the need to have a
        method of proving that an SCTP chunk(s) was really sent by
        the original peer that started the association and not by a
        malicious attacker.
      
      To cause this bug, we're triggering an INIT collision between
      peers; a normal SCTP handshake where both sides intend to
      authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO
      parameters that are negotiated among the peers:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
      
      RFC4895 says that each endpoint therefore knows its own random
      number and the peer's random number *after* the association
      has been established. The local and peer's random number along
      with the shared key are then part of the secret used for
      calculating the HMAC in the AUTH chunk.
      
      Now, in our scenario, we have 2 threads with 1 non-blocking
      SEQ_PACKET socket each, setting up common shared SCTP_AUTH_KEY
      and SCTP_AUTH_ACTIVE_KEY properly, and each of them calling
      sctp_bindx(3), listen(2) and connect(2) against each other,
      thus the handshake looks similar to this, e.g.:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        <--------- INIT[RANDOM; CHUNKS; HMAC-ALGO] -----------
        -------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] -------->
        ...
      
      Since such collisions can also happen with verification tags,
      the RFC4895 for AUTH rather vaguely says under section 6.1:
      
        In case of INIT collision, the rules governing the handling
        of this Random Number follow the same pattern as those for
        the Verification Tag, as explained in Section 5.2.4 of
        RFC 2960 [5]. Therefore, each endpoint knows its own Random
        Number and the peer's Random Number after the association
        has been established.
      
      In RFC2960, section 5.2.4, we're eventually hitting Action B:
      
        B) In this case, both sides may be attempting to start an
           association at about the same time but the peer endpoint
           started its INIT after responding to the local endpoint's
           INIT. Thus it may have picked a new Verification Tag not
           being aware of the previous Tag it had sent this endpoint.
           The endpoint should stay in or enter the ESTABLISHED
           state but it MUST update its peer's Verification Tag from
           the State Cookie, stop any init or cookie timers that may
           be running and send a COOKIE ACK.
      
      In other words, the handling of the Random parameter is the
      same as behavior for the Verification Tag as described in
      Action B of section 5.2.4.
      
      Looking at the code, we exactly hit the sctp_sf_do_dupcook_b()
      case which triggers an SCTP_CMD_UPDATE_ASSOC command to the
      side effect interpreter, and in fact it properly copies over
      peer_{random, hmacs, chunks} parameters from the newly created
      association to update the existing one.
      
      Also, the old asoc_shared_key is being released and based on
      the new params, sctp_auth_asoc_init_active_key() updated.
      However, the issue observed in this case is that the previous
      asoc->peer.auth_capable was 0, and has *not* been updated, so
      that instead of creating a new secret, we're doing an early
      return from the function sctp_auth_asoc_init_active_key()
      leaving asoc->asoc_shared_key as NULL. However, we now have to
      authenticate chunks from the updated chunk list (e.g. COOKIE-ACK).
      
      That in fact causes the server side when responding with ...
      
        <------------------ AUTH; COOKIE-ACK -----------------
      
      ... to trigger a NULL pointer dereference, since in
      sctp_packet_transmit(), it discovers that an AUTH chunk is
      being queued for xmit, and thus it calls sctp_auth_calculate_hmac().
      
      Since the asoc->active_key_id is still inherited from the
      endpoint, and the same as encoded into the chunk, it uses
      asoc->asoc_shared_key, which is still NULL, as an asoc_key
      and dereferences it in ...
      
        crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len)
      
      ... causing an oops. All this happens because sctp_make_cookie_ack()
      called with the *new* association has the peer.auth_capable=1
      and therefore marks the chunk with auth=1 after checking
      sctp_auth_send_cid(), but it is *actually* sent later on over
      the then *updated* association's transport that didn't initialize
      its shared key due to peer.auth_capable=0. Since control chunks
      in that case are not sent by the temporary association which
      are scheduled for deletion, they are issued for xmit via
      SCTP_CMD_REPLY in the interpreter with the context of the
      *updated* association. peer.auth_capable was 0 in the updated
      association (which went from COOKIE_WAIT into ESTABLISHED state),
      since all previous processing that performed sctp_process_init()
      was being done on temporary associations, that we eventually
      throw away each time.
      
      The correct fix is to update to the new peer.auth_capable
      value as well in the collision case via sctp_assoc_update(),
      so that in case the collision migrated from 0 -> 1,
      sctp_auth_asoc_init_active_key() can properly recalculate
      the secret. This therefore fixes the observed server panic.
      
      Fixes: 730fc3d0 ("[SCTP]: Implete SCTP-AUTH parameter processing")
      Reported-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Tested-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 1be9a950)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      bf6a0ead
    • Christoph Paasch's avatar
      tcp: Fix integer-overflow in TCP vegas · d5dd8082
      Christoph Paasch authored
      In vegas we do a multiplication of the cwnd and the rtt.  This
      may overflow and thus the result is stored in a u64.  However, we first
      need to cast the cwnd so that the arithmetic is actually done in 64 bits.
      
      Then, we need to use do_div() to allow this to be used on 32-bit arches.
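      
      An illustrative user-space version of the arithmetic (simplified names,
      not the kernel code):
      
        #include <stdint.h>

        /* Without the (uint64_t) cast the multiply is done in 32 bits and
         * can wrap before being stored in the 64-bit variable. */
        static uint32_t vegas_target_cwnd(uint32_t cwnd, uint32_t base_rtt,
                                          uint32_t rtt)
        {
                uint64_t target = (uint64_t)cwnd * base_rtt;    /* 64-bit multiply */

                /* In the kernel the division uses do_div() so it also works
                 * on 32-bit architectures. */
                return (uint32_t)(target / rtt);
        }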
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Doug Leith <doug.leith@nuim.ie>
      Fixes: 8d3a564d (tcp: tcp_vegas cong avoid fix)
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 1f74e613)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d5dd8082
    • Christoph Paasch's avatar
      tcp: Fix integer-overflows in TCP veno · 245f1d57
      Christoph Paasch authored
      In veno we do a multiplication of the cwnd and the rtt.  This
      may overflow and thus the result is stored in a u64.  However, we first
      need to cast the cwnd so that the arithmetic is actually done in 64 bits.
      
      A first attempt at fixing 76f10177 ([TCP]: TCP Veno congestion
      control) was made by 15913114 (tcp: Overflow bug in Vegas), but it
      failed to add the required cast in tcp_veno_cong_avoid().
      
      Fixes: 76f10177 ([TCP]: TCP Veno congestion control)
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 45a07695)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      245f1d57
    • Andrey Ryabinin's avatar
      net: sendmsg: fix NULL pointer dereference · 3d68482f
      Andrey Ryabinin authored
      Sasha's report:
      	> While fuzzing with trinity inside a KVM tools guest running the latest -next
      	> kernel with the KASAN patchset, I've stumbled on the following spew:
      	>
      	> [ 4448.949424] ==================================================================
      	> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
      	> [ 4448.952988] Read of size 2 by thread T19638:
      	> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
      	> [ 4448.956823]  ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
      	> [ 4448.958233]  ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
      	> [ 4448.959552]  0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
      	> [ 4448.961266] Call Trace:
      	> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
      	> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
      	> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
      	> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
      	> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
      	> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
      	> [ 4448.970103] sock_sendmsg (net/socket.c:654)
      	> [ 4448.971584] ? might_fault (mm/memory.c:3741)
      	> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
      	> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
      	> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
      	> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
      	> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
      	> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
      	> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
      	> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
      	> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
      	> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
      	> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
      	> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
      	> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
      	> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
      	> [ 4448.988929] ==================================================================
      
      This report means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.
      
      After this report there was no usual "Unable to handle kernel NULL pointer dereference"
      and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.
      
      This bug was introduced in f3d33426
      (net: rework recvmsg handler msg_name and msg_namelen logic).
      Commit message states that:
      	"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      	 non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      	 affect sendto as it would bail out earlier while trying to copy-in the
      	 address."
      But in fact this affects sendto when address 0 is mapped and contains
      a socket address structure.  In such a case the address copy-in will
      succeed, verify_iovec() will successfully exit with msg->msg_namelen > 0
      and msg->msg_name == NULL.
      
      This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
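      
      A minimal sketch of the fix's effect (simplified structure, not the
      actual kernel hunk):
      
        #include <stddef.h>

        struct msghdr_like {
                void *msg_name;         /* optional destination address */
                int   msg_namelen;      /* its claimed length */
        };

        /* If the caller passed no address, force the length to 0 so later
         * code can never treat address 0 as a valid sockaddr. */
        static void sanitize_name(struct msghdr_like *msg)
        {
                if (msg->msg_name == NULL)
                        msg->msg_namelen = 0;
        }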
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 40eea803)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      3d68482f
    • Gao feng's avatar
      ipv6: reallocate addrconf router for ipv6 address when lo device up · 303ebe0b
      Gao feng authored
      commit 25fb6ca4
      "net IPv6 : Fix broken IPv6 routing table after loopback down-up"
      allocates an addrconf router for the ipv6 address when the lo device
      comes up, but commit a881ae1f
      "ipv6:don't call addrconf_dst_alloc again when enable lo" breaks
      this behavior.
      
      Since the addrconf router is moved to the garbage list when the
      lo device goes down, we should release this router and reallocate
      a new one for the ipv6 address when the lo device comes up.
      
      This patch solves bug 67951 on bugzilla
      https://bugzilla.kernel.org/show_bug.cgi?id=67951
      
      change from v1:
      use ip6_rt_put to replace ip6_del_rt, thanks Hannes!
      change code style, suggested by Sergei.
      
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: default avatarWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: default avatarWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: default avatarGao feng <gaofeng@cn.fujitsu.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      (cherry picked from commit 33d99113)
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      303ebe0b