1. 19 Feb, 2016 40 commits
    • Mathias Krause's avatar
      PCI: Prevent out of bounds access in numa_node override · c72faa97
      Mathias Krause authored
      commit 3dcc8d39 upstream.
      
      Commit 12669631 ("PCI: Prevent out of bounds access in numa_node
      override") missed that the user-provided node could also be negative.
      Handle this case as well to avoid out-of-bounds accesses to the
      node_states[] array.  However, allow the special value -1, i.e.
      NUMA_NO_NODE, to be able to set the 'no specific node' configuration.
      
      Fixes: 12669631 ("PCI: Prevent out of bounds access in numa_node override")
      Fixes: 63692df1 ("PCI: Allow numa_node override via sysfs")
      Signed-off-by: default avatarMathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      CC: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c72faa97
    • Alexander Duyck's avatar
      PCI: Set SR-IOV NumVFs to zero after enumeration · 16a7fcca
      Alexander Duyck authored
      commit ea9a8854 upstream.
      
      The enumeration path should leave NumVFs set to zero.  But after
      4449f079 ("PCI: Calculate maximum number of buses required for VFs"),
      we call virtfn_max_buses() in the enumeration path, which changes NumVFs.
      This NumVFs change is visible via lspci and sysfs until a driver enables
      SR-IOV.
      
      Iterate from TotalVFs down to zero so NumVFs is zero when we're finished
      computing the maximum number of buses.  Validate offset and stride in
      the loop, so we can test it at every possible NumVFs setting.  Rename
      virtfn_max_buses() to compute_max_vf_buses() to hint that it does have a
      side effect of updating iov->max_VF_buses.
      
      [bhelgaas: changelog, rename, allow numVF==1 && stride==0, rework loop,
      reverse sense of error path]
      Fixes: 4449f079 ("PCI: Calculate maximum number of buses required for VFs")
      Based-on-patch-by: default avatarEthan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16a7fcca
    • Gabriele Paoloni's avatar
      PCI: spear: Fix dw_pcie_cfg_read/write() usage · 1d9e01c3
      Gabriele Paoloni authored
      commit fa3b7cba upstream.
      
      The first argument of dw_pcie_cfg_read/write() is a 32-bit aligned address.
      The second argument is the byte offset into a 32-bit word, and
      dw_pcie_cfg_read/write() only look at the low two bits.
      
      SPEAr13xx used dw_pcie_cfg_read() and dw_pcie_cfg_write() incorrectly: it
      passed important address bits in the second argument, where they were
      ignored.
      
      Pass the complete 32-bit word address in the first argument and only the
      2-bit offset into that word in the second argument.
      
      Without this fix, SPEAr13xx host will never work with few buggy gen1 card
      which connects with only gen1 host and also with any endpoint which would
      generate a read request of more than 128 bytes.
      
      [bhelgaas: changelog]
      Reported-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarPratyush Anand <panand@redhat.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d9e01c3
    • Sebastian Siewior's avatar
      mtd: ubi: don't leak e if schedule_erase() fails · 5b33b092
      Sebastian Siewior authored
      commit 6b238de1 upstream.
      
      If __erase_worker() fails to erase the EB and schedule_erase() fails as
      well to do anything about it then we go RO. But that is not a reason to
      leak the e argument here. Therefore clean up e.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b33b092
    • Sebastian Siewior's avatar
      mtd: ubi: fixup error correction in do_sync_erase() · b08e24a5
      Sebastian Siewior authored
      commit 1a31b20c upstream.
      
      Since fastmap we gained do_sync_erase(). This function can return an error
      and its error handling isn't obvious. First the memory allocation for
      struct ubi_work can fail and as such struct ubi_wl_entry is leaked.
      However if the memory allocation succeeds then the tail function takes
      care of the struct ubi_wl_entry. A free here could result in a double
      free.
      To make the error handling simpler, I split the tail function into one
      piece which does the work and another which frees the struct ubi_work
      which is passed as argument. As result do_sync_erase() can keep the
      struct on stack and we get rid of one error source.
      
      Fixes: 8199b901 ("UBI: Add fastmap support to the WL sub-system")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b08e24a5
    • Brian Norris's avatar
      mtd: jz4740_nand: fix build on jz4740 after removing gpio.h · 754b58d9
      Brian Norris authored
      commit 96dd922c upstream.
      
      Fallout from commit 832f5dac ("MIPS: Remove all the uses of custom gpio.h")
      
      We see errors like this:
      
      drivers/mtd/nand/jz4740_nand.c: In function 'jz_nand_detect_bank':
      drivers/mtd/nand/jz4740_nand.c:340:9: error: 'JZ_GPIO_MEM_CS0' undeclared (first use in this function)
      drivers/mtd/nand/jz4740_nand.c:340:9: note: each undeclared identifier is reported only once for each function it appears in
      drivers/mtd/nand/jz4740_nand.c:359:2: error: implicit declaration of function 'jz_gpio_set_function' [-Werror=implicit-function-declaration]
      drivers/mtd/nand/jz4740_nand.c:359:29: error: 'JZ_GPIO_FUNC_MEM_CS0' undeclared (first use in this function)
      drivers/mtd/nand/jz4740_nand.c:399:29: error: 'JZ_GPIO_FUNC_NONE' undeclared (first use in this function)
      drivers/mtd/nand/jz4740_nand.c: In function 'jz_nand_probe':
      drivers/mtd/nand/jz4740_nand.c:528:13: error: 'JZ_GPIO_MEM_CS0' undeclared (first use in this function)
      drivers/mtd/nand/jz4740_nand.c: In function 'jz_nand_remove':
      drivers/mtd/nand/jz4740_nand.c:555:14: error: 'JZ_GPIO_MEM_CS0' undeclared (first use in this function)
      
      Patched similarly to:
      
      https://patchwork.linux-mips.org/patch/11089/
      
      Fixes: 832f5dac ("MIPS: Remove all the uses of custom gpio.h")
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      754b58d9
    • Brian Norris's avatar
      mtd: nand: fix shutdown/reboot for multi-chip systems · 7897886a
      Brian Norris authored
      commit 9ca641b0 upstream.
      
      If multiple NAND chips are registered to the same controller, then when
      rebooting the system, the first one will grab the controller lock, while
      the second will wait forever for the first one to release it. i.e., a
      classic deadlock.
      
      This problem was solved for a similar case (suspend/resume) back in
      commit 6b0d9a84 ("mtd: nand: fix multi-chip suspend problem"), and
      the shutdown state really isn't much different for us, so rather than
      adding a new special case to nand_get_device(), we can just overload the
      FL_PM_SUSPENDED state.
      
      Now, multiple chips can "get" the same controller lock (preventing
      further I/O), while we still allow other chips to pass through
      nand_shutdown().
      
      Original report:
      http://thread.gmane.org/gmane.linux.drivers.mtd/59726
      http://lists.infradead.org/pipermail/linux-mtd/2015-July/059992.html
      
      Fixes: 72ea4036 ("mtd: nand: added nand_shutdown")
      Reported-by: default avatarAndrew E. Mileski <andrewm@isoar.ca>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Andrew E. Mileski <andrewm@isoar.ca>
      Acked-by: default avatarScott Branden <sbranden@broadcom.com>
      Reviewed-by: default avatarBoris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7897886a
    • Brian Norris's avatar
      mtd: blkdevs: fix potential deadlock + lockdep warnings · f4d9ae7a
      Brian Norris authored
      commit f3c63795 upstream.
      
      Commit 073db4a5 ("mtd: fix: avoid race condition when accessing
      mtd->usecount") fixed a race condition but due to poor ordering of the
      mutex acquisition, introduced a potential deadlock.
      
      The deadlock can occur, for example, when rmmod'ing the m25p80 module, which
      will delete one or more MTDs, along with any corresponding mtdblock
      devices. This could potentially race with an acquisition of the block
      device as follows.
      
       -> blktrans_open()
          ->  mutex_lock(&dev->lock);
          ->  mutex_lock(&mtd_table_mutex);
      
       -> del_mtd_device()
          ->  mutex_lock(&mtd_table_mutex);
          ->  blktrans_notify_remove() -> del_mtd_blktrans_dev()
             ->  mutex_lock(&dev->lock);
      
      This is a classic (potential) ABBA deadlock, which can be fixed by
      making the A->B ordering consistent everywhere. There was no real
      purpose to the ordering in the original patch, AFAIR, so this shouldn't
      be a problem. This ordering was actually already present in
      del_mtd_blktrans_dev(), for one, where the function tried to ensure that
      its caller already held mtd_table_mutex before it acquired &dev->lock:
      
              if (mutex_trylock(&mtd_table_mutex)) {
                      mutex_unlock(&mtd_table_mutex);
                      BUG();
              }
      
      So, reverse the ordering of acquisition of &dev->lock and &mtd_table_mutex so
      we always acquire mtd_table_mutex first.
      
      Snippets of the lockdep output follow:
      
        # modprobe -r m25p80
        [   53.419251]
        [   53.420838] ======================================================
        [   53.427300] [ INFO: possible circular locking dependency detected ]
        [   53.433865] 4.3.0-rc6 #96 Not tainted
        [   53.437686] -------------------------------------------------------
        [   53.444220] modprobe/372 is trying to acquire lock:
        [   53.449320]  (&new->lock){+.+...}, at: [<c043fe4c>] del_mtd_blktrans_dev+0x80/0xdc
        [   53.457271]
        [   53.457271] but task is already holding lock:
        [   53.463372]  (mtd_table_mutex){+.+.+.}, at: [<c0439994>] del_mtd_device+0x18/0x100
        [   53.471321]
        [   53.471321] which lock already depends on the new lock.
        [   53.471321]
        [   53.479856]
        [   53.479856] the existing dependency chain (in reverse order) is:
        [   53.487660]
        -> #1 (mtd_table_mutex){+.+.+.}:
        [   53.492331]        [<c043fc5c>] blktrans_open+0x34/0x1a4
        [   53.497879]        [<c01afce0>] __blkdev_get+0xc4/0x3b0
        [   53.503364]        [<c01b0bb8>] blkdev_get+0x108/0x320
        [   53.508743]        [<c01713c0>] do_dentry_open+0x218/0x314
        [   53.514496]        [<c0180454>] path_openat+0x4c0/0xf9c
        [   53.519959]        [<c0182044>] do_filp_open+0x5c/0xc0
        [   53.525336]        [<c0172758>] do_sys_open+0xfc/0x1cc
        [   53.530716]        [<c000f740>] ret_fast_syscall+0x0/0x1c
        [   53.536375]
        -> #0 (&new->lock){+.+...}:
        [   53.540587]        [<c063f124>] mutex_lock_nested+0x38/0x3cc
        [   53.546504]        [<c043fe4c>] del_mtd_blktrans_dev+0x80/0xdc
        [   53.552606]        [<c043f164>] blktrans_notify_remove+0x7c/0x84
        [   53.558891]        [<c04399f0>] del_mtd_device+0x74/0x100
        [   53.564544]        [<c043c670>] del_mtd_partitions+0x80/0xc8
        [   53.570451]        [<c0439aa0>] mtd_device_unregister+0x24/0x48
        [   53.576637]        [<c046ce6c>] spi_drv_remove+0x1c/0x34
        [   53.582207]        [<c03de0f0>] __device_release_driver+0x88/0x114
        [   53.588663]        [<c03de19c>] device_release_driver+0x20/0x2c
        [   53.594843]        [<c03dd9e8>] bus_remove_device+0xd8/0x108
        [   53.600748]        [<c03dacc0>] device_del+0x10c/0x210
        [   53.606127]        [<c03dadd0>] device_unregister+0xc/0x20
        [   53.611849]        [<c046d878>] __unregister+0x10/0x20
        [   53.617211]        [<c03da868>] device_for_each_child+0x50/0x7c
        [   53.623387]        [<c046eae8>] spi_unregister_master+0x58/0x8c
        [   53.629578]        [<c03e12f0>] release_nodes+0x15c/0x1c8
        [   53.635223]        [<c03de0f8>] __device_release_driver+0x90/0x114
        [   53.641689]        [<c03de900>] driver_detach+0xb4/0xb8
        [   53.647147]        [<c03ddc78>] bus_remove_driver+0x4c/0xa0
        [   53.652970]        [<c00cab50>] SyS_delete_module+0x11c/0x1e4
        [   53.658976]        [<c000f740>] ret_fast_syscall+0x0/0x1c
        [   53.664621]
        [   53.664621] other info that might help us debug this:
        [   53.664621]
        [   53.672979]  Possible unsafe locking scenario:
        [   53.672979]
        [   53.679169]        CPU0                    CPU1
        [   53.683900]        ----                    ----
        [   53.688633]   lock(mtd_table_mutex);
        [   53.692383]                                lock(&new->lock);
        [   53.698306]                                lock(mtd_table_mutex);
        [   53.704658]   lock(&new->lock);
        [   53.707946]
        [   53.707946]  *** DEADLOCK ***
      
      Fixes: 073db4a5 ("mtd: fix: avoid race condition when accessing mtd->usecount")
      Reported-by: default avatarFelipe Balbi <balbi@ti.com>
      Tested-by: default avatarFelipe Balbi <balbi@ti.com>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4d9ae7a
    • Boris BREZILLON's avatar
      mtd: mtdpart: fix add_mtd_partitions error path · b5a93606
      Boris BREZILLON authored
      commit e5bae867 upstream.
      
      If we fail to allocate a partition structure in the middle of the partition
      creation process, the already allocated partitions are never removed, which
      means they are still present in the partition list and their resources are
      never freed.
      Signed-off-by: default avatarBoris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5a93606
    • Dmitry Kasatkin's avatar
      integrity: prevent loading untrusted certificates on the IMA trusted keyring · e1890d8f
      Dmitry Kasatkin authored
      commit 72e1eed8 upstream.
      
      If IMA_LOAD_X509 is enabled, either directly or indirectly via
      IMA_APPRAISE_SIGNED_INIT, certificates are loaded onto the IMA
      trusted keyring by the kernel via key_create_or_update(). When
      the KEY_ALLOC_TRUSTED flag is provided, certificates are loaded
      without first verifying the certificate is properly signed by a
      trusted key on the system keyring.  This patch removes the
      KEY_ALLOC_TRUSTED flag.
      Signed-off-by: default avatarDmitry Kasatkin <dmitry.kasatkin@huawei.com>
      Signed-off-by: default avatarMimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e1890d8f
    • Jarkko Sakkinen's avatar
      TPM: revert the list handling logic fixed in 398a1e71 · 30b258e6
      Jarkko Sakkinen authored
      commit b1a4144a upstream.
      
      Mimi reported that afb5abc2 reverts the fix in 398a1e71. This patch
      reverts it back.
      
      Fixes: afb5abc2 ("tpm: two-phase chip management functions")
      Reported-by: default avatarMimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: default avatarJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Acked-by: default avatarPeter Huewe <PeterHuewe@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30b258e6
    • Martin Wilck's avatar
      tpm_tis: free irq after probing · 948670bc
      Martin Wilck authored
      commit 2aef9da6 upstream.
      
      Release IRQs used for probing only. Otherwise the TPM will end up
      with all IRQs 3-15 assigned.
      
      Fixes: afb5abc2 ("tpm: two-phase chip management functions")
      Signed-off-by: default avatarMartin Wilck <Martin.Wilck@ts.fujitsu.com>
      Reviewed-by: default avatarJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Tested-by: default avatarJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: default avatarJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Acked-by: default avatarPeter Huewe <PeterHuewe@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      948670bc
    • Hon Ching \(Vicky\) Lo's avatar
      vTPM: fix memory allocation flag for rtce buffer at kernel boot · d75b72ac
      Hon Ching \(Vicky\) Lo authored
      commit 60ecd86c upstream.
      
      At ibm vtpm initialzation, tpm_ibmvtpm_probe() registers its interrupt
      handler, ibmvtpm_interrupt, which calls ibmvtpm_crq_process to allocate
      memory for rtce buffer.  The current code uses 'GFP_KERNEL' as the
      type of kernel memory allocation, which resulted a warning at
      kernel/lockdep.c.  This patch uses 'GFP_ATOMIC' instead so that the
      allocation is high-priority and does not sleep.
      Signed-off-by: default avatarHon Ching(Vicky) Lo <honclo@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Huewe <peterhuewe@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d75b72ac
    • Jarkko Sakkinen's avatar
      tpm, tpm_crb: fix unaligned read of the command buffer address · dfa9a12b
      Jarkko Sakkinen authored
      commit 149789ce upstream.
      
      The command buffer address must be read with exactly two 32-bit reads.
      Otherwise, on some HW platforms, it seems that HW will abort the read
      operation, which causes CPU to fill the read bytes with 1's. Therefore,
      we cannot rely on memcpy_fromio() but must call ioread32() two times
      instead.
      
      Also, this matches the PC Client Platform TPM Profile specification,
      which defines command buffer address with two 32-bit fields.
      Signed-off-by: default avatarJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Reviewed-by: default avatarPeter Huewe <peterhuewe@gmx.de>
      Signed-off-by: default avatarPeter Huewe <peterhuewe@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dfa9a12b
    • Ricardo Ribalda Delgado's avatar
      spi/spi-xilinx: Fix race condition on last word read · 0b30764f
      Ricardo Ribalda Delgado authored
      commit eca37c7c upstream.
      
      Some users have reported that in polled mode the driver fails randomly
      to read the last word of the transfer.
      
      The end condition used for the transmissions (in polled and irq mode)
      has been the TX_EMPTY flag. But Lars-Peter Clausen has identified a delay
      from the TX_EMPTY to the actual end of the data rx.
      
      I believe that this race condition has not been detected until now
      because of the latency added by the IRQ handler or the PCIe bridge.
      This bugs affects setups with low latency access to the spi core.
      
      This patch replaces the readout logic:
      
      For all the words, except the last one, the TX_EMPTY flag is used (and
      cached).
      
      If !TX_EMPY or is the last word. The status register is read and the
      RX_EMPTY flag is used.
      
      The performance is not affected: there is an extra read of the
      Status Register, but the readout can start as soon as there is a word
      in the buffer.
      Reported-by: default avatarEdward Kigwana <ekigwana@scires.com>
      Initial-fix-by: default avatarLars-Peter Clausen <lars@metafoo.de>
      Signed-off-by: default avatarRicardo Ribalda Delgado <ricardo.ribalda@gmail.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0b30764f
    • Uri Mashiach's avatar
      wlcore/wl12xx: spi: fix NULL pointer dereference (Oops) · 6ed15766
      Uri Mashiach authored
      commit e47301b0 upstream.
      
      Fix the below Oops when trying to modprobe wlcore_spi.
      The oops occurs because the wl1271_power_{off,on}()
      function doesn't check the power() function pointer.
      
      [   23.401447] Unable to handle kernel NULL pointer dereference at
      virtual address 00000000
      [   23.409954] pgd = c0004000
      [   23.412922] [00000000] *pgd=00000000
      [   23.416693] Internal error: Oops: 80000007 [#1] SMP ARM
      [   23.422168] Modules linked in: wl12xx wlcore mac80211 cfg80211
      musb_dsps musb_hdrc usbcore usb_common snd_soc_simple_card evdev joydev
      omap_rng wlcore_spi snd_soc_tlv320aic23_i2c rng_core snd_soc_tlv320aic23
      c_can_platform c_can can_dev snd_soc_davinci_mcasp snd_soc_edma
      snd_soc_omap omap_wdt musb_am335x cpufreq_dt thermal_sys hwmon
      [   23.453253] CPU: 0 PID: 36 Comm: kworker/0:2 Not tainted
      4.2.0-00002-g951efee-dirty #233
      [   23.461720] Hardware name: Generic AM33XX (Flattened Device Tree)
      [   23.468123] Workqueue: events request_firmware_work_func
      [   23.473690] task: de32efc0 ti: de4ee000 task.ti: de4ee000
      [   23.479341] PC is at 0x0
      [   23.482112] LR is at wl12xx_set_power_on+0x28/0x124 [wlcore]
      [   23.488074] pc : [<00000000>]    lr : [<bf2581f0>]    psr: 60000013
      [   23.488074] sp : de4efe50  ip : 00000002  fp : 00000000
      [   23.500162] r10: de7cdd00  r9 : dc848800  r8 : bf27af00
      [   23.505663] r7 : bf27a1a8  r6 : dcbd8a80  r5 : dce0e2e0  r4 :
      dce0d2e0
      [   23.512536] r3 : 00000000  r2 : 00000000  r1 : 00000001  r0 :
      dc848810
      [   23.519412] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
      Segment kernel
      [   23.527109] Control: 10c5387d  Table: 9cb78019  DAC: 00000015
      [   23.533160] Process kworker/0:2 (pid: 36, stack limit = 0xde4ee218)
      [   23.539760] Stack: (0xde4efe50 to 0xde4f0000)
      
      [...]
      
      [   23.665030] [<bf2581f0>] (wl12xx_set_power_on [wlcore]) from
      [<bf25f7ac>] (wlcore_nvs_cb+0x118/0xa4c [wlcore])
      [   23.675604] [<bf25f7ac>] (wlcore_nvs_cb [wlcore]) from [<c04387ec>]
      (request_firmware_work_func+0x30/0x58)
      [   23.685784] [<c04387ec>] (request_firmware_work_func) from
      [<c0058e2c>] (process_one_work+0x1b4/0x4b4)
      [   23.695591] [<c0058e2c>] (process_one_work) from [<c0059168>]
      (worker_thread+0x3c/0x4a4)
      [   23.704124] [<c0059168>] (worker_thread) from [<c005ee68>]
      (kthread+0xd4/0xf0)
      [   23.711747] [<c005ee68>] (kthread) from [<c000f598>]
      (ret_from_fork+0x14/0x3c)
      [   23.719357] Code: bad PC value
      [   23.722760] ---[ end trace 981be8510db9b3a9 ]---
      
      Prevent oops by validationg power() pointer value before
      calling the function.
      Signed-off-by: default avatarUri Mashiach <uri.mashiach@compulab.co.il>
      Acked-by: default avatarIgor Grinberg <grinberg@compulab.co.il>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ed15766
    • Uri Mashiach's avatar
      wlcore/wl12xx: spi: fix oops on firmware load · 8837f9bd
      Uri Mashiach authored
      commit 9b2761cb upstream.
      
      The maximum chunks used by the function is
      (SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE + 1).
      The original commands array had space for
      (SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE) commands.
      When the last chunk is used (len > 4 * WSPI_MAX_CHUNK_SIZE), the last
      command is stored outside the bounds of the commands array.
      
      Oops 5 (page fault) is generated during current wl1271 firmware load
      attempt:
      
      root@debian-armhf:~# ifconfig wlan0 up
      [  294.312399] Unable to handle kernel paging request at virtual address
      00203fc4
      [  294.320173] pgd = de528000
      [  294.323028] [00203fc4] *pgd=00000000
      [  294.326916] Internal error: Oops: 5 [#1] SMP ARM
      [  294.331789] Modules linked in: bnep rfcomm bluetooth ipv6 arc4 wl12xx
      wlcore mac80211 musb_dsps cfg80211 musb_hdrc usbcore usb_common
      wlcore_spi omap_rng rng_core musb_am335x omap_wdt cpufreq_dt thermal_sys
      hwmon
      [  294.351838] CPU: 0 PID: 1827 Comm: ifconfig Not tainted
      4.2.0-00002-g3e9ad27-dirty #78
      [  294.360154] Hardware name: Generic AM33XX (Flattened Device Tree)
      [  294.366557] task: dc9d6d40 ti: de550000 task.ti: de550000
      [  294.372236] PC is at __spi_validate+0xa8/0x2ac
      [  294.376902] LR is at __spi_sync+0x78/0x210
      [  294.381200] pc : [<c049c760>]    lr : [<c049ebe0>]    psr: 60000013
      [  294.381200] sp : de551998  ip : de5519d8  fp : 00200000
      [  294.393242] r10: de551c8c  r9 : de5519d8  r8 : de3a9000
      [  294.398730] r7 : de3a9258  r6 : de3a9400  r5 : de551a48  r4 :
      00203fbc
      [  294.405577] r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 :
      de3a9000
      [  294.412420] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
      Segment user
      [  294.419918] Control: 10c5387d  Table: 9e528019  DAC: 00000015
      [  294.425954] Process ifconfig (pid: 1827, stack limit = 0xde550218)
      [  294.432437] Stack: (0xde551998 to 0xde552000)
      
      ...
      
      [  294.883613] [<c049c760>] (__spi_validate) from [<c049ebe0>]
      (__spi_sync+0x78/0x210)
      [  294.891670] [<c049ebe0>] (__spi_sync) from [<bf036598>]
      (wl12xx_spi_raw_write+0xfc/0x148 [wlcore_spi])
      [  294.901661] [<bf036598>] (wl12xx_spi_raw_write [wlcore_spi]) from
      [<bf21c694>] (wlcore_boot_upload_firmware+0x1ec/0x458 [wlcore])
      [  294.914038] [<bf21c694>] (wlcore_boot_upload_firmware [wlcore]) from
      [<bf24532c>] (wl12xx_boot+0xc10/0xfac [wl12xx])
      [  294.925161] [<bf24532c>] (wl12xx_boot [wl12xx]) from [<bf20d5cc>]
      (wl1271_op_add_interface+0x5b0/0x910 [wlcore])
      [  294.936364] [<bf20d5cc>] (wl1271_op_add_interface [wlcore]) from
      [<bf15c4ac>] (ieee80211_do_open+0x44c/0xf7c [mac80211])
      [  294.947963] [<bf15c4ac>] (ieee80211_do_open [mac80211]) from
      [<c0537978>] (__dev_open+0xa8/0x110)
      [  294.957307] [<c0537978>] (__dev_open) from [<c0537bf8>]
      (__dev_change_flags+0x88/0x148)
      [  294.965713] [<c0537bf8>] (__dev_change_flags) from [<c0537cd0>]
      (dev_change_flags+0x18/0x48)
      [  294.974576] [<c0537cd0>] (dev_change_flags) from [<c05a55a0>]
      (devinet_ioctl+0x6b4/0x7d0)
      [  294.983191] [<c05a55a0>] (devinet_ioctl) from [<c0517040>]
      (sock_ioctl+0x1e4/0x2bc)
      [  294.991244] [<c0517040>] (sock_ioctl) from [<c017d378>]
      (do_vfs_ioctl+0x420/0x6b0)
      [  294.999208] [<c017d378>] (do_vfs_ioctl) from [<c017d674>]
      (SyS_ioctl+0x6c/0x7c)
      [  295.006880] [<c017d674>] (SyS_ioctl) from [<c000f4c0>]
      (ret_fast_syscall+0x0/0x54)
      [  295.014835] Code: e1550004 e2444034 0a00007d e5953018 (e5942008)
      [  295.021544] ---[ end trace 66ed188198f4e24e ]---
      Signed-off-by: default avatarUri Mashiach <uri.mashiach@compulab.co.il>
      Acked-by: default avatarIgor Grinberg <grinberg@compulab.co.il>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8837f9bd
    • Johan Hovold's avatar
      spi: fix parent-device reference leak · ed4c745d
      Johan Hovold authored
      commit 157f38f9 upstream.
      
      Fix parent-device reference leak due to SPI-core taking an unnecessary
      reference to the parent when allocating the master structure, a
      reference that was never released.
      
      Note that driver core takes its own reference to the parent when the
      master device is registered.
      
      Fixes: 49dce689 ("spi doesn't need class_device")
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed4c745d
    • Vignesh R's avatar
      spi: ti-qspi: Fix data corruption seen on r/w stress test · 6edfb956
      Vignesh R authored
      commit bc27a539 upstream.
      
      Writing invalid command to QSPI_SPI_CMD_REG will terminate current
      transfer and de-assert the chip select. This has to be done before
      calling spi_finalize_current_message(). Because
      spi_finalize_current_message() will mark the end of current message
      transfer and schedule the next transfer. If the chipselect is not
      de-asserted before calling spi_finalize_current_message() then the next
      transfer will overlap with the previous transfer leading to data
      corruption.
      __spi_pump_message() can be called either from kthread worker context or
      directly from the calling process's context. It is possible that these
      two calls can race against each other. But race is serialized by
      checking whether master->cur_msg == NULL (pointer to msg being handled
      by transfer_one() at present). The master->cur_msg is set to NULL when
      spi_finalize_current_message() is called on that message, which means
      calling spi_finalize_current_message() allows __spi_sync() to pump next
      message in calling process context.
      Now if spi-ti-qspi calls spi_finalize_current_message() before we
      terminate transfer at hardware side, if __spi_pump_message() is called
      from process context then the successive transactions can overlap.
      
      Fix this by moving writing invalid command to QSPI_SPI_CMD_REG to
      before calling spi_finalize_current_message() call.
      Signed-off-by: default avatarVignesh R <vigneshr@ti.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6edfb956
    • David Mosberger-Tang's avatar
      spi: atmel: Fix DMA-setup for transfers with more than 8 bits per word · d487e49b
      David Mosberger-Tang authored
      commit 06515f83 upstream.
      
      The DMA-slave configuration depends on the whether <= 8 or > 8 bits
      are transferred per word, so we need to call
      atmel_spi_dma_slave_config() with the correct value.
      Signed-off-by: default avatarDavid Mosberger <davidm@egauge.net>
      Signed-off-by: default avatarNicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d487e49b
    • Neil Armstrong's avatar
      spi: omap2-mcspi: disable other channels CHCONF_FORCE in prepare_message · 865018e2
      Neil Armstrong authored
      commit 468a3208 upstream.
      
      Since the "Switch driver to use transfer_one" change, the cs_change
      behavior has changed and a channel chip select can still be
      asserted when changing channel from a previous last transfer in a
      message having the cs_change attribute.
      
      Since there is no sense having multiple chip select being asserted at the
      same time, disable all the remaining forced chip selects in a the
      prepare_message called right before a spi_transfer_one_message call.
      It ignores the current channel configuration in order to keep the
      possibility to leave the chip select asserted between messages.
      
      It fixes this bug on a DM8168 SoC ES2.1 Soc and an OMAP4 ES2.1 SoC.
      It was hanging all the other channels transfers when a CHCONF_FORCE
      is present on the wrong channel.
      
      Fixes: b28cb941 ("spi: omap2-mcspi: Switch driver to use transfer_one")
      Signed-off-by: default avatarNeil Armstrong <narmstrong@baylibre.com>
      Reviewed-by: default avatarMichael Welling <mwelling@ieee.org>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      865018e2
    • Mauricio Faria de Oliveira's avatar
      Revert "dm mpath: fix stalls when handling invalid ioctls" · 3662b6cf
      Mauricio Faria de Oliveira authored
      commit 47796938 upstream.
      
      This reverts commit a1989b33.
      
      That commit introduced a regression at least for the case of the SG_IO ioctl()
      running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
      are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
      than blocking due to queue_if_no_path until a path becomes active, for example.
      
      That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
      (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
      from multipath devices; which leads to SCSI/filesystem errors in such a guest.
      
      More general scenarios can hit that regression too. The following demonstration
      employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
      (some output & user changes omitted for brevity and comments added for clarity).
      
      Reverting that commit restores normal operation (queueing) in failing scenarios;
      tested on linux-next (next-20151022).
      
      1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)
      
          $ cat sg_simple0.c
          ... see [3] ...
          $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
          $ gcc sgio_inquiry.c -o sgio_inquiry
      
      2) The ioctl() works fine with active paths present.
      
          # multipath -l 85ag56
          85ag56 (...) dm-19 IBM     ,2145
          size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
          |-+- policy='service-time 0' prio=0 status=active
          | |- 8:0:11:0  sdz  65:144  active undef running
          | `- 9:0:9:0   sdbf 67:144  active undef running
          `-+- policy='service-time 0' prio=0 status=enabled
            |- 8:0:12:0  sdae 65:224  active undef running
            `- 9:0:12:0  sdbo 68:32   active undef running
      
          $ ./sgio_inquiry /dev/mapper/85ag56
          Some of the INQUIRY command's response:
              IBM       2145              0000
          INQUIRY duration=0 millisecs, resid=0
      
      3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
         for unprivileged users (rather than blocking due to queue_if_no_path).
      
          # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
                do multipathd -k"fail path $path"; done
      
          # multipath -l 85ag56
          85ag56 (...) dm-19 IBM     ,2145
          size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
          |-+- policy='service-time 0' prio=0 status=enabled
          | |- 8:0:11:0  sdz  65:144  failed undef running
          | `- 9:0:9:0   sdbf 67:144  failed undef running
          `-+- policy='service-time 0' prio=0 status=enabled
            |- 8:0:12:0  sdae 65:224  failed undef running
            `- 9:0:12:0  sdbo 68:32   failed undef running
      
          $ ./sgio_inquiry /dev/mapper/85ag56
          sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
      
      4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
         it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
      
          $ dmesg
          <...>
          [] device-mapper: multipath: Failing path 65:144.
          [] device-mapper: multipath: Failing path 67:144.
          [] device-mapper: multipath: Failing path 65:224.
          [] device-mapper: multipath: Failing path 68:32.
          [] sgio_inquiry: sending ioctl 2285 to a partition!
      
      5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
         (then queueing happens -- in this example, queue_if_no_path is set);
         this is due to a conditional check in scsi_verify_blk_ioctl().
      
          # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
          sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
      
          # ./sgio_inquiry /dev/mapper/85ag56 &
          [1] 72830
      
          # cat /proc/72830/stack
          [<c00000171c0df700>] 0xc00000171c0df700
          [<c000000000015934>] __switch_to+0x204/0x350
          [<c000000000152d4c>] msleep+0x5c/0x80
          [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
          [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
          [<c0000000003128e4>] block_ioctl+0x64/0xd0
          [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
          [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
          [<c000000000009358>] system_call+0x38/0xd0
      
      6) This is the function call chain exercised in this analysis:
      
      SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
          -> do_vfs_ioctl()
              -> vfs_ioctl()
                  ...
                  error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
                  ...
                      -> dm_blk_ioctl() @ drivers/md/dm.c
                          -> multipath_ioctl() @ drivers/md/dm-mpath.c
                              ...
                              (bdev = NULL, due to no active paths)
                              ...
                              if (!bdev || <...>) {
                                  int err = scsi_verify_blk_ioctl(NULL, cmd);
                                  if (err)
                                      r = err;
                              }
                              ...
                                  -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                      ...
                                      if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                          return 0;
                                      ...
                                      if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                          return 0;
                                      ...
                                      printk_ratelimited(KERN_WARNING
                                                 "%s: sending ioctl %x to a partition!\n" <...>);
      
                                      return -ENOIOCTLCMD;
                                  <-
                              ...
                              return r ? : <...>
                          <-
                  ...
                  if (error == -ENOIOCTLCMD)
                      error = -ENOTTY;
                   out:
                      return error;
                  ...
      
      Links:
      [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
      [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
      [3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)
      Signed-off-by: default avatarMauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3662b6cf
    • Mikulas Patocka's avatar
      dm: initialize non-blk-mq queue data before queue is used · ac44b98e
      Mikulas Patocka authored
      commit ad5f498f upstream.
      
      Commit bfebd1cd ("dm: add full blk-mq
      support to request-based DM") moves the initialization of the fields
      backing_dev_info.congested_fn, backing_dev_info.congested_data and
      queuedata from the function dm_init_md_queue (that is called when the
      device is created) to dm_init_old_md_queue (that is called after the
      device type is determined).
      
      There is no locking when accessing these variables, thus it is possible
      for other parts of the kernel to briefly see this data in a transient
      state (e.g. queue->backing_dev_info.congested_fn initialized and
      md->queue->backing_dev_info.congested_data uninitialized, resulting in
      passing an incorrect parameter to the function dm_any_congested).
      
      This queue data is left initialized for blk-mq devices even though they
      that don't use it.
      
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac44b98e
    • Dmitry V. Levin's avatar
      sh64: fix __NR_fgetxattr · 98216c7c
      Dmitry V. Levin authored
      commit 2d33fa10 upstream.
      
      According to arch/sh/kernel/syscalls_64.S and common sense, __NR_fgetxattr
      has to be defined to 259, but it doesn't.  Instead, it's defined to 269,
      which is of course used by another syscall, __NR_sched_setaffinity in this
      case.
      
      This bug was found by strace test suite.
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Acked-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      98216c7c
    • xuejiufei's avatar
      ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup · c6a5eba5
      xuejiufei authored
      commit c95a5180 upstream.
      
      When recovery master down, dlm_do_local_recovery_cleanup() only remove
      the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
      Which will make umount thread falling in dead loop migrating $RECOVERY
      to the dead node.
      Signed-off-by: default avatarxuejiufei <xuejiufei@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6a5eba5
    • xuejiufei's avatar
      ocfs2/dlm: ignore cleaning the migration mle that is inuse · 62e3e821
      xuejiufei authored
      commit bef5502d upstream.
      
      We have found that migration source will trigger a BUG that the refcount
      of mle is already zero before put when the target is down during
      migration.  The situation is as follows:
      
      dlm_migrate_lockres
        dlm_add_migration_mle
        dlm_mark_lockres_migrating
        dlm_get_mle_inuse
        <<<<<< Now the refcount of the mle is 2.
        dlm_send_one_lockres and wait for the target to become the
        new master.
        <<<<<< o2hb detect the target down and clean the migration
        mle. Now the refcount is 1.
      
      dlm_migrate_lockres woken, and put the mle twice when found the target
      goes down which trigger the BUG with the following message:
      
        "ERROR: bad mle: ".
      Signed-off-by: default avatarJiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      62e3e821
    • Joseph Qi's avatar
      ocfs2: fix BUG when calculate new backup super · 959c944f
      Joseph Qi authored
      commit 5c9ee4cb upstream.
      
      When resizing, it firstly extends the last gd.  Once it should backup
      super in the gd, it calculates new backup super and update the
      corresponding value.
      
      But it currently doesn't consider the situation that the backup super is
      already done.  And in this case, it still sets the bit in gd bitmap and
      then decrease from bg_free_bits_count, which leads to a corrupted gd and
      trigger the BUG in ocfs2_block_group_set_bits:
      
          BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);
      
      So check whether the backup super is done and then do the updates.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarJiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      959c944f
    • Junxiao Bi's avatar
      ocfs2: fix SGID not inherited issue · c553fd80
      Junxiao Bi authored
      commit 854ee2e9 upstream.
      
      Commit 8f1eb487 ("ocfs2: fix umask ignored issue") introduced an
      issue, SGID of sub dir was not inherited from its parents dir.  It is
      because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
      is overwritten by "mode" which don't have SGID set later.
      
      Fixes: 8f1eb487 ("ocfs2: fix umask ignored issue")
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Acked-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c553fd80
    • Mike Kravetz's avatar
      mm/hugetlb.c: fix resv map memory leak for placeholder entries · 11445dc8
      Mike Kravetz authored
      commit dbe409e4 upstream.
      
      Dmitry Vyukov reported the following memory leak
      
      unreferenced object 0xffff88002eaafd88 (size 32):
        comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
        hex dump (first 32 bytes):
          28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff  (.Nc....(.Nc....
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           kmalloc include/linux/slab.h:458
           region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
           __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
           vma_needs_reservation mm/hugetlb.c:1813
           alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
           hugetlb_no_page mm/hugetlb.c:3543
           hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
           follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
           __get_user_pages+0x542/0xf30 mm/gup.c:497
           populate_vma_page_range+0xde/0x110 mm/gup.c:919
           __mm_populate+0x1c7/0x310 mm/gup.c:969
           do_mlock+0x291/0x360 mm/mlock.c:637
           SYSC_mlock2 mm/mlock.c:658
           SyS_mlock2+0x4b/0x70 mm/mlock.c:648
      
      Dmitry identified a potential memory leak in the routine region_chg,
      where a region descriptor is not free'ed on an error path.
      
      However, the root cause for the above memory leak resides in region_del.
      In this specific case, a "placeholder" entry is created in region_chg.
      The associated page allocation fails, and the placeholder entry is left
      in the reserve map.  This is "by design" as the entry should be deleted
      when the map is released.  The bug is in the region_del routine which is
      used to delete entries within a specific range (and when the map is
      released).  region_del did not handle the case where a placeholder entry
      exactly matched the start of the range range to be deleted.  In this
      case, the entry would not be deleted and leaked.  The fix is to take
      these special placeholder entries into account in region_del.
      
      The region_chg error path leak is also fixed.
      
      Fixes: feba16e2 ("mm/hugetlb: add region_del() to delete a specific range of entries")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      11445dc8
    • Richard Weinberger's avatar
      kernel/signal.c: unexport sigsuspend() · 0440e1ce
      Richard Weinberger authored
      commit 9d8a7652 upstream.
      
      sigsuspend() is nowhere used except in signal.c itself, so we can mark it
      static do not pollute the global namespace.
      
      But this patch is more than a boring cleanup patch, it fixes a real issue
      on UserModeLinux.  UML has a special console driver to display ttys using
      xterm, or other terminal emulators, on the host side.  Vegard reported
      that sometimes UML is unable to spawn a xterm and he's facing the
      following warning:
      
        WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()
      
      It turned out that this warning makes absolutely no sense as the UML
      xterm code calls sigsuspend() on the host side, at least it tries.  But
      as the kernel itself offers a sigsuspend() symbol the linker choose this
      one instead of the glibc wrapper.  Interestingly this code used to work
      since ever but always blocked signals on the wrong side.  Some recent
      kernel change made the WARN_ON() trigger and uncovered the bug.
      
      It is a wonderful example of how much works by chance on computers. :-)
      
      Fixes: 68f3f16d ("new helper: sigsuspend()")
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Tested-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0440e1ce
    • Naoya Horiguchi's avatar
      mm: hugetlb: call huge_pte_alloc() only if ptep is null · ceb9eea3
      Naoya Horiguchi authored
      commit 0d777df5 upstream.
      
      Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
      and check whether the obtained *ptep is a migration/hwpoison entry or
      not.  And if not, then we get to call huge_pte_alloc().  This is racy
      because the *ptep could turn into migration/hwpoison entry after the
      huge_pte_offset() check.  This race results in BUG_ON in
      huge_pte_alloc().
      
      We don't have to call huge_pte_alloc() when the huge_pte_offset()
      returns non-NULL, so let's fix this bug with moving the code into else
      block.
      
      Note that the *ptep could turn into a migration/hwpoison entry after
      this block, but that's not a problem because we have another
      !pte_present check later (we never go into hugetlb_no_page() in that
      case.)
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ceb9eea3
    • OGAWA Hirofumi's avatar
      fat: fix fake_offset handling on error path · f3fc333b
      OGAWA Hirofumi authored
      commit 928a4771 upstream.
      
      For the root directory, .  and ..  are faked (using dir_emit_dots()) and
      ctx->pos is reset from 2 to 0.
      
      A corrupted root directory could cause fat_get_entry() to fail, but
      ->iterate() (fat_readdir()) reports progress to the VFS (with ctx->pos
      rewound to 0), so any following calls to ->iterate() continue to return
      the same entries again and again.
      
      The result is that userspace will never see the end of the directory,
      causing e.g.  'ls' to hang in a getdents() loop.
      
      [hirofumi@mail.parknet.co.jp: cleanup and make sure to correct fake_offset]
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Tested-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarRichard Weinberger <richard.weinberger@gmail.com>
      Signed-off-by: default avatarOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f3fc333b
    • Mike Kravetz's avatar
      mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes · 525893dd
      Mike Kravetz authored
      commit 1817889e upstream.
      
      Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
      punch code.  These problems are in the routine remove_inode_hugepages and
      mostly occur in the case where there are holes in the range of pages to be
      removed.  These holes could be the result of a previous hole punch or
      simply sparse allocation.  The current code could access pages outside the
      specified range.
      
      remove_inode_hugepages handles both hole punch and truncate operations.
      Page index handling was fixed/cleaned up so that the loop index always
      matches the page being processed.  The code now only makes a single pass
      through the range of pages as it was determined page faults could not race
      with truncate.  A cond_resched() was added after removing up to
      PAGEVEC_SIZE pages.
      
      Some totally unnecessary code in hugetlbfs_fallocate() that remained from
      early development was also removed.
      
      Tested with fallocate tests submitted here:
      http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
      And, some ftruncate tests under development
      
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      525893dd
    • Michal Hocko's avatar
      mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 16c61891
      Michal Hocko authored
      commit 373ccbe5 upstream.
      
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by internal dependency of the direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stalled numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies that worker would have to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if the one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let wq concurrency code know that we are in trouble so we have to do a
      short sleep.  In order to prevent from issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep only to worker threads which are
      the ones of the interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16c61891
    • Naoya Horiguchi's avatar
      mm: hugetlb: fix hugepage memory leak caused by wrong reserve count · 3f5fae4d
      Naoya Horiguchi authored
      commit a88c7695 upstream.
      
      When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
      alloc_buddy_huge_page() to directly create a hugepage from the buddy
      allocator.
      
      In that case, however, if alloc_buddy_huge_page() succeeds we don't
      decrement h->resv_huge_pages, which means that successful
      hugetlb_fault() returns without releasing the reserve count.  As a
      result, subsequent hugetlb_fault() might fail despite that there are
      still free hugepages.
      
      This patch simply adds decrementing code on that code path.
      
      I reproduced this problem when testing v4.3 kernel in the following situation:
       - the test machine/VM is a NUMA system,
       - hugepage overcommiting is enabled,
       - most of hugepages are allocated and there's only one free hugepage
         which is on node 0 (for example),
       - another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
         node 1, tries to allocate a hugepage,
       - the allocation should fail but the reserve count is still hold.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f5fae4d
    • Michal Hocko's avatar
      memcg: fix thresholds for 32b architectures. · 01a79daa
      Michal Hocko authored
      commit c12176d3 upstream.
      
      Commit 424cdc14 ("memcg: convert threshold to bytes") has fixed a
      regression introduced by 3e32cb2e ("mm: memcontrol: lockless page
      counters") where thresholds were silently converted to use page units
      rather than bytes when interpreting the user input.
      
      The fix is not complete, though, as properly pointed out by Ben Hutchings
      during stable backport review.  The page count is converted to bytes but
      unsigned long is used to hold the value which would be obviously not
      sufficient for 32b systems with more than 4G thresholds.  The same applies
      to usage as taken from mem_cgroup_usage which might overflow.
      
      Let's remove this bytes vs.  pages internal tracking differences and
      handle thresholds in page units internally.  Chage mem_cgroup_usage() to
      return the value in page units and revert 424cdc14 because this should
      be sufficient for the consistent handling.  mem_cgroup_read_u64 as the
      only users of mem_cgroup_usage outside of the threshold handling code is
      converted to give the proper in bytes result.  It is doing that already
      for page_counter output so this is more consistent as well.
      
      The value presented to the userspace is still in bytes units.
      
      Fixes: 424cdc14 ("memcg: convert threshold to bytes")
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      From: Michal Hocko <mhocko@kernel.org>
      Subject: memcg: fix thresholds for 32b architectures.
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      From: Andrew Morton <akpm@linux-foundation.org>
      Subject: memcg: fix thresholds for 32b architectures.
      
      don't attempt to inline mem_cgroup_usage()
      
      The compiler ignores the inline anwyay.  And __always_inlining it adds 600
      bytes of goop to the .o file.
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01a79daa
    • Greg Thelen's avatar
      fs, seqfile: always allow oom killer · 56c5e58f
      Greg Thelen authored
      commit 0f930902 upstream.
      
      Since 5cec38ac ("fs, seq_file: fallback to vmalloc instead of oom kill
      processes") seq_buf_alloc() avoids calling the oom killer for PAGE_SIZE or
      smaller allocations; but larger allocations can use the oom killer via
      vmalloc().  Thus reads of small files can return ENOMEM, but larger files
      use the oom killer to avoid ENOMEM.
      
      The effect of this bug is that reads from /proc and other virtual
      filesystems can return ENOMEM instead of the preferred behavior - oom
      killing something (possibly the calling process).  I don't know of anyone
      except Google who has noticed the issue.
      
      I suspect the fix is more needed in smaller systems where there isn't any
      reclaimable memory.  But these seem like the kinds of systems which
      probably don't use the oom killer for production situations.
      
      Memory overcommit requires use of the oom killer to select a victim
      regardless of file size.
      
      Enable oom killer for small seq_buf_alloc() allocations.
      
      Fixes: 5cec38ac ("fs, seq_file: fallback to vmalloc instead of oom kill processes")
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      56c5e58f
    • Andy Shevchenko's avatar
      lib/hexdump.c: truncate output in case of overflow · 2343eaa3
      Andy Shevchenko authored
      commit 9f029f54 upstream.
      
      There is a classical off-by-one error in case when we try to place, for
      example, 1+1 bytes as hex in the buffer of size 6.  The expected result is
      to get an output truncated, but in the reality we get 6 bytes filed
      followed by terminating NUL.
      
      Change the logic how we fill the output in case of byte dumping into
      limited space.  This will follow the snprintf() behaviour by truncating
      output even on half bytes.
      
      Fixes: 114fc1af (hexdump: make it return number of bytes placed in buffer)
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@nokia.com>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@nokia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2343eaa3
    • Tetsuo Handa's avatar
      mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL · 83d79c0d
      Tetsuo Handa authored
      commit 426fb5e7 upstream.
      
      It was confirmed that a local unprivileged user can consume all memory
      reserves and hang up that system using time lag between the OOM killer
      sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for
      printk() inside for_each_process() loop at oom_kill_process() can consume
      many seconds when there are many thread groups sharing the same memory.
      
      Before starting oom-depleter process:
      
          Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
          Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
      
      As of invoking the OOM killer:
      
          Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
          Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
      
      Between the thread group leader got TIF_MEMDIE and receives SIGKILL:
      
          Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
          Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
      
      The oom-depleter's thread group leader which got TIF_MEMDIE started
      memset() in user space after the OOM killer set TIF_MEMDIE, and it was
      free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space
      until SIGKILL is delivered.  If SIGKILL is delivered before TIF_MEMDIE is
      set, the oom-depleter can terminate without touching memory reserves.
      
      Although the possibility of hitting this time lag is very small for 3.19
      and earlier kernels because TIF_MEMDIE is set immediately before sending
      SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
      step between and allow memory allocations which are not needed for
      terminating the OOM victim.
      
      Fixes: 83363b91 ("oom: make sure that TIF_MEMDIE is set under task_lock")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      83d79c0d
    • Catalin Marinas's avatar
      mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE · 64615901
      Catalin Marinas authored
      commit d4322d88 upstream.
      
      On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
      configurations defining ARCH_DMA_MINALIGN to 128), the first
      kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
      "kmalloc-128" with index 7.  Depending on the debug kernel configuration,
      sizeof(struct kmem_cache) can be larger than 128 resulting in an
      INDEX_NODE of 8.
      
      Commit 8fc9cf42 ("slab: make more slab management structure off the
      slab") enables off-slab management objects for sizes starting with
      PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
      of the "kmalloc-128" cache would try to place the management objects
      off-slab.  However, since KMALLOC_MIN_SIZE is already 128 and
      freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size)
      returns NULL (kmalloc_caches[7] not populated yet).  This triggers the
      following bug on arm64:
      
        kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
        Internal error: Oops - BUG: 0 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
        Hardware name: Juno (DT)
        PC is at __kmem_cache_create+0x21c/0x280
        LR is at __kmem_cache_create+0x210/0x280
        [...]
        Call trace:
          __kmem_cache_create+0x21c/0x280
          create_boot_cache+0x48/0x80
          create_kmalloc_cache+0x50/0x88
          create_kmalloc_caches+0x4c/0xf4
          kmem_cache_init+0x100/0x118
          start_kernel+0x214/0x33c
      
      This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
      management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
      
      Fixes: 8fc9cf42 ("slab: make more slab management structure off the slab")
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      64615901