1. 28 Oct, 2016 15 commits
    • Steffen Maier's avatar
      zfcp: fix ELS/GS request&response length for hardware data router · f2d495a3
      Steffen Maier authored
      commit 70369f8e upstream.
      
      In the hardware data router case, introduced with kernel 3.2
      commit 86a9668a ("[SCSI] zfcp: support for hardware data router")
      the ELS/GS request&response length needs to be initialized
      as in the chained SBAL case.
      
      Otherwise, the FCP channel rejects ELS requests with
      FSF_REQUEST_SIZE_TOO_LARGE.
      
      Such ELS requests can be issued by user space through BSG / HBA API,
      or zfcp itself uses ADISC ELS for remote port link test on RSCN.
      The latter can cause a short path outage due to
      unnecessary remote target port recovery because the always
      failing ADISC cannot detect extremely short path interruptions
      beyond the local FCP channel.
      
      Below example is decoded with zfcpdbf from s390-tools:
      
      Timestamp      : ...
      Area           : SAN
      Subarea        : 00
      Level          : 1
      Exception      : -
      CPU id         : ..
      Caller         : zfcp_dbf_san_req+0408
      Record id      : 1
      Tag            : fssels1
      Request id     : 0x<reqid>
      Destination ID : 0x00<target d_id>
      Payload info   : 52000000 00000000 <our wwpn       >           [ADISC]
                       <our wwnn       > 00<s_id> 00000000
                       00000000 00000000 00000000 00000000
      
      Timestamp      : ...
      Area           : HBA
      Subarea        : 00
      Level          : 1
      Exception      : -
      CPU id         : ..
      Caller         : zfcp_dbf_hba_fsf_res+0740
      Record id      : 1
      Tag            : fs_ferr
      Request id     : 0x<reqid>
      Request status : 0x00000010
      FSF cmnd       : 0x0000000b               [FSF_QTCB_SEND_ELS]
      FSF sequence no: 0x...
      FSF issued     : ...
      FSF stat       : 0x00000061		  [FSF_REQUEST_SIZE_TOO_LARGE]
      FSF stat qual  : 00000000 00000000 00000000 00000000
      Prot stat      : 0x00000100
      Prot stat qual : 00000000 00000000 00000000 00000000
      Signed-off-by: default avatarSteffen Maier <maier@linux.vnet.ibm.com>
      Fixes: 86a9668a ("[SCSI] zfcp: support for hardware data router")
      Reviewed-by: default avatarBenjamin Block <bblock@linux.vnet.ibm.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f2d495a3
    • Steffen Maier's avatar
      zfcp: fix fc_host port_type with NPIV · 36dc83b5
      Steffen Maier authored
      commit bd77befa upstream.
      
      For an NPIV-enabled FCP device, zfcp can erroneously show
      "NPort (fabric via point-to-point)" instead of "NPIV VPORT"
      for the port_type sysfs attribute of the corresponding
      fc_host.
      s390-tools that can be affected are dbginfo.sh and ziomon.
      
      zfcp_fsf_exchange_config_evaluate() ignores
      fsf_qtcb_bottom_config.connection_features indicating NPIV
      and only sets fc_host_port_type to FC_PORTTYPE_NPORT if
      fsf_qtcb_bottom_config.fc_topology is FSF_TOPO_FABRIC.
      
      Only the independent zfcp_fsf_exchange_port_evaluate()
      evaluates connection_features to overwrite fc_host_port_type
      to FC_PORTTYPE_NPIV in case of NPIV.
      Code was introduced with upstream kernel 2.6.30
      commit 0282985d
      ("[SCSI] zfcp: Report fc_host_port_type as NPIV").
      
      This works during FCP device recovery (such as set online)
      because it performs FSF_QTCB_EXCHANGE_CONFIG_DATA followed by
      FSF_QTCB_EXCHANGE_PORT_DATA in sequence.
      
      However, the zfcp-specific scsi host sysfs attributes
      "requests", "megabytes", or "seconds_active" trigger only
      zfcp_fsf_exchange_config_evaluate() resetting fc_host
      port_type to FC_PORTTYPE_NPORT despite NPIV.
      
      The zfcp-specific scsi host sysfs attribute "utilization"
      triggers only zfcp_fsf_exchange_port_evaluate() correcting
      the fc_host port_type again in case of NPIV.
      
      Evaluate fsf_qtcb_bottom_config.connection_features
      in zfcp_fsf_exchange_config_evaluate() where it belongs to.
      Signed-off-by: default avatarSteffen Maier <maier@linux.vnet.ibm.com>
      Fixes: 0282985d ("[SCSI] zfcp: Report fc_host_port_type as NPIV")
      Reviewed-by: default avatarBenjamin Block <bblock@linux.vnet.ibm.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      36dc83b5
    • Laurent Dufour's avatar
      powerpc/pseries: Fix stack corruption in htpe code · 429074e8
      Laurent Dufour authored
      commit 05af40e8 upstream.
      
      This commit fixes a stack corruption in the pseries specific code dealing
      with the huge pages.
      
      In __pSeries_lpar_hugepage_invalidate() the buffer used to pass arguments
      to the hypervisor is not large enough. This leads to a stack corruption
      where a previously saved register could be corrupted leading to unexpected
      result in the caller, like the following panic:
      
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: virtio_balloon ip_tables x_tables autofs4
        virtio_blk 8139too virtio_pci virtio_ring 8139cp virtio
        CPU: 11 PID: 1916 Comm: mmstress Not tainted 4.8.0 #76
        task: c000000005394880 task.stack: c000000005570000
        NIP: c00000000027bf6c LR: c00000000027bf64 CTR: 0000000000000000
        REGS: c000000005573820 TRAP: 0300   Not tainted  (4.8.0)
        MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 84822884  XER: 20000000
        CFAR: c00000000010a924 DAR: 420000000014e5e0 DSISR: 40000000 SOFTE: 1
        GPR00: c00000000027bf64 c000000005573aa0 c000000000e02800 c000000004447964
        GPR04: c00000000404de18 c000000004d38810 00000000042100f5 00000000f5002104
        GPR08: e0000000f5002104 0000000000000001 042100f5000000e0 00000000042100f5
        GPR12: 0000000000002200 c00000000fe02c00 c00000000404de18 0000000000000000
        GPR16: c1ffffffffffe7ff 00003fff62000000 420000000014e5e0 00003fff63000000
        GPR20: 0008000000000000 c0000000f7014800 0405e600000000e0 0000000000010000
        GPR24: c000000004d38810 c000000004447c10 c00000000404de18 c000000004447964
        GPR28: c000000005573b10 c000000004d38810 00003fff62000000 420000000014e5e0
        NIP [c00000000027bf6c] zap_huge_pmd+0x4c/0x470
        LR [c00000000027bf64] zap_huge_pmd+0x44/0x470
        Call Trace:
        [c000000005573aa0] [c00000000027bf64] zap_huge_pmd+0x44/0x470 (unreliable)
        [c000000005573af0] [c00000000022bbd8] unmap_page_range+0xcf8/0xed0
        [c000000005573c30] [c00000000022c2d4] unmap_vmas+0x84/0x120
        [c000000005573c80] [c000000000235448] unmap_region+0xd8/0x1b0
        [c000000005573d80] [c0000000002378f0] do_munmap+0x2d0/0x4c0
        [c000000005573df0] [c000000000237be4] SyS_munmap+0x64/0xb0
        [c000000005573e30] [c000000000009560] system_call+0x38/0x108
        Instruction dump:
        fbe1fff8 fb81ffe0 7c7f1b78 7ca32b78 7cbd2b78 f8010010 7c9a2378 f821ffb1
        7cde3378 4bfffea9 7c7b1b79 41820298 <e87f0000> 48000130 7fa5eb78 7fc4f378
      
      Most of the time, the bug is surfacing in a caller up in the stack from
      __pSeries_lpar_hugepage_invalidate() which is quite confusing.
      
      This bug is pending since v3.11 but was hidden if a caller of the
      caller of __pSeries_lpar_hugepage_invalidate() has pushed the corruped
      register (r18 in this case) in the stack and is not using it until
      restoring it. GCC 6.2.0 seems to raise it more frequently.
      
      This commit also change the definition of the parameter buffer in
      pSeries_lpar_flush_hash_range() to rely on the global define
      PLPAR_HCALL9_BUFSIZE (no functional change here).
      
      Fixes: 1a527286 ("powerpc: Optimize hugepage invalidate")
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      429074e8
    • Paul Mackerras's avatar
      powerpc/64: Fix incorrect return value from __copy_tofrom_user · 61bc5e39
      Paul Mackerras authored
      commit 1a34439e upstream.
      
      Debugging a data corruption issue with virtio-net/vhost-net led to
      the observation that __copy_tofrom_user was occasionally returning
      a value 16 larger than it should.  Since the return value from
      __copy_tofrom_user is the number of bytes not copied, this means
      that __copy_tofrom_user can occasionally return a value larger
      than the number of bytes it was asked to copy.  In turn this can
      cause higher-level copy functions such as copy_page_to_iter_iovec
      to corrupt memory by copying data into the wrong memory locations.
      
      It turns out that the failing case involves a fault on the store
      at label 79, and at that point the first unmodified byte of the
      destination is at R3 + 16.  Consequently the exception handler
      for that store needs to add 16 to R3 before using it to work out
      how many bytes were not copied, but in this one case it was not
      adding the offset to R3.  To fix it, this moves the label 179 to
      the point where we add 16 to R3.  I have checked manually all the
      exception handlers for the loads and stores in this code and the
      rest of them are correct (it would be excellent to have an
      automated test of all the exception cases).
      
      This bug has been present since this code was initially
      committed in May 2002 to Linux version 2.5.20.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      61bc5e39
    • Gavin Shan's avatar
      powerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data() · b63c03af
      Gavin Shan authored
      commit 5adaf862 upstream.
      
      This fixes the warnings reported from sparse:
      
        pci.c:312:33: warning: restricted __be64 degrades to integer
        pci.c:313:33: warning: restricted __be64 degrades to integer
      
      Fixes: cee72d5b ("powerpc/powernv: Display diag data on p7ioc EEH errors")
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b63c03af
    • Anton Blanchard's avatar
      powerpc/vdso64: Use double word compare on pointers · 62c65d87
      Anton Blanchard authored
      commit 5045ea37 upstream.
      
      __kernel_get_syscall_map() and __kernel_clock_getres() use cmpli to
      check if the passed in pointer is non zero. cmpli maps to a 32 bit
      compare on binutils, so we ignore the top 32 bits.
      
      A simple test case can be created by passing in a bogus pointer with
      the bottom 32 bits clear. Using a clk_id that is handled by the VDSO,
      then one that is handled by the kernel shows the problem:
      
        printf("%d\n", clock_getres(CLOCK_REALTIME, (void *)0x100000000));
        printf("%d\n", clock_getres(CLOCK_BOOTTIME, (void *)0x100000000));
      
      And we get:
      
        0
        -1
      
      The bigger issue is if we pass a valid pointer with the bottom 32 bits
      clear, in this case we will return success but won't write any data
      to the pointer.
      
      I stumbled across this issue because the LLVM integrated assembler
      doesn't accept cmpli with 3 arguments. Fix this by converting them to
      cmpldi.
      
      Fixes: a7f290da ("[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel")
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      62c65d87
    • Bart Van Assche's avatar
      dm: mark request_queue dead before destroying the DM device · 2210e6dc
      Bart Van Assche authored
      commit 3b785fbc upstream.
      
      This avoids that new requests are queued while __dm_destroy() is in
      progress.
      
      [js] use md->queue instead of non-present helper
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2210e6dc
    • Andrew Bresticker's avatar
      pstore/ram: Use memcpy_fromio() to save old buffer · 74bf47ca
      Andrew Bresticker authored
      commit d771fdf9 upstream.
      
      The ramoops buffer may be mapped as either I/O memory or uncached
      memory.  On ARM64, this results in a device-type (strongly-ordered)
      mapping.  Since unnaligned accesses to device-type memory will
      generate an alignment fault (regardless of whether or not strict
      alignment checking is enabled), it is not safe to use memcpy().
      memcpy_fromio() is guaranteed to only use aligned accesses, so use
      that instead.
      Signed-off-by: default avatarAndrew Bresticker <abrestic@chromium.org>
      Signed-off-by: default avatarEnric Balletbo Serra <enric.balletbo@collabora.com>
      Reviewed-by: default avatarPuneet Kumar <puneetster@chromium.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      74bf47ca
    • Furquan Shaikh's avatar
      pstore/ram: Use memcpy_toio instead of memcpy · a883eada
      Furquan Shaikh authored
      commit 7e75678d upstream.
      
      persistent_ram_update uses vmap / iomap based on whether the buffer is in
      memory region or reserved region. However, both map it as non-cacheable
      memory. For armv8 specifically, non-cacheable mapping requests use a
      memory type that has to be accessed aligned to the request size. memcpy()
      doesn't guarantee that.
      Signed-off-by: default avatarFurquan Shaikh <furquan@google.com>
      Signed-off-by: default avatarEnric Balletbo Serra <enric.balletbo@collabora.com>
      Reviewed-by: default avatarAaron Durbin <adurbin@chromium.org>
      Reviewed-by: default avatarOlof Johansson <olofj@chromium.org>
      Tested-by: default avatarFurquan Shaikh <furquan@chromium.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a883eada
    • Sebastian Andrzej Siewior's avatar
      pstore/core: drop cmpxchg based updates · 16ba2808
      Sebastian Andrzej Siewior authored
      commit d5a9bf0b upstream.
      
      I have here a FPGA behind PCIe which exports SRAM which I use for
      pstore. Now it seems that the FPGA no longer supports cmpxchg based
      updates and writes back 0xff…ff and returns the same.  This leads to
      crash during crash rendering pstore useless.
      Since I doubt that there is much benefit from using cmpxchg() here, I am
      dropping this atomic access and use the spinlock based version.
      
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Rabin Vincent <rabinv@axis.com>
      Tested-by: default avatarRabin Vincent <rabinv@axis.com>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: default avatarGuenter Roeck <linux@roeck-us.net>
      [kees: remove "_locked" suffix since it's the only option now]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      16ba2808
    • Daniel Glöckner's avatar
      mmc: block: don't use CMD23 with very old MMC cards · acc2317b
      Daniel Glöckner authored
      commit 0ed50abb upstream.
      
      CMD23 aka SET_BLOCK_COUNT was introduced with MMC v3.1.
      Older versions of the specification allowed to terminate
      multi-block transfers only with CMD12.
      
      The patch fixes the following problem:
      
        mmc0: new MMC card at address 0001
        mmcblk0: mmc0:0001 SDMB-16 15.3 MiB
        mmcblk0: timed out sending SET_BLOCK_COUNT command, card status 0x400900
        ...
        blk_update_request: I/O error, dev mmcblk0, sector 0
        Buffer I/O error on dev mmcblk0, logical block 0, async page read
         mmcblk0: unable to read partition table
      Signed-off-by: default avatarDaniel Glöckner <dg@emlix.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      acc2317b
    • Jan Remmet's avatar
      regulator: tps65910: Work around silicon erratum SWCZ010 · 402ae64c
      Jan Remmet authored
      commit 8f9165c9 upstream.
      
      http://www.ti.com/lit/pdf/SWCZ010:
        DCDC o/p voltage can go higher than programmed value
      
      Impact:
      VDDI, VDD2, and VIO output programmed voltage level can go higher than
      expected or crash, when coming out of PFM to PWM mode or using DVFS.
      
      Description:
      When DCDC CLK SYNC bits are 11/01:
      * VIO 3-MHz oscillator is the source clock of the digital core and input
        clock of VDD1 and VDD2
      * Turn-on of VDD1 and VDD2 HSD PFETis synchronized or at a constant
        phase shift
      * Current pulled though VCC1+VCC2 is Iload(VDD1) + Iload(VDD2)
      * The 3 HSD PFET will be turned-on at the same time, causing the highest
        possible switching noise on the application. This noise level depends
        on the layout, the VBAT level, and the load current. The noise level
        increases with improper layout.
      
      When DCDC CLK SYNC bits are 00:
      * VIO 3-MHz oscillator is the source clock of digital core
      * VDD1 and VDD2 are running on their own 3-MHz oscillator
      * Current pulled though VCC1+VCC2 average of Iload(VDD1) + Iload(VDD2)
      * The switching noise of the 3 SMPS will be randomly spread over time,
        causing lower overall switching noise.
      
      Workaround:
      Set DCDCCTRL_REG[1:0]= 00.
      Signed-off-by: default avatarJan Remmet <j.remmet@phytec.de>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      402ae64c
    • Liu Gang's avatar
      gpio: mpc8xxx: Correct irq handler function · 11db4e79
      Liu Gang authored
      commit d71cf15b upstream.
      
      From the beginning of the gpio-mpc8xxx.c, the "handle_level_irq"
      has being used to handle GPIO interrupts in the PowerPC/Layerscape
      platforms. But actually, almost all PowerPC/Layerscape platforms
      assert an interrupt request upon either a high-to-low change or
      any change on the state of the signal.
      
      So the "handle_level_irq" is not reasonable for PowerPC/Layerscape
      GPIO interrupt, it should be "handle_edge_irq". Otherwise the system
      may lost some interrupts from the PIN's state changes.
      Signed-off-by: default avatarLiu Gang <Gang.Liu@nxp.com>
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      11db4e79
    • Joe Perches's avatar
      ipc: remove use of seq_printf return value · dbeba38c
      Joe Perches authored
      commit 7f032d6e upstream.
      
      The seq_printf return value, because it's frequently misused,
      will eventually be converted to void.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      dbeba38c
    • Liu ShuoX's avatar
      pstore: Fix buffer overflow while write offset equal to buffer size · e3286c43
      Liu ShuoX authored
      commit 017321cf upstream.
      
      In case new offset is equal to prz->buffer_size, it won't wrap at this
      time and will return old(overflow) value next time.
      Signed-off-by: default avatarLiu ShuoX <shuox.liu@intel.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e3286c43
  2. 25 Oct, 2016 6 commits
    • Glauber Costa's avatar
      cfq: fix starvation of asynchronous writes · 03a140b7
      Glauber Costa authored
      commit 3932a86b upstream.
      
      While debugging timeouts happening in my application workload (ScyllaDB), I have
      observed calls to open() taking a long time, ranging everywhere from 2 seconds -
      the first ones that are enough to time out my application - to more than 30
      seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
      under certain circumnstances, and that's confirmed with the following backtrace
      taken by the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
      Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
      high "Dispatch wait" times, on the dozens of seconds range and consistent with
      the open() times I have saw in that run.
      
      Still from the blktrace output, we can after searching a bit, identify the
      request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
      When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
      requests in flight. An incoming request from another queue can also preempt it
      in such situation before we have the chance to execute anything (as seen in the
      trace above).
      
      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. This flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
      the queue for subsequent requests (unless they are themselves expired)
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
      Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-block@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      03a140b7
    • Ross Zwisler's avatar
      ext4: allow DAX writeback for hole punch · 474832ec
      Ross Zwisler authored
      commit cca32b7e upstream.
      
      Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
      This is because the logic around filemap_write_and_wait_range() in
      ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
      not for dirty DAX exceptional entries.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      474832ec
    • Daeho Jeong's avatar
      ext4: reinforce check of i_dtime when clearing high fields of uid and gid · f0e1bad8
      Daeho Jeong authored
      commit 93e3b4e6 upstream.
      
      Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
      of deleted and evicted inode to fix up interoperability with old
      kernels. However, it checks only i_dtime of an inode to determine
      whether the inode was deleted and evicted, and this is very risky,
      because i_dtime can be used for the pointer maintaining orphan inode
      list, too. We need to further check whether the i_dtime is being
      used for the orphan inode list even if the i_dtime is not NULL.
      
      We found that high 16-bit fields of uid/gid of inode are unintentionally
      and permanently cleared when the inode truncation is just triggered,
      but not finished, and the inode metadata, whose high uid/gid bits are
      cleared, is written on disk, and the sudden power-off follows that
      in order.
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f0e1bad8
    • Brian King's avatar
      scsi: ibmvfc: Fix I/O hang when port is not mapped · 393328a0
      Brian King authored
      commit 07d0e9a8 upstream.
      
      If a VFC port gets unmapped in the VIOS, it may not respond with a CRQ
      init complete following H_REG_CRQ. If this occurs, we can end up having
      called scsi_block_requests and not a resulting unblock until the init
      complete happens, which may never occur, and we end up hanging I/O
      requests.  This patch ensures the host action stay set to
      IBMVFC_HOST_ACTION_TGT_DEL so we move all rports into devloss state and
      unblock unless we receive an init complete.
      Signed-off-by: default avatarBrian King <brking@linux.vnet.ibm.com>
      Acked-by: default avatarTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      393328a0
    • Mike Galbraith's avatar
      reiserfs: Unlock superblock before calling reiserfs_quota_on_mount() · 643a9a21
      Mike Galbraith authored
      commit 420902c9 upstream.
      
      If we hold the superblock lock while calling reiserfs_quota_on_mount(), we can
      deadlock our own worker - mount blocks kworker/3:2, sleeps forever more.
      
      crash> ps|grep UN
          715      2   3  ffff880220734d30  UN   0.0       0      0  [kworker/3:2]
         9369   9341   2  ffff88021ffb7560  UN   1.3  493404 123184  Xorg
         9665   9664   3  ffff880225b92ab0  UN   0.0   47368    812  udisks-daemon
        10635  10403   3  ffff880222f22c70  UN   0.0   14904    936  mount
      crash> bt ffff880220734d30
      PID: 715    TASK: ffff880220734d30  CPU: 3   COMMAND: "kworker/3:2"
       #0 [ffff8802244c3c20] schedule at ffffffff8144584b
       #1 [ffff8802244c3cc8] __rt_mutex_slowlock at ffffffff814472b3
       #2 [ffff8802244c3d28] rt_mutex_slowlock at ffffffff814473f5
       #3 [ffff8802244c3dc8] reiserfs_write_lock at ffffffffa05f28fd [reiserfs]
       #4 [ffff8802244c3de8] flush_async_commits at ffffffffa05ec91d [reiserfs]
       #5 [ffff8802244c3e08] process_one_work at ffffffff81073726
       #6 [ffff8802244c3e68] worker_thread at ffffffff81073eba
       #7 [ffff8802244c3ec8] kthread at ffffffff810782e0
       #8 [ffff8802244c3f48] kernel_thread_helper at ffffffff81450064
      crash> rd ffff8802244c3cc8 10
      ffff8802244c3cc8:  ffffffff814472b3 ffff880222f23250   .rD.....P2."....
      ffff8802244c3cd8:  0000000000000000 0000000000000286   ................
      ffff8802244c3ce8:  ffff8802244c3d30 ffff880220734d80   0=L$.....Ms ....
      ffff8802244c3cf8:  ffff880222e8f628 0000000000000000   (.."............
      ffff8802244c3d08:  0000000000000000 0000000000000002   ................
      crash> struct rt_mutex ffff880222e8f628
      struct rt_mutex {
        wait_lock = {
          raw_lock = {
            slock = 65537
          }
        },
        wait_list = {
          node_list = {
            next = 0xffff8802244c3d48,
            prev = 0xffff8802244c3d48
          }
        },
        owner = 0xffff880222f22c71,
        save_state = 0
      }
      crash> bt 0xffff880222f22c70
      PID: 10635  TASK: ffff880222f22c70  CPU: 3   COMMAND: "mount"
       #0 [ffff8802216a9868] schedule at ffffffff8144584b
       #1 [ffff8802216a9910] schedule_timeout at ffffffff81446865
       #2 [ffff8802216a99a0] wait_for_common at ffffffff81445f74
       #3 [ffff8802216a9a30] flush_work at ffffffff810712d3
       #4 [ffff8802216a9ab0] schedule_on_each_cpu at ffffffff81074463
       #5 [ffff8802216a9ae0] invalidate_bdev at ffffffff81178aba
       #6 [ffff8802216a9af0] vfs_load_quota_inode at ffffffff811a3632
       #7 [ffff8802216a9b50] dquot_quota_on_mount at ffffffff811a375c
       #8 [ffff8802216a9b80] finish_unfinished at ffffffffa05dd8b0 [reiserfs]
       #9 [ffff8802216a9cc0] reiserfs_fill_super at ffffffffa05de825 [reiserfs]
          RIP: 00007f7b9303997a  RSP: 00007ffff443c7a8  RFLAGS: 00010202
          RAX: 00000000000000a5  RBX: ffffffff8144ef12  RCX: 00007f7b932e9ee0
          RDX: 00007f7b93d9a400  RSI: 00007f7b93d9a3e0  RDI: 00007f7b93d9a3c0
          RBP: 00007f7b93d9a2c0   R8: 00007f7b93d9a550   R9: 0000000000000001
          R10: ffffffffc0ed040e  R11: 0000000000000202  R12: 000000000000040e
          R13: 0000000000000000  R14: 00000000c0ed040e  R15: 00007ffff443ca20
          ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Acked-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: default avatarMike Galbraith <mgalbraith@suse.de>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      643a9a21
    • Guilherme G Piccoli's avatar
      i40e: avoid NULL pointer dereference and recursive errors on early PCI error · 0f3caac7
      Guilherme G Piccoli authored
      commit edfc23ee upstream.
      
      Although rare, it's possible to hit PCI error early on device
      probe, meaning possibly some structs are not entirely initialized,
      and some might even be completely uninitialized, leading to NULL
      pointer dereference.
      
      The i40e driver currently presents a "bad" behavior if device hits
      such early PCI error: firstly, the struct i40e_pf might not be
      attached to pci_dev yet, leading to a NULL pointer dereference on
      access to pf->state.
      
      Even checking if the struct is NULL and avoiding the access in that
      case isn't enough, since the driver cannot recover from PCI error
      that early; in our experiments we saw multiple failures on kernel
      log, like:
      
        [549.664] i40e 0007:01:00.1: Initial pf_reset failed: -15
        [549.664] i40e: probe of 0007:01:00.1 failed with error -15
        [...]
        [871.644] i40e 0007:01:00.1: The driver for the device stopped because the
        device firmware failed to init. Try updating your NVM image.
        [871.644] i40e: probe of 0007:01:00.1 failed with error -32
        [...]
        [872.516] i40e 0007:01:00.0: ARQ: Unknown event 0x0000 ignored
      
      Between the first probe failure (error -15) and the second (error -32)
      another PCI error happened due to the first bad probe. Also, driver
      started to flood console with those ARQ event messages.
      
      This patch will prevent these issues by allowing error recovery
      mechanism to remove the failed device from the system instead of
      trying to recover from early PCI errors during device probe.
      Signed-off-by: default avatarGuilherme G Piccoli <gpiccoli@linux.vnet.ibm.com>
      Acked-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0f3caac7
  3. 20 Oct, 2016 6 commits
    • Jiri Slaby's avatar
      Linux 3.12.66 · e5dbd4b8
      Jiri Slaby authored
      e5dbd4b8
    • Linus Torvalds's avatar
      mm: remove gup_flags FOLL_WRITE games from __get_user_pages() · f949fcd7
      Linus Torvalds authored
      commit 19be0eaf upstream.
      
      This is an ancient bug that was actually attempted to be fixed once
      (badly) by me eleven years ago in commit 4ceb5db9 ("Fix
      get_user_pages() race for write access") but that was then undone due to
      problems on s390 by commit f33ea7f4 ("fix get_user_pages bug").
      
      In the meantime, the s390 situation has long been fixed, and we can now
      fix it by checking the pte_dirty() bit properly (and do it better).  The
      s390 dirty bit was implemented in abf09bed ("s390/mm: implement
      software dirty bits") which made it into v3.9.  Earlier kernels will
      have to look at the page state itself.
      
      Also, the VM has become more scalable, and what used a purely
      theoretical race back then has become easier to trigger.
      
      To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
      we already did a COW" rather than play racy games with FOLL_WRITE that
      is very fundamental, and then use the pte dirty flag to validate that
      the FOLL_COW flag is still valid.
      Reported-and-tested-by: default avatarPhil "not Paul" Oester <kernel@linuxace.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f949fcd7
    • H.J. Lu's avatar
      x86/build: Build compressed x86 kernels as PIE · 7faace13
      H.J. Lu authored
      commit 6d92bc9d upstream.
      
      The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X
      relocation to get the symbol address in PIC.  When the compressed x86
      kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations
      to their fixed symbol addresses.  However, when the compressed x86
      kernel is loaded at a different address, it leads to the following
      load failure:
      
        Failed to allocate space for phdrs
      
      during the decompression stage.
      
      If the compressed x86 kernel is relocatable at run-time, it should be
      compiled with -fPIE, instead of -fPIC, if possible and should be built as
      Position Independent Executable (PIE) so that linker won't optimize
      R_386_GOT32X relocation to its fixed symbol address.
      
      Older linkers generate R_386_32 relocations against locally defined
      symbols, _bss, _ebss, _got and _egot, in PIE.  It isn't wrong, just less
      optimal than R_386_RELATIVE.  But the x86 kernel fails to properly handle
      R_386_32 relocations when relocating the kernel.  To generate
      R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as
      hidden in both 32-bit and 64-bit x86 kernels.
      
      To build a 64-bit compressed x86 kernel as PIE, we need to disable the
      relocation overflow check to avoid relocation overflow errors. We do
      this with a new linker command-line option, -z noreloc-overflow, which
      got added recently:
      
       commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7
       Author: H.J. Lu <hjl.tools@gmail.com>
       Date:   Tue Mar 15 11:07:06 2016 -0700
      
          Add -z noreloc-overflow option to x86-64 ld
      
          Add -z noreloc-overflow command-line option to the x86-64 ELF linker to
          disable relocation overflow check.  This can be used to avoid relocation
          overflow check if there will be no dynamic relocation overflow at
          run-time.
      
      The 64-bit compressed x86 kernel is built as PIE only if the linker supports
      -z noreloc-overflow.  So far 64-bit relocatable compressed x86 kernel
      boots fine even when it is built as a normal executable.
      Signed-off-by: default avatarH.J. Lu <hjl.tools@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      [ Edited the changelog and comments. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7faace13
    • Arend Van Spriel's avatar
      brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap() · bfce0a40
      Arend Van Spriel authored
      commit ded89912 upstream.
      
      User-space can choose to omit NL80211_ATTR_SSID and only provide raw
      IE TLV data. When doing so it can provide SSID IE with length exceeding
      the allowed size. The driver further processes this IE copying it
      into a local variable without checking the length. Hence stack can be
      corrupted and used as exploit.
      Reported-by: default avatarDaxing Guo <freener.gdx@gmail.com>
      Reviewed-by: default avatarHante Meuleman <hante.meuleman@broadcom.com>
      Reviewed-by: default avatarPieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
      Reviewed-by: default avatarFranky Lin <franky.lin@broadcom.com>
      Signed-off-by: default avatarArend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Acked-by: default avatarBenjamin Poirier <bpoirier@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bfce0a40
    • Jaewon Kim's avatar
      ratelimit: fix bug in time interval by resetting right begin time · 5637eef2
      Jaewon Kim authored
      commit c2594bc3 upstream.
      
      rs->begin in ratelimit is set in two cases.
       1) when rs->begin was not initialized
       2) when rs->interval was passed
      
      For case #2, current ratelimit sets the begin to 0.  This incurrs
      improper suppression.  The begin value will be set in the next ratelimit
      call by 1).  Then the time interval check will be always false, and
      rs->printed will not be initialized.  Although enough time passed,
      ratelimit may return 0 if rs->printed is not less than rs->burst.  To
      reset interval properly, begin should be jiffies rather than 0.
      
      For an example code below:
      
          static DEFINE_RATELIMIT_STATE(mylimit, 1, 1);
          for (i = 1; i <= 10; i++) {
              if (__ratelimit(&mylimit))
                  printk("ratelimit test count %d\n", i);
              msleep(3000);
          }
      
      test result in the current code shows suppression even there is 3 seconds sleep.
      
        [  78.391148] ratelimit test count 1
        [  81.295988] ratelimit test count 2
        [  87.315981] ratelimit test count 4
        [  93.336267] ratelimit test count 6
        [  99.356031] ratelimit test count 8
        [ 105.376367] ratelimit test count 10
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5637eef2
    • Junichi Nomura's avatar
      dm: fix AB-BA deadlock in __dm_destroy() · 52342f2c
      Junichi Nomura authored
      commit 2a708cff upstream.
      
      __dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and
      suspend_lock in reverse order.  Doing so can cause AB-BA deadlock:
      
        __dm_destroy                    dm_swap_table
        ---------------------------------------------------
                                        mutex_lock(suspend_lock)
        dm_get_live_table()
          srcu_read_lock(io_barrier)
                                        dm_sync_table()
                                          synchronize_srcu(io_barrier)
                                            .. waiting for dm_put_live_table()
        mutex_lock(suspend_lock)
          .. waiting for suspend_lock
      
      Fix this by taking the locks in proper order.
      Signed-off-by: default avatarJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Fixes: ab7c7bb6 ("dm: hold suspend_lock while suspending device during device deletion")
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      52342f2c
  4. 17 Oct, 2016 9 commits
  5. 07 Oct, 2016 4 commits