1. 28 Oct, 2016 27 commits
  2. 22 Oct, 2016 13 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.8.4 · a2b42342
      Greg Kroah-Hartman authored
      a2b42342
    • Glauber Costa's avatar
      cfq: fix starvation of asynchronous writes · 9362516c
      Glauber Costa authored
      commit 3932a86b upstream.
      
      While debugging timeouts happening in my application workload (ScyllaDB), I have
      observed calls to open() taking a long time, ranging everywhere from 2 seconds -
      the first ones that are enough to time out my application - to more than 30
      seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
      under certain circumnstances, and that's confirmed with the following backtrace
      taken by the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
      Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
      high "Dispatch wait" times, on the dozens of seconds range and consistent with
      the open() times I have saw in that run.
      
      Still from the blktrace output, we can after searching a bit, identify the
      request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
      When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
      requests in flight. An incoming request from another queue can also preempt it
      in such situation before we have the chance to execute anything (as seen in the
      trace above).
      
      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. This flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
      the queue for subsequent requests (unless they are themselves expired)
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
      Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-block@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9362516c
    • Vishal Verma's avatar
      acpi, nfit: check for the correct event code in notifications · afac7081
      Vishal Verma authored
      commit c09f1218 upstream.
      
      Commit 20985164 "acpi: nfit: Add support for hot-add" added
      support for _FIT notifications, but it neglected to verify the
      notification event code matches the one in the ACPI spec for
      "NFIT Update". Currently there is only one code in the spec, but
      once additional codes are added, older kernels (without this fix)
      will misbehave by assuming all event notifications are for an
      NFIT Update.
      
      Fixes: 20985164 ("acpi: nfit: Add support for hot-add")
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarLinda Knippers <linda.knippers@hpe.com>
      Signed-off-by: default avatarVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      afac7081
    • Laszlo Ersek's avatar
      drm: virtio: reinstate drm_virtio_set_busid() · 3245ff58
      Laszlo Ersek authored
      commit c2cbc38b upstream.
      
      Before commit a3257256 ("drm: Lobotomize set_busid nonsense for !pci
      drivers"), several DRM drivers for platform devices used to expose an
      explicit "drm_driver.set_busid" callback, invariably backed by
      drm_platform_set_busid().
      
      Commit a3257256 removed drm_platform_set_busid(), along with the
      referring .set_busid field initializations. This was justified because
      interchangeable functionality had been implemented in drm_dev_alloc() /
      drm_dev_init(), which DRM_IOCTL_SET_VERSION would rely on going forward.
      
      However, commit a3257256 also removed drm_virtio_set_busid(), for
      which the same consolidation was not appropriate: this .set_busid callback
      had been implemented with drm_pci_set_busid(), and not
      drm_platform_set_busid(). The error regressed Xorg/xserver on QEMU's
      "virtio-vga" card; the drmGetBusid() function from libdrm would no longer
      return stable PCI identifiers like "pci:0000:00:02.0", but rather unstable
      platform ones like "virtio0".
      
      Reinstate drm_virtio_set_busid() with judicious use of
      
        git checkout -p a3257256^ -- drivers/gpu/drm/virtio
      
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Emil Velikov <emil.l.velikov@gmail.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Joachim Frieben <jfrieben@hotmail.com>
      Reported-by: default avatarJoachim Frieben <jfrieben@hotmail.com>
      Fixes: a3257256
      Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1366842Signed-off-by: default avatarLaszlo Ersek <lersek@redhat.com>
      Reviewed-by: default avatarEmil Velikov <emil.l.velikov@gmail.com>
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3245ff58
    • David Howells's avatar
      cachefiles: Fix attempt to read i_blocks after deleting file [ver #2] · 336f2e1e
      David Howells authored
      commit a818101d upstream.
      
      An NULL-pointer dereference happens in cachefiles_mark_object_inactive()
      when it tries to read i_blocks so that it can tell the cachefilesd daemon
      how much space it's making available.
      
      The problem is that cachefiles_drop_object() calls
      cachefiles_mark_object_inactive() after calling cachefiles_delete_object()
      because the object being marked active staves off attempts to (re-)use the
      file at that filename until after it has been deleted.  This means that
      d_inode is NULL by the time we come to try to access it.
      
      To fix the problem, have the caller of cachefiles_mark_object_inactive()
      supply the number of blocks freed up.
      
      Without this, the following oops may occur:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
      ...
      CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G          I    ------------   3.10.0-470.el7.x86_64 #1
      Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011
      Workqueue: fscache_object fscache_object_work_func [fscache]
      task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000
      RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
      RSP: 0018:ffff8800b77c3d70  EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034
      RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8
      RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000
      R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600
      R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498
      FS:  0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0
       ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658
       ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600
      Call Trace:
       [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles]
       [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache]
       [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache]
       [<ffffffff810a605b>] process_one_work+0x17b/0x470
       [<ffffffff810a6e96>] worker_thread+0x126/0x410
       [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460
       [<ffffffff810ae64f>] kthread+0xcf/0xe0
       [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff81695418>] ret_from_fork+0x58/0x90
       [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
      
      The oopsing code shows:
      
      	callq  0xffffffff810af6a0 <wake_up_bit>
      	mov    0xf8(%r12),%rax
      	mov    0x30(%rax),%rax
      	mov    0x98(%rax),%rax   <---- oops here
      	lock add %rax,0x130(%rbx)
      
      where this is:
      
      	d_backing_inode(object->dentry)->i_blocks
      
      Fixes: a5b3a80b (CacheFiles: Provide read-and-reset release counters for cachefilesd)
      Reported-by: default avatarJianhong Yin <jiyin@redhat.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Reviewed-by: default avatarSteve Dickson <steved@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      336f2e1e
    • Miklos Szeredi's avatar
      vfs: move permission checking into notify_change() for utimes(NULL) · 7bf99896
      Miklos Szeredi authored
      commit f2b20f6e upstream.
      
      This fixes a bug where the permission was not properly checked in
      overlayfs.  The testcase is ltp/utimensat01.
      
      It is also cleaner and safer to do the permission checking in the vfs
      helper instead of the caller.
      
      This patch introduces an additional ia_valid flag ATTR_TOUCH (since
      touch(1) is the most obvious user of utimes(NULL)) that is passed into
      notify_change whenever the conditions for this special permission checking
      mode are met.
      Reported-by: default avatarAihua Zhang <zhangaihua1@huawei.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Tested-by: default avatarAihua Zhang <zhangaihua1@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7bf99896
    • Marcelo Ricardo Leitner's avatar
      dlm: free workqueues after the connections · 6b746940
      Marcelo Ricardo Leitner authored
      commit 3a8db798 upstream.
      
      After backporting commit ee44b4bc ("dlm: use sctp 1-to-1 API")
      series to a kernel with an older workqueue which didn't use RCU yet, it
      was noticed that we are freeing the workqueues in dlm_lowcomms_stop()
      too early as free_conn() will try to access that memory for canceling
      the queued works if any.
      
      This issue was introduced by commit 0d737a8c as before it such
      attempt to cancel the queued works wasn't performed, so the issue was
      not present.
      
      This patch fixes it by simply inverting the free order.
      
      Fixes: 0d737a8c ("dlm: fix race while closing connections")
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid Teigland <teigland@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b746940
    • Marcelo Cerri's avatar
      crypto: vmx - Fix memory corruption caused by p8_ghash · 3fae7862
      Marcelo Cerri authored
      commit 80da44c2 upstream.
      
      This patch changes the p8_ghash driver to use ghash-generic as a fixed
      fallback implementation. This allows the correct value of descsize to be
      defined directly in its shash_alg structure and avoids problems with
      incorrect buffer sizes when its state is exported or imported.
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Fixes: cc333cd6 ("crypto: vmx - Adding GHASH routines for VMX module")
      Signed-off-by: default avatarMarcelo Cerri <marcelo.cerri@canonical.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fae7862
    • Marcelo Cerri's avatar
      crypto: ghash-generic - move common definitions to a new header file · e15e0b84
      Marcelo Cerri authored
      commit a397ba82 upstream.
      
      Move common values and types used by ghash-generic to a new header file
      so drivers can directly use ghash-generic as a fallback implementation.
      
      Fixes: cc333cd6 ("crypto: vmx - Adding GHASH routines for VMX module")
      Signed-off-by: default avatarMarcelo Cerri <marcelo.cerri@canonical.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e15e0b84
    • Jan Kara's avatar
      ext4: unmap metadata when zeroing blocks · fb13b62a
      Jan Kara authored
      commit 9b623df6 upstream.
      
      When zeroing blocks for DAX allocations, we also have to unmap aliases
      in the block device mappings.  Otherwise writeback can overwrite zeros
      with stale data from block device page cache.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb13b62a
    • gmail's avatar
      ext4: release bh in make_indexed_dir · ff50a724
      gmail authored
      commit e81d4477 upstream.
      
      The commit 6050d47a: "ext4: bail out from make_indexed_dir() on
      first error" could end up leaking bh2 in the error path.
      
      [ Also avoid renaming bh2 to bh, which just confuses things --tytso ]
      Signed-off-by: default avataryangsheng <yngsion@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ff50a724
    • Ross Zwisler's avatar
      ext4: allow DAX writeback for hole punch · 99fa4c50
      Ross Zwisler authored
      commit cca32b7e upstream.
      
      Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
      This is because the logic around filemap_write_and_wait_range() in
      ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
      not for dirty DAX exceptional entries.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      99fa4c50
    • Eric Biggers's avatar
      ext4: fix memory leak when symlink decryption fails · 5373f6cc
      Eric Biggers authored
      commit dcce7a46 upstream.
      
      This bug was introduced in v4.8-rc1.
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5373f6cc