Commits · f970b977e068aa54e6eaf916a964a0abaf028afe · Kirill Smelkov / linux

27 Mar, 2020 3 commits

mm/hmm: remove unused code and tidy comments · f970b977

Jason Gunthorpe authored Mar 27, 2020

Delete several functions that are never called, fix some desync between
comments and structure content, toss the now out of date top of file
header, and move one function only used by hmm.c into hmm.c

Link: https://lore.kernel.org/r/20200327200021.29372-4-jgg@ziepe.caReviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

f970b977

mm/hmm: return the fault type from hmm_pte_need_fault() · a3eb13c1

Jason Gunthorpe authored Mar 27, 2020

Using two bools instead of flags return is not necessary and leads to
bugs. Returning a value is easier for the compiler to check and easier to
pass around the code flow.

Convert the two bools into flags and push the change to all callers.

Link: https://lore.kernel.org/r/20200327200021.29372-3-jgg@ziepe.caReviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

a3eb13c1

mm/hmm: remove pgmap checking for devmap pages · 068354ad

Jason Gunthorpe authored Mar 27, 2020

The checking boils down to some racy check if the pagemap is still
available or not. Instead of checking this, rely entirely on the
notifiers, if a pagemap is destroyed then all pages that belong to it must
be removed from the tables and the notifiers triggered.

Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.caReviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

068354ad

26 Mar, 2020 17 commits

mm/hmm: check the device private page owner in hmm_range_fault() · 08ddddda

Christoph Hellwig authored Mar 16, 2020

hmm_range_fault() will succeed for any kind of device private memory, even
if it doesn't belong to the calling entity. While nouveau has some crude
checks for that, they are broken because they assume nouveau is the only
user of device private memory. Fix this by passing in an expected pgmap
owner in the hmm_range_fault structure.

If a device_private page is found and doesn't match the owner then it is
treated as an non-present and non-faultable page.

This prevents a bug in amdgpu, where it doesn't know how to handle
device_private pages, but hmm_range_fault would return them anyhow.

Fixes: 4ef589dc ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")
Link: https://lore.kernel.org/r/20200316193216.920734-5-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

08ddddda

mm: simplify device private page handling in hmm_range_fault · 17ffdc48

Christoph Hellwig authored Mar 16, 2020

Remove the HMM_PFN_DEVICE_PRIVATE flag, no driver has ever set this flag
on input, and the only place that uses it on output can be trivially
changed to use is_device_private_page().

This removes the ability to request that device_private pages are faulted
back into system memory.

Link: https://lore.kernel.org/r/20200316193216.920734-4-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

17ffdc48

mm: handle multiple owners of device private pages in migrate_vma · 800bb1c8

Christoph Hellwig authored Mar 16, 2020

Add a new src_owner field to struct migrate_vma. If the field is set,
only device private pages with page->pgmap->owner equal to that field are
migrated. If the field is not set only "normal" pages are migrated.

Fixes: df6ad698 ("mm/device-public-memory: device memory cache coherent with CPU")
Link: https://lore.kernel.org/r/20200316193216.920734-3-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

800bb1c8

memremap: add an owner field to struct dev_pagemap · f894ddd5

Christoph Hellwig authored Mar 16, 2020

Add a new opaque owner field to struct dev_pagemap, which will allow the
hmm and migrate_vma code to identify who owns ZONE_DEVICE memory, and
refuse to work on mappings not owned by the calling entity.

Link: https://lore.kernel.org/r/20200316193216.920734-2-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

f894ddd5

mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_ · 5a0c38d3

Christoph Hellwig authored Mar 16, 2020

There is no good reason for this split, as it just obsfucates the flow.

Link: https://lore.kernel.org/r/20200316135310.899364-6-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

5a0c38d3

mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_() · f8c888a3

Christoph Hellwig authored Mar 16, 2020

Setting a pfns entry to NONE before returning -EBUSY is a bug that will
cause corruption of the input flags on the next loop.

There is just a single caller using hmm_vma_walk_hole_() for the non-fault
case. Use hmm_pfns_fill() to fill the whole pfn array with zeroes in the
only caller for the non-fault case and remove the non-fault path from
hmm_vma_walk_hole_(). This avoids setting NONE before returning -EBUSY.

Also rename the function to hmm_vma_fault() to better describe what it
does.

Fixes: 2aee09d8 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Link: https://lore.kernel.org/r/20200316135310.899364-5-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

f8c888a3

mm/hmm: simplify hmm_vma_walk_hugetlb_entry() · 45050692

Christoph Hellwig authored Mar 16, 2020

Remove the rather confusing goto label and just handle the fault case
directly in the branch checking for it.

Link: https://lore.kernel.org/r/20200316135310.899364-4-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

45050692

mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flag · 96268163

Christoph Hellwig authored Mar 16, 2020

The HMM_FAULT_ALLOW_RETRY isn't used anywhere in the tree. Remove it and
the weird -EAGAIN handling where handle_mm_fault() drops the mmap_sem.

Link: https://lore.kernel.org/r/20200316135310.899364-3-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

96268163

mm/hmm: don't provide a stub for hmm_range_fault() · ddfaed17

Christoph Hellwig authored Mar 16, 2020

All callers of hmm_range_fault depend on CONFIG_HMM_MIRROR, so don't
bother with a stub.

Link: https://lore.kernel.org/r/20200316135310.899364-2-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

ddfaed17

mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd() · 24cee8ab

Jason Gunthorpe authored Mar 11, 2020

pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no
fault is requested then the pfns should be returned with the not valid
flags.

It should not unconditionally fault if faulting is not requested.

Fixes: 2aee09d8 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

24cee8ab

mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling · 40550627

Jason Gunthorpe authored Mar 05, 2020

Currently if a special PTE is encountered hmm_range_fault() immediately
returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
uses).

EFAULT should only be returned after testing with hmm_pte_need_fault().

Also pte_devmap() and pte_special() are exclusive, and there is no need to
check IS_ENABLED, pte_special() is stubbed out to return false on
unsupported architectures.

Fixes: 992de9a8 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

40550627

mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages · 2288a9a6

Jason Gunthorpe authored Mar 05, 2020

hmm_range_fault() should never return 0 if the caller requested a valid
page, but the pfns output for that page would be HMM_PFN_ERROR.

hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to
detect if the page is in faulting mode or not.

Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated
code.

Fixes: d08faca0 ("mm/hmm: properly handle migration pmd")
Fixes: da4c3c73 ("mm/hmm/mirror: helper to snapshot CPU page table")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

2288a9a6

mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte() · 76612d6c

Jason Gunthorpe authored Feb 28, 2020

The intention with this code is to determine if the caller required the
pages to be valid, and if so, then take some action to make them valid.
The action varies depending on the page type.

In all cases, if the caller doesn't ask for the page, then
hmm_range_fault() should not return an error.

Revise the implementation to be clearer, and fix some bugs:

 - hmm_pte_need_fault() must always be called before testing fault or
   write_fault otherwise the defaults of false apply and the if()'s don't
   work. This was missed on the is_migration_entry() branch

 - -EFAULT should not be returned unless hmm_pte_need_fault() indicates
   fault is required - ie snapshotting should not fail.

 - For !pte_present() the cpu_flags are always 0, except in the special
   case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is
   confusing.

Reorganize the flow so that it always follows the pattern of calling
hmm_pte_need_fault() and then checking fault || write_fault.

Fixes: 2aee09d8 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

76612d6c

mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULT · c2579c9c

Jason Gunthorpe authored Mar 05, 2020

All return paths that do EFAULT must call hmm_range_need_fault() to
determine if the user requires this page to be valid.

If the page cannot be made valid if the user later requires it, due to vma
flags in this case, then the return should be HMM_PFN_ERROR.

Fixes: a3e0d41c ("mm/hmm: improve driver API to work and wait over a range")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

c2579c9c

mm/hmm: add missing pfns set to hmm_vma_walk_pmd() · 7d082987

Jason Gunthorpe authored Mar 04, 2020

All success exit paths from the walker functions must set the pfns array.

A migration entry with no required fault is a HMM_PFN_NONE return, just
like the pte case.

Fixes: d08faca0 ("mm/hmm: properly handle migration pmd")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

7d082987

mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlock · 05fc1df9

Jason Gunthorpe authored Mar 02, 2020

This eventually calls into handle_mm_fault() which is a sleeping function.
Release the lock first.

hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not
need the lock.

Fixes: 3afc4236 ("mm: pagewalk: add p4d_entry() and pgd_entry()")
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

05fc1df9

mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte() · dfdc2207

Jason Gunthorpe authored Feb 28, 2020

Many of the direct returns of error skipped doing the pte_unmap(). All non
zero exit paths must unmap the pte.

The pte_unmap() is split unnaturally like this because some of the error
exit paths trigger a sleep and must release the lock before sleeping.

Fixes: 992de9a8 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Fixes: 53f5c3f4 ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

dfdc2207

24 Feb, 2020 1 commit
- Linux 5.6-rc3 · f8788d86
  Linus Torvalds authored Feb 23, 2020
  
  f8788d86
23 Feb, 2020 7 commits

Merge tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · d2eee258

Linus Torvalds authored Feb 23, 2020

Pull btrfs fixes from David Sterba:
 "These are fixes that were found during testing with help of error
  injection, plus some other stable material.

  There's a fixup to patch added to rc1 causing locking in wrong context
  warnings, tests found one more deadlock scenario. The patches are
  tagged for stable, two of them now in the queue but we'd like all
  three released at the same time.

  I'm not happy about fixes to fixes in such a fast succession during
  rcs, but I hope we found all the fallouts of commit 28553fa9
  ('Btrfs: fix race between shrinking truncate and fiemap')"

* tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
  Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents
  btrfs: fix bytes_may_use underflow in prealloc error condtition
  btrfs: handle logged extent failure properly
  btrfs: do not check delayed items are empty for single transaction cleanup
  btrfs: reset fs_root to NULL on error in open_ctree
  btrfs: destroy qgroup extent records on transaction abort

d2eee258

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · a3163ca0

Linus Torvalds authored Feb 23, 2020

Pull ext4 fixes from Ted Ts'o:
 "More miscellaneous ext4 bug fixes (all stable fodder)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix mount failure with quota configured as module
  jbd2: fix ocfs2 corrupt when clearing block group bits
  ext4: fix race between writepages and enabling EXT4_EXTENTS_FL
  ext4: rename s_journal_flag_rwsem to s_writepages_rwsem
  ext4: fix potential race between s_flex_groups online resizing and access
  ext4: fix potential race between s_group_info online resizing and access
  ext4: fix potential race between online resizing and write operations
  ext4: add cond_resched() to __ext4_find_entry()
  ext4: fix a data race in EXT4_I(inode)->i_disksize

a3163ca0

Merge tag 'csky-for-linus-5.6-rc3' of git://github.com/c-sky/csky-linux · c6188dff

Linus Torvalds authored Feb 23, 2020

Pull csky updates from Guo Ren:
 "Sorry, I missed 5.6-rc1 merge window, but in this pull request the
  most are the fixes and the rests are between fixes and features. The
  only outside modification is the MAINTAINERS file update with our
  mailing list.

   - cache flush implementation fixes

   - ftrace modify panic fix

   - CONFIG_SMP boot problem fix

   - fix pt_regs saving for atomic.S

   - fix fixaddr_init without highmem.

   - fix stack protector support

   - fix fake Tightly-Coupled Memory code compile and use

   - fix some typos and coding convention"

* tag 'csky-for-linus-5.6-rc3' of git://github.com/c-sky/csky-linux: (23 commits)
  csky: Replace <linux/clk-provider.h> by <linux/of_clk.h>
  csky: Implement copy_thread_tls
  csky: Add PCI support
  csky: Minimize defconfig to support buildroot config.fragment
  csky: Add setup_initrd check code
  csky: Cleanup old Kconfig options
  arch/csky: fix some Kconfig typos
  csky: Fixup compile warning for three unimplemented syscalls
  csky: Remove unused cache implementation
  csky: Fixup ftrace modify panic
  csky: Add flush_icache_mm to defer flush icache all
  csky: Optimize abiv2 copy_to_user_page with VM_EXEC
  csky: Enable defer flush_dcache_page for abiv2 cpus (807/810/860)
  csky: Remove unnecessary flush_icache_* implementation
  csky: Support icache flush without specific instructions
  csky/Kconfig: Add Kconfig.platforms to support some drivers
  csky/smp: Fixup boot failed when CONFIG_SMP
  csky: Set regs->usp to kernel sp, when the exception is from kernel
  csky/mm: Fixup export invalid_pte_table symbol
  csky: Separate fixaddr_init from highmem
  ...

c6188dff

csky: Replace <linux/clk-provider.h> by <linux/of_clk.h> · 99db590b

Geert Uytterhoeven authored Feb 12, 2020

The C-Sky platform code is not a clock provider, and just needs to call
of_clk_init().

Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>

99db590b

Merge tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · dca132a6

Linus Torvalds authored Feb 22, 2020

Pull RAS fixes from Thomas Gleixner:
 "Two fixes for the AMD MCE driver:

   - Populate the per CPU MCA bank descriptor pointer only after it has
     been completely set up to prevent a use-after-free in case that one
     of the subsequent initialization step fails

   - Implement a proper release function for the sysfs entries of MCA
     threshold controls instead of freeing the memory right in the CPU
     teardown code, which leads to another use-after-free when the
     associated sysfs file is opened and accessed"

* tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce/amd: Fix kobject lifetime
  x86/mce/amd: Publish the bank pointer only after setup has succeeded

dca132a6

Merge tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f3cc2494

Linus Torvalds authored Feb 22, 2020

Pull irq fixes from Thomas Gleixner:
 "Two fixes for the irq core code which are follow ups to the recent MSI
  fixes:

   - The WARN_ON which was put into the MSI setaffinity callback for
     paranoia reasons actually triggered via a callchain which escaped
     when all the possible ways to reach that code were analyzed.

     The proc/irq/$N/*affinity interfaces have a quirk which came in
     when ALPHA moved to the generic interface: In case that the written
     affinity mask does not contain any online CPU it calls into ALPHAs
     magic auto affinity setting code.

     A few years later this mechanism was also made available to x86 for
     no good reasons and in a way which circumvents all sanity checks
     for interrupts which cannot have their affinity set from process
     context on X86 due to the way the X86 interrupt delivery works.

     It would be possible to make this work properly, but there is no
     point in doing so. If the interrupt is not yet started then the
     affinity setting has no effect and if it is started already then it
     is already assigned to an online CPU so there is no point to
     randomly move it to some other CPU. Just return EINVAL as the code
     has done before that change forever.

   - The new MSI quirk bit in the irq domain flags turned out to be
     already occupied, which escaped the author and the reviewers
     because the already in use bits were 0,6,2,3,4,5 listed in that
     order.

     That bit 6 was simply overlooked because the ordering was straight
     forward linear otherwise. So the new bit ended up being a
     duplicate.

     Fix it up by switching the oddball 6 to the obvious 1"

* tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/irqdomain: Make sure all irq domain flags are distinct
  genirq/proc: Reject invalid affinity masks (again)

f3cc2494

Merge tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · fca10378

Linus Torvalds authored Feb 22, 2020

Pull x86 fixes from Thomas Gleixner:
 "Two fixes for x86:

   - Remove the __force_oder definiton from the kaslr boot code as it is
     already defined in the page table code which makes GCC 10 builds
     fail because it changed the default to -fno-common.

   - Address the AMD erratum 1054 concerning the IRPERF capability and
     enable the Instructions Retired fixed counter on machines which are
     not affected by the erratum"

* tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF
  x86/boot/compressed: Don't declare __force_order in kaslr_64.c

fca10378

22 Feb, 2020 12 commits

Merge tag 'zonefs-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs · 0a115e5f

Linus Torvalds authored Feb 22, 2020

Pull zonefs fix from Damien Le Moal:
 "A single patch fixing typos in the documentation file"

* tag 'zonefs-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: fix documentation typos etc.

0a115e5f

Merge tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block · b88025ea

Linus Torvalds authored Feb 22, 2020

Pull io_uring fixes from Jens Axboe:
 "Here's a small collection of fixes that were queued up:

   - Remove unnecessary NULL check (Dan)

   - Missing io_req_cancelled() call in fallocate (Pavel)

   - Put the cleanup check for aux data in the right spot (Pavel)

   - Two fixes for SQPOLL (Stefano, Xiaoguang)"

* tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
  io_uring: fix __io_iopoll_check deadlock in io_sq_thread
  io_uring: prevent sq_thread from spinning when it should stop
  io_uring: fix use-after-free by io_cleanup_req()
  io_uring: remove unnecessary NULL checks
  io_uring: add missing io_req_cancelled()

b88025ea

Merge tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block · f6c69b7f

Linus Torvalds authored Feb 22, 2020

Pull block fixes from Jens Axboe:
 "Just a set of NVMe fixes via Keith"

* tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
  nvme-multipath: Fix memory leak with ana_log_buf
  nvme: Fix uninitialized-variable warning
  nvme-pci: Use single IRQ vector for old Apple models
  nvme/pci: Add sleep quirk for Samsung and Toshiba drives

f6c69b7f

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · b98b809c

Linus Torvalds authored Feb 22, 2020

Pull SCSI fixes from James Bottomley:
 "Four non-core fixes.

  Two are reverts of target fixes which turned out to have unwanted side
  effects, one is a revert of an RDMA fix with the same problem and the
  final one fixes an incorrect warning about memory allocation failures
  in megaraid_sas (the driver actually reduces the allocation size until
  it succeeds)"
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: Revert "target: iscsi: Wait for all commands to finish before freeing a session"
  scsi: Revert "RDMA/isert: Fix a recently introduced regression related to logout"
  scsi: megaraid_sas: silence a warning
  scsi: Revert "target/core: Inline transport_lun_remove_cmd()"

b98b809c

Merge tag 'hwmon-for-v5.6-rc3' of... · 5b442b1a

Linus Torvalds authored Feb 22, 2020

Merge tag 'hwmon-for-v5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:

 - Fix crash in w83627ehf driver seen with W83627DHG-P

 - Fix lockdep splat in acpi_power_meter driver

 - Fix xdpe12284 documentation Sphinx warnings

* tag 'hwmon-for-v5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (w83627ehf) Fix crash seen with W83627DHG-P
  hwmon: (acpi_power_meter) Fix lockdep splat
  Documentation/hwmon: fix xdpe12284 Sphinx warnings

5b442b1a

Merge tag 'devicetree-fixes-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · fea63021

Linus Torvalds authored Feb 22, 2020

Pull devicetree fixes deom Rob Herring:
 "A handful of fixes in DT bindings for MDIO bus, Allwinner CSI, OMAP
  HSMMC, and Tegra124 EMC"

* tag 'devicetree-fixes-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: media: csi: Fix clocks description
  dt-bindings: media: csi: Add interconnects properties
  dt-bindings: net: mdio: remove compatible string from example
  dt-bindings: memory-controller: Update example for Tegra124 EMC
  dt-bindings: mmc: omap-hsmmc: Fix SDIO interrupt

fea63021

Merge tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 591dd4c1

Linus Torvalds authored Feb 22, 2020

Pull s390 fixes from Vasily Gorbik:

 - Remove ieee_emulation_warnings sysctl which is a dead code.

 - Avoid triggering rebuild of the kernel during make install.

 - Enable protected virtualization guest support in default configs.

 - Fix cio_ignore seq_file .next function to increase position index.
   And use kobj_to_dev instead of container_of in cio code.

 - Fix storage block address lists to contain absolute addresses in qdio
   code.

 - Few clang warnings and spelling fixes.

* tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/qdio: fill SBALEs with absolute addresses
  s390/qdio: fill SL with absolute addresses
  s390: remove obsolete ieee_emulation_warnings
  s390: make 'install' not depend on vmlinux
  s390/kaslr: Fix casts in get_random
  s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
  s390/pkey/zcrypt: spelling s/crytp/crypt/
  s390/cio: use kobj_to_dev() API
  s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
  s390/cio: cio_ignore_proc_seq_next should increase position index

591dd4c1

io_uring: fix __io_iopoll_check deadlock in io_sq_thread · c7849be9

Xiaoguang Wang authored Feb 22, 2020

Since commit a3a0e43f ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already events pending, we won't enter poll loop.
In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
been terminated and don't reap pending events which are already in cq
ring, and there are some reqs in poll_list, io_sq_thread will enter
__io_iopoll_check(), and find pending events, then return, this loop
will never have a chance to exit.

I have seen this issue in fio stress tests, to fix this issue, let
io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
and remove __io_iopoll_check().

Fixes: a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c7849be9

ext4: fix mount failure with quota configured as module · 9db176bc

Jan Kara authored Feb 21, 2020

When CONFIG_QFMT_V2 is configured as a module, the test in
ext4_feature_set_ok() fails and so mount of filesystems with quota or
project features fails. Fix the test to use IS_ENABLED macro which
works properly even for modules.

Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
Fixes: d65d87a0 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

9db176bc

jbd2: fix ocfs2 corrupt when clearing block group bits · 8eedabfd

wangyan authored Feb 20, 2020

I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
The running environment:
	kernel version: 4.19
	A cluster with two nodes, 5 luns mounted on two nodes, and do some
	file operations like dd/fallocate/truncate/rm on every lun with storage
	network disconnection.

The fallocate operation on dm-23-45 caused an null pointer dereference.

The information of NULL pointer dereference as follows:
	[577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
	[577992.878290] Aborting journal on device dm-23-45.
	...
	[577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
	[577992.890908] __journal_remove_journal_head: freeing b_committed_data
	[577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
	[577992.890918] __journal_remove_journal_head: freeing b_committed_data
	[577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
	[577992.890922] __journal_remove_journal_head: freeing b_committed_data
	[577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
	[577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
	[577992.890928] __journal_remove_journal_head: freeing b_committed_data
	[577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
	[577992.890933] __journal_remove_journal_head: freeing b_committed_data
	[577992.890939] __journal_remove_journal_head: freeing b_committed_data
	[577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
	[577992.890950] Mem abort info:
	[577992.890951]   ESR = 0x96000004
	[577992.890952]   Exception class = DABT (current EL), IL = 32 bits
	[577992.890952]   SET = 0, FnV = 0
	[577992.890953]   EA = 0, S1PTW = 0
	[577992.890954] Data abort info:
	[577992.890955]   ISV = 0, ISS = 0x00000004
	[577992.890956]   CM = 0, WnR = 0
	[577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
	[577992.890960] [0000000000000020] pgd=0000000000000000
	[577992.890964] Internal error: Oops: 96000004 [#1] SMP
	[577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
	[577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G        W  OE     4.19.36 #1
	[577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
	[577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
	[577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
	[577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
	[577992.891084] sp : ffff0000c8e2b810
	[577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
	[577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
	[577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
	[577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
	[577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
	[577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
	[577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
	[577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
	[577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
	[577992.891103] x11: 0000000000000038 x10: 0101010101010101
	[577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
	[577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
	[577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
	[577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
	[577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
	[577992.891116] Call trace:
	[577992.891139]  _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
	[577992.891162]  _ocfs2_free_clusters+0x100/0x290 [ocfs2]
	[577992.891185]  ocfs2_free_clusters+0x50/0x68 [ocfs2]
	[577992.891206]  ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
	[577992.891227]  ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
	[577992.891248]  ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
	[577992.891269]  ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
	[577992.891290]  __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
	[577992.891309]  ocfs2_fallocate+0xe4/0x128 [ocfs2]
	[577992.891316]  vfs_fallocate+0x11c/0x250
	[577992.891317]  ksys_fallocate+0x54/0x88
	[577992.891319]  __arm64_sys_fallocate+0x28/0x38
	[577992.891323]  el0_svc_common+0x78/0x130
	[577992.891325]  el0_svc_handler+0x38/0x78
	[577992.891327]  el0_svc+0x8/0xc

My analysis process as follows:
ocfs2_fallocate
  __ocfs2_change_file_space
    ocfs2_allocate_extents
      ocfs2_extend_allocation
        ocfs2_add_inode_data
          ocfs2_add_clusters_in_btree
            ocfs2_insert_extent
              ocfs2_do_insert_extent
                ocfs2_rotate_tree_right
                  ocfs2_extend_rotate_transaction
                    ocfs2_extend_trans
                      jbd2_journal_restart
                        jbd2__journal_restart
                          /* handle->h_transaction is NULL,
                           * is_handle_aborted(handle) is true
                           */
                          handle->h_transaction = NULL;
                          start_this_handle
                            return -EROFS;
            ocfs2_free_clusters
              _ocfs2_free_clusters
                _ocfs2_free_suballoc_bits
                  ocfs2_block_group_clear_bits
                    ocfs2_journal_access_gd
                      __ocfs2_journal_access
                        jbd2_journal_get_undo_access
                          /* I think jbd2_write_access_granted() will
                           * return true, because do_get_write_access()
                           * will return -EROFS.
                           */
                          if (jbd2_write_access_granted(...)) return 0;
                          do_get_write_access
                            /* handle->h_transaction is NULL, it will
                             * return -EROFS here, so do_get_write_access()
                             * was not called.
                             */
                            if (is_handle_aborted(handle)) return -EROFS;
                    /* bh2jh(group_bh) is NULL, caused NULL
                       pointer dereference */
                    undo_bg = (struct ocfs2_group_desc *)
                                bh2jh(group_bh)->b_committed_data;

If handle->h_transaction == NULL, then jbd2_write_access_granted()
does not really guarantee that journal_head will stay around,
not even speaking of its b_committed_data. The bh2jh(group_bh)
can be removed after ocfs2_journal_access_gd() and before call
"bh2jh(group_bh)->b_committed_data". So, we should move
is_handle_aborted() check from do_get_write_access() into
jbd2_journal_get_undo_access() and jbd2_journal_get_write_access()
before the call to jbd2_write_access_granted().

Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.comSigned-off-by: Yan Wang <wangyan122@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org

8eedabfd

ext4: fix race between writepages and enabling EXT4_EXTENTS_FL · cb85f4d2

Eric Biggers authored Feb 19, 2020

If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:

WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120

Here's a minimal reproducer (not 100% reliable) (root isn't required):

        while true; do
                sync
        done &
        while true; do
                rm -f file
                touch file
                chattr -e file
                echo X >> file
                chattr +e file
        done

The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle.  But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.

Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).

This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.

Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4 ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org

cb85f4d2

ext4: rename s_journal_flag_rwsem to s_writepages_rwsem · bbd55937

Eric Biggers authored Feb 19, 2020

In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.

Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.orgSigned-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org

bbd55937