Commits · 47d4f3eeef5f7fd346640fa8b49a942b506d2659 · Kirill Smelkov / linux

17 Feb, 2022 13 commits

mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP · 47d4f3ee

Hugh Dickins authored Feb 14, 2022

4.8 commit 7751b2da ("vmscan: split file huge pages before paging
them out") inserted a split_huge_page_to_list() into shrink_page_list()
without considering the mlock case: no problem if the page has already
been marked as Mlocked (the !page_evictable check much higher up will
have skipped all this), but it has always been the case that races or
omissions in setting Mlocked can rely on page reclaim to detect this
and correct it before actually reclaiming - and that remains so, but
what a shame if a hugepage is needlessly split before discovering it.

It is surprising that page_check_references() returns PAGEREF_RECLAIM
when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
is where the condition is detected and corrected; and until now it could
not be done in page_referenced_one(), because that does not always have
the page locked. Now that mlock's requirement for page lock has gone,
copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
and let page_check_references() return PAGEREF_ACTIVATE in this case.

But page_referenced_one() may find a pte mapping one part of a hugepage:
what hold should a pte mapped in a VM_LOCKED area exert over the entire
huge page? That's debatable. The approach taken here is to treat that
pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
VM_LOCKED pmd mapping is found later in the walk, and lack of reference
permits, then PAGEREF_RECLAIM take it to attempted splitting as before.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

47d4f3ee

mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH) · 6d9df8a5

Hugh Dickins authored Feb 14, 2022

collapse_file() is using unmap_mapping_pages(1) on each small page found
mapped, unlike others (reclaim, migration, splitting, memory-failure) who
use try_to_unmap(). There are four advantages to try_to_unmap(): first,
its TTU_IGNORE_MLOCK option now avoids leaving mlocked page in pagevec;
second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
third, it breaks out early if page is not mapped everywhere it might be;
fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
save up all the TLB flushing until all of the pages have been unmapped.

Wild guess: perhaps it was originally written to use try_to_unmap(),
but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never
skips them, so fixed the issue. I did once hit that VM_BUG_ON_PAGE()
since making this change: we could pass TTU_SYNC here, but I think just
delete the check - the race is very rare, this is an ordinary small page
so we don't need to be so paranoid about mapcount surprises, and the
page_ref_freeze() just below already handles the case adequately.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

6d9df8a5

mm/munlock: page migration needs mlock pagevec drained · b7435507

Hugh Dickins authored Feb 14, 2022

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

b7435507

mm/munlock: mlock_page() munlock_page() batch by pagevec · 2fbb0c10

Hugh Dickins authored Feb 14, 2022

A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock. That is not an overhead we would allow on
normal pages, but I think acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on per-cpu pagevec
before acquiring the lruvec lock.

I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is subject to races and awkward:
0 or !!PageMlocked or 1? Was it wrong even in the implementation before
this commit, which just widens the window? I haven't gone back to think
it through. Maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.

Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS()
for that.

This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since they sit on pagevec waiting to be processed long after page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

2fbb0c10

mm/munlock: delete smp_mb() from __pagevec_lru_add_fn() · 2262ace6

Hugh Dickins authored Feb 14, 2022

My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
says that it can now be deleted; and that remains so when the next patch
is added.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

2262ace6

mm/migrate: __unmap_and_move() push good newpage to LRU · c3096e67

Hugh Dickins authored Feb 14, 2022

Compaction, NUMA page movement, THP collapse/split, and memory failure
do isolate unevictable pages from their "LRU", losing the record of
mlock_count in doing so (isolators are likely to use page->lru for their
own private lists, so mlock_count has to be presumed lost).

That's unfortunate, and we should put in some work to correct that: one
can imagine a function to build up the mlock_count again - but it would
require i_mmap_rwsem for read, so be careful where it's called.  Or
page_referenced_one() and try_to_unmap_one() might do that extra work.

But one place that can very easily be improved is page migration's
__unmap_and_move(): a small adjustment to where the successful new page
is put back on LRU, and its mlock_count (if any) is built back up by
remove_migration_ptes().
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

c3096e67

mm/munlock: mlock_pte_range() when mlocking or munlocking · 34b67923

Hugh Dickins authored Feb 14, 2022

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range(). This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() mlock it a second time. This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

34b67923

mm/munlock: maintain page->mlock_count while unevictable · 07ca7606

Hugh Dickins authored Feb 14, 2022

Previous patches have been preparatory: now implement page->mlock_count.
The ordering of the "Unevictable LRU" is of no significance, and there is
no point holding unevictable pages on a list: place page->mlock_count to
overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
which needs to be even so as not to satisfy PageTail - though 2 could be
added instead of 1 for each mlock, if that's ever an improvement).

But it's only safe to rely on or modify page->mlock_count while lruvec
lock is held and page is on unevictable "LRU" - we can save lots of edits
by continuing to pretend that there's an imaginary LRU here (there is an
unevictable count which still needs to be maintained, but not a list).

The mlock_count technique suffers from an unreliability much like with
page_mlock(): while someone else has the page off LRU, not much can
be done. As before, err on the safe side (behave as if mlock_count 0),
and let try_to_unlock_one() move the page to unevictable if reclaim finds
out later on - a few misplaced pages don't matter, what we want to avoid
is imbalancing reclaim by flooding evictable lists with unevictable pages.

I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
if we have taken lruvec lock to get the page off its present list, then
we save everyone trouble (and however many extra atomic ops) by putting
it on its destination list immediately.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

07ca7606

mm/munlock: replace clear_page_mlock() by final clearance · b109b870

Hugh Dickins authored Feb 14, 2022

Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
of the munlocking to clear_page_mlock(), since PageMlocked is typically
still set when mapcount has fallen to 0.  That is not what we want: we
want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
on the integrity of of the mlock/munlock protocol - small numbers are
not surprising, but big numbers mean the protocol is not working.

That could be easily fixed by placing munlock_vma_page() at the start of
page_remove_rmap(); but later in the series we shall want to batch the
munlocking, and that too would tend to leave PageMlocked still set at
the point when it is checked.

So delete clear_page_mlock() now: leave it instead to release_pages()
(and __page_cache_release()) to do this backstop clearing of Mlocked,
when page refcount has fallen to 0.  If a pinned page occasionally gets
counted as Mlocked and Unevictable until it is unpinned, that's okay.

A slightly regrettable side-effect of this change is that, since
release_pages() and __page_cache_release() may be called at interrupt
time, those places which update NR_MLOCK with interrupts enabled
had better use mod_zone_page_state() than __mod_zone_page_state()
(but holding the lruvec lock always has interrupts disabled).

This change, forcing Mlocked off when refcount 0 instead of earlier
when mapcount 0, is not fundamental: it can be reversed if performance
or something else is found to suffer; but this is the easiest way to
separate the stats - let's not complicate that without good reason.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

b109b870

mm/munlock: rmap call mlock_vma_page() munlock_vma_page() · cea86fe2

Hugh Dickins authored Feb 14, 2022

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

cea86fe2

mm/munlock: delete munlock_vma_pages_all(), allow oomreap · a213e5cf

Hugh Dickins authored Feb 14, 2022

munlock_vma_pages_range() will still be required, when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock count
will be maintained in much the same way as it will be maintained when
mapping in the pte. Which removes the need for munlock_vma_pages_all()
on mlocked vmas when munmapping or exiting: eliminating the catastrophic
contention on i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().

And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?
No fear of that now, so permit the OOM reaper on mlocked vmas.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

a213e5cf

mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE · b67bf49c

Hugh Dickins authored Feb 14, 2022

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to faultin page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right? I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before? Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area? I can see that being argued either
way, and have no reason to disagree with current behaviour.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

b67bf49c

mm/munlock: delete page_mlock() and all its works · ebcbc6ea

Hugh Dickins authored Feb 14, 2022

We have recommended some applications to mlock their userspace, but that
turns out to be counter-productive: when many processes mlock the same
file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
is needed for write, to remove any vma mapping that file from rmap's tree;
but hogged for read by those with mlocks calling page_mlock() (formerly
known as try_to_munlock()) on *each* page mapped from the file (the
purpose being to find out whether another process has the page mlocked,
so therefore it should not be unmlocked yet).

Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.

A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant. Let's try to reuse its page->lru.

But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder. Don't mind those old comments about THPs, they
date from before 4.5's refcounting rework: splitting is not a risk here.

Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().

After this and each following commit, the kernel builds, boots and runs;
but with deficiencies which may show up in testing of mlock and munlock.
The system calls succeed or fail as before, and mlock remains effective
in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
may be shown too low after mlock, grow, then stay too high after munlock:
with previously mlocked pages remaining unevictable for too long, until
finally unmapped and freed and counts corrected. Normal service will be
resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

ebcbc6ea

16 Feb, 2022 2 commits

Merge tag 'mmc-v5.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · f71077a4

Linus Torvalds authored Feb 16, 2022

Pull MMC fix from Ulf Hansson:
 "Fix recovery logic for multi block I/O reads (MMC_READ_MULTIPLE_BLOCK)"

* tag 'mmc-v5.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: block: fix read single on recovery logic

f71077a4

tty: n_tty: do not look ahead for EOL character past the end of the buffer · 35930307

Linus Torvalds authored Feb 15, 2022

Daniel Gibson reports that the n_tty code gets line termination wrong in
very specific cases:

 "If you feed a line with exactly 64 chars + terminating newline, and
  directly afterwards (without reading) another line into a pseudo
  terminal, the the first read() on the other side will return the 64
  char line *without* terminating newline, and the next read() will
  return the missing terminating newline AND the complete next line (if
  it fits in the buffer)"

and bisected the behavior to commit 3b830a9c ("tty: convert
tty_ldisc_ops 'read()' function to take a kernel pointer").

Now, digging deeper, it turns out that the behavior isn't exactly new:
what changed in commit 3b830a9c was that the tty line discipline
.read() function is now passed an intermediate kernel buffer rather than
the final user space buffer.

And that intermediate kernel buffer is 64 bytes in size - thus that
special case with exactly 64 bytes plus terminating newline.

The same problem did exist before, but historically the boundary was not
the 64-byte chunk, but the user-supplied buffer size, which is obviously
generally bigger (and potentially bigger than N_TTY_BUF_SIZE, which
would hide the issue entirely).

The reason is that the n_tty canon_copy_from_read_buf() code would look
ahead for the EOL character one byte further than it would actually
copy.  It would then decide that it had found the terminator, and unmark
it as an EOL character - which in turn explains why the next read
wouldn't then be terminated by it.

Now, the reason it did all this in the first place is related to some
historical and pretty obscure EOF behavior, see commit ac8f3bf8
("n_tty: Fix poll() after buffer-limited eof push read") and commit
40d5e090 ("n_tty: Fix EOF push handling").

And the reason for the EOL confusion is that we treat EOF as a special
EOL condition, with the EOL character being NUL (aka "__DISABLED_CHAR"
in the kernel sources).

So that EOF look-ahead also affects the normal EOL handling.

This patch just removes the look-ahead that causes problems, because EOL
is much more critical than the historical "EOF in the middle of a line
that coincides with the end of the buffer" handling ever was.

Now, it is possible that we should indeed re-introduce the "look at next
character to see if it's a EOF" behavior, but if so, that should be done
not at the kernel buffer chunk boundary in canon_copy_from_read_buf(),
but at a higher level, when we run out of the user buffer.

In particular, the place to do that would be at the top of
'n_tty_read()', where we check if it's a continuation of a previously
started read, and there is no more buffer space left, we could decide to
just eat the __DISABLED_CHAR at that point.

But that would be a separate patch, because I suspect nobody actually
cares, and I'd like to get a report about it before bothering.

Fixes: 3b830a9c ("tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer")
Fixes: ac8f3bf8 ("n_tty: Fix  poll() after buffer-limited eof push read")
Fixes: 40d5e090 ("n_tty: Fix EOF push handling")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215611Reported-and-tested-by: Daniel Gibson <metalcaedes@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

35930307

15 Feb, 2022 5 commits

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · c5d9ae26

Linus Torvalds authored Feb 15, 2022

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Read HW interrupt pending state from the HW

  x86:

   - Don't truncate the performance event mask on AMD

   - Fix Xen runstate updates to be atomic when preempting vCPU

   - Fix for AMD AVIC interrupt injection race

   - Several other AMD fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
  KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
  KVM: SVM: fix race between interrupt delivery and AVIC inhibition
  KVM: SVM: set IRR in svm_deliver_interrupt
  KVM: SVM: extract avic_ring_doorbell
  selftests: kvm: Remove absent target file
  KVM: arm64: vgic: Read HW interrupt pending state from the HW
  KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
  KVM: x86: SVM: move avic definitions from AMD's spec to svm.h
  KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it
  KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them
  KVM: x86: nSVM: expose clean bit support to the guest
  KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
  KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
  KVM: x86: nSVM: fix potential NULL derefernce on nested migration
  KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
  Revert "svm: Add warning message for AVIC IPI invalid target"

c5d9ae26

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · a254a9da

Linus Torvalds authored Feb 15, 2022

Pull HID fixes from Jiri Kosina:

 - memory leak fix for hid-elo driver (Dongliang Mu)

 - fix for hangs on newer AMD platforms with amd_sfh-driven hardware
   (Basavaraj Natikar )

 - locking fix in i2c-hid (Daniel Thompson)

 - a few device-ID specific quirks

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: amd_sfh: Add interrupt handler to process interrupts
  HID: amd_sfh: Add functionality to clear interrupts
  HID: amd_sfh: Disable the interrupt for all command
  HID: amd_sfh: Correct the structure field name
  HID: amd_sfh: Handle amd_sfh work buffer in PM ops
  HID:Add support for UGTABLET WP5540
  HID: amd_sfh: Add illuminance mask to limit ALS max value
  HID: amd_sfh: Increase sensor command timeout
  HID: i2c-hid: goodix: Fix a lockdep splat
  HID: elo: fix memory leak in elo_probe
  HID: apple: Set the tilde quirk flag on the Wellspring 5 and later

a254a9da

Merge tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 705d84a3

Linus Torvalds authored Feb 15, 2022

Pull btrfs fixes from David Sterba:

 - yield CPU more often when defragmenting a large file

 - skip defragmenting extents already under writeback

 - improve error message when send fails to write file data

 - get rid of warning when mounted with 'flushoncommit'

* tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: send: in case of IO error log it
  btrfs: get rid of warning on transaction commit when using flushoncommit
  btrfs: defrag: don't try to defrag extents which are under writeback
  btrfs: don't hold CPU for too long when defragging a file

705d84a3

Merge tag 'for-5.17/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 2572da44

Linus Torvalds authored Feb 15, 2022

Pull parisc architecture fixes from Helge Deller:

 - Fix miscompilations when function calls are made from inside a
   put_user() call

 - Drop __init from map_pages() declaration to avoid random boot crashes

 - Added #error messages if a 64-bit compiler was used to build a 32-bit
   kernel (and vice versa)

 - Fix out-of-bound data TLB miss faults in sba_iommu and ccio-dma
   drivers

 - Add ioread64_lo_hi() and iowrite64_lo_hi() functions to avoid kernel
   test robot errors

 - Fix link failure when 8250_gsc driver is built without CONFIG_IOSAPIC

* tag 'for-5.17/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  serial: parisc: GSC: fix build when IOSAPIC is not set
  parisc: Fix some apparent put_user() failures
  parisc: Show error if wrong 32/64-bit compiler is being used
  parisc: Add ioread64_lo_hi() and iowrite64_lo_hi()
  parisc: Fix sglist access in ccio-dma.c
  parisc: Fix data TLB miss in sba_unmap_sg
  parisc: Drop __init from map_pages declaration

2572da44

Merge tag 'hyperv-fixes-signed-20220215' of... · c24449b3

Linus Torvalds authored Feb 15, 2022

Merge tag 'hyperv-fixes-signed-20220215' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Rework use of DMA_BIT_MASK in vmbus to work around a clang bug
   (Michael Kelley)

 - Fix NUMA topology (Long Li)

 - Fix a memory leak in vmbus (Miaoqian Lin)

 - One minor clean-up patch (Cai Huoqing)

* tag 'hyperv-fixes-signed-20220215' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: utils: Make use of the helper macro LIST_HEAD()
  Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64)
  Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj
  PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology

c24449b3

14 Feb, 2022 10 commits

serial: parisc: GSC: fix build when IOSAPIC is not set · 6e879367

Randy Dunlap authored Feb 14, 2022

There is a build error when using a kernel .config file from
'kernel test robot' for a different build problem:

hppa64-linux-ld: drivers/tty/serial/8250/8250_gsc.o: in function `.LC3':
(.data.rel.ro+0x18): undefined reference to `iosapic_serial_irq'

when:
  CONFIG_GSC=y
  CONFIG_SERIO_GSCPS2=y
  CONFIG_SERIAL_8250_GSC=y
  CONFIG_PCI is not set
    and hence PCI_LBA is not set.
  IOSAPIC depends on PCI_LBA, so IOSAPIC is not set/enabled.

Make the use of iosapic_serial_irq() conditional to fix the build error.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: linux-parisc@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-serial@vger.kernel.org
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johan Hovold <johan@kernel.org>
Suggested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>

6e879367

Merge tag 'regulator-fix-v5.17-rc4' of... · d567f5db

Linus Torvalds authored Feb 14, 2022

Merge tag 'regulator-fix-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "One fix here, for initialisation of regulators that don't have an
  in_enabled() operation which would mainly impact cases where they
  aren't otherwise used during early setup for some reason"

* tag 'regulator-fix-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: core: fix false positive in regulator_late_cleanup()

d567f5db

HID: amd_sfh: Add interrupt handler to process interrupts · 7f016b35

Basavaraj Natikar authored Feb 08, 2022

On newer AMD platforms with SFH, it is observed that random interrupts
get generated on the SFH hardware and until this is cleared the firmware
sensor processing is stalled, resulting in no data been received to
driver side.

Add routines to handle these interrupts, so that firmware operations are
not stalled.
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

7f016b35

HID: amd_sfh: Add functionality to clear interrupts · fb75a379

Basavaraj Natikar authored Feb 08, 2022

Newer AMD platforms with SFH may generate interrupts on some events
which are unwarranted. Until this is cleared the actual MP2 data
processing maybe stalled in some cases.

Add a mechanism to clear the pending interrupts (if any) during the
driver initialization and sensor command operations.
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

fb75a379

HID: amd_sfh: Disable the interrupt for all command · b300667b

Basavaraj Natikar authored Feb 08, 2022

Sensor data is processed in polling mode. Hence disable the interrupt
for all sensor command.
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

b300667b

HID: amd_sfh: Correct the structure field name · aa0b724a

Basavaraj Natikar authored Feb 08, 2022

Misinterpreted intr_enable field name. Hence correct the structure
field name accordingly to reflect the functionality.

Fixes: f264481a ("HID: amd_sfh: Extend driver capabilities for multi-generation support")
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

aa0b724a

HID: amd_sfh: Handle amd_sfh work buffer in PM ops · 0cf74235

Basavaraj Natikar authored Feb 08, 2022

Since in the current amd_sfh design the sensor data is periodically
obtained in the form of poll data, during the suspend/resume cycle,
scheduling a delayed work adds no value.

So, cancel the work and restart back during the suspend/resume cycle
respectively.
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

0cf74235

KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW · 710c4765

Jim Mattson authored Feb 02, 2022

AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't mask off the high nybble when configuring a
RAW perf event.

Fixes: ca724305 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-2-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

710c4765

KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event · b8bfee85

Jim Mattson authored Feb 02, 2022

AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't drop the high nybble when setting up the
config field of a perf_event_attr structure for a call to
perf_event_create_kernel_counter().

Fixes: ca724305 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-1-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b8bfee85

parisc: Fix some apparent put_user() failures · dbd0b423

Helge Deller authored Feb 13, 2022

After commit 4b9d2a73 ("parisc: Switch user access functions
to signal errors in r29 instead of r8") bash suddenly started
to report those warnings after login:

-bash: cannot set terminal process group (-1): Bad file descriptor
-bash: no job control in this shell

It turned out, that a function call inside a put_user(), e.g.:
put_user(vt_do_kdgkbmode(console), (int __user *)arg);
clobbered the error register (r29) and thus the put_user() call itself
seem to have failed.

Rearrange the C-code to pre-calculate the intermediate value
and then do the put_user().
Additionally prefer the "+" constraint on pu_err and gu_err registers
to tell the compiler that those operands are both read and written by
the assembly instruction.
Reported-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Fixes: 4b9d2a73 ("parisc: Switch user access functions to signal errors in r29 instead of r8")
Signed-off-by: Helge Deller <deller@gmx.de>

dbd0b423

13 Feb, 2022 10 commits

parisc: Show error if wrong 32/64-bit compiler is being used · b160628e

Helge Deller authored Feb 13, 2022

It happens quite often that people use the wrong compiler to build the
kernel:

make ARCH=parisc   -> builds the 32-bit kernel
make ARCH=parisc64 -> builds the 64-bit kernel

This patch adds a sanity check which errors out with an instruction how
use the correct ARCH= option.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v5.15+

b160628e

Linux 5.17-rc4 · 754e0b0e
Linus Torvalds authored Feb 13, 2022

754e0b0e

Merge tag 'kbuild-fixes-v5.17-2' of... · e89d3a46

Linus Torvalds authored Feb 13, 2022

Merge tag 'kbuild-fixes-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix the truncated path issue for HAVE_GCC_PLUGINS test in Kconfig

 - Move -Wunsligned-access to W=1 builds to avoid sprinkling warnings
   for the latest Clang

 - Fix missing fclose() in Kconfig

 - Fix Kconfig to touch dep headers correctly when KCONFIG_AUTOCONFIG is
   overridden.

* tag 'kbuild-fixes-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: fix failing to generate auto.conf
  kconfig: fix missing fclose() on error paths
  Makefile.extrawarn: Move -Wunaligned-access to W=1
  kconfig: let 'shell' return enough output for deep path names

e89d3a46

Merge tag 'irq-urgent-2022-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c5d714aa

Linus Torvalds authored Feb 13, 2022

Pull irq fixes from Thomas Gleixner:
 "Interrupt chip driver fixes:

   - Don't install an hotplug notifier for GICV3-ITS on systems which do
     not need it to prevent a warning in the notifier about inconsistent
     state

   - Add the missing device tree matching for the T-HEAD PLIC variant so
     the related SoC is properly supported"

* tag 'irq-urgent-2022-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/sifive-plic: Add missing thead,c900-plic match string
  dt-bindings: update riscv plic compatible string
  irqchip/gic-v3-its: Skip HP notifier when no ITS is registered

c5d714aa

Merge tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 42964a18

Linus Torvalds authored Feb 13, 2022

Pull objtool fix from Borislav Petkov:
 "Fix a case where objtool would mistakenly warn about instructions
  being unreachable"

* tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm

42964a18

Merge tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6f357367

Linus Torvalds authored Feb 13, 2022

Pull scheduler fix from Borislav Petkov:
 "Fix a NULL-ptr dereference when recalculating a sched entity's weight"

* tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix fault in reweight_entity

6f357367

Merge tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f5e02656

Linus Torvalds authored Feb 13, 2022

Pull perf fix from Borislav Petkov:
 "Prevent cgroup event list corruption when switching events"

* tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix list corruption in perf_cgroup_switch()

f5e02656

Merge tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 808f0ab2

Linus Torvalds authored Feb 13, 2022

Pull x86 fix from Borislav Petkov:
 "Prevent softlockups when tearing down large SGX enclaves"

* tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Silence softlockup detection when releasing large enclaves

808f0ab2

Merge tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · e9c25787

Linus Torvalds authored Feb 13, 2022

Pull cifs fixes from Steve French:
 "Three small smb3 reconnect fixes and an error log clarification"

* tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: mark sessions for reconnection in helper function
  cifs: call helper functions for marking channels for reconnect
  cifs: call cifs_reconnect when a connection is marked
  [smb3] improve error message when mount options conflict with posix

e9c25787

Merge tag 'irqchip-fixes-5.17-2' of... · 1e34064b

Thomas Gleixner authored Feb 13, 2022

Merge tag 'irqchip-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent

Pull irqchip fixes from Marc Zyngier:

 - Don't register a hotplug notifier on GICv3 systems that advertise
   LPI support, but have no ITS to make use of it

 - Add missing DT matching for the thead,c900-plic variant of the
   SiFive PLIC

Link: https://lore.kernel.org/r/20220211110038.1179155-1-maz@kernel.org

1e34064b