Commits · 47b7ec1daa511cd82cb9c31e88bfdb664b031d2a · Kirill Smelkov / linux

08 Feb, 2021 1 commit

gfs2: Enable rgrplvb for sb_fs_format 1802 · 47b7ec1d

Andrew Price authored Feb 05, 2021

Turn on rgrplvb by default for sb_fs_format > 1801.

Mount options must still override this default, so a new args field is
added to differentiate between 'off' and 'not specified', and the new
default is applied only when the option is not specified.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

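The 'off' versus 'not specified' distinction in this commit can be illustrated with a small sketch. This is a simplified model under stated assumptions, not the gfs2 code: the struct `sketch_args`, its fields, and the function `effective_rgrplvb` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the mount-argument structure. */
struct sketch_args {
    bool rgrplvb;      /* value requested by the user; valid only if rgrplvb_set */
    bool rgrplvb_set;  /* true when the option was given explicitly (on or off) */
};

/* Return whether rgrplvb ends up enabled for a given on-disk format. */
static bool effective_rgrplvb(const struct sketch_args *args,
                              unsigned int sb_fs_format)
{
    if (args->rgrplvb_set)
        return args->rgrplvb;       /* an explicit mount option always wins */
    return sb_fs_format > 1801;     /* default applies only when unspecified */
}
```

With a single boolean, an explicit 'off' on a format 1802 file system could not be told apart from an absent option, which is why the extra field is needed.
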

05 Feb, 2021 2 commits

gfs2: Don't skip dlm unlock if glock has an lvb · 78178ca8

Bob Peterson authored Feb 05, 2021

Patch fb6791d1 was designed to allow gfs2 to unmount more quickly by
skipping the step where it tells dlm to unlock glocks in EX with lvbs.
This was done because when gfs2 unmounts a file system, it destroys the
dlm lockspace shortly after it destroys the glocks so it doesn't need to
unlock them all: the unlock is implied when the lockspace is destroyed
by dlm.

However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Function recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, it will still be
queued to the lkb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.

This patch changes the check in gdlm_put_lock so that it calls
dlm_unlock for all glocks that contain an lvb pointer.

Fixes: fb6791d1 ("GFS2: skip dlm_unlock calls in unmount")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

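The changed check can be modeled as a small predicate. This is a reduced sketch, not the body of gdlm_put_lock: `sketch_glock`, `lvbptr`, and `may_skip_dlm_unlock` are hypothetical names, and the real function consults more state than is shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a glock: only what the check needs. */
struct sketch_glock {
    void *lvbptr;  /* non-NULL when the glock carries a lock value block */
};

/*
 * Decide whether the unmount fast path may free the glock without
 * asking dlm to unlock it. Per the commit message, a glock that has
 * an lvb pointer must always go through dlm_unlock, because dlm
 * recovery may still read that lvb.
 */
static bool may_skip_dlm_unlock(bool unmount_skip_enabled,
                                const struct sketch_glock *gl)
{
    return unmount_skip_enabled && gl->lvbptr == NULL;
}
```
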

gfs2: Lock imbalance on error path in gfs2_recover_one · 834ec3e1

Andreas Gruenbacher authored Feb 05, 2021

In gfs2_recover_one, fix a sd_log_flush_lock imbalance when a recovery
pass fails.

Fixes: c9ebc4b7 ("gfs2: allow journal replay to hold sd_log_flush_lock")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


25 Jan, 2021 4 commits

gfs2: keep bios separate for each journal · 82218943

Bob Peterson authored Jan 21, 2021

The recovery function can recover multiple journals, but they were all using
the same bio. This resulted in a use-after-free related to sdp->sd_log_bio.
This patch moves the variable to the journal descriptor, jd, so that
every recovery can operate on its own bio. And hopefully we never run out.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


gfs2: fix glock confusion in function signal_our_withdraw · f5f02fde

Bob Peterson authored Jan 18, 2021

If go_free is defined, function signal_our_withdraw is supposed to
synchronize on the GLF_FREEING flag of the inode glock, but it
accidentally does that on the live glock. Fix that and disambiguate
the glock variables.

Fixes: 601ef0d5 ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


Revert "GFS2: Re-add a call to log_flush_wait when flushing the journal" · 4a011849

Bob Peterson authored Jan 20, 2021

This reverts commit 428fd95d.

Patch 428fd95d85b2 added a call to log_flush_wait to function
gfs2_log_flush. But gfs2_log_flush then calls log_write_header, which submits
a write request with the REQ_PREFLUSH flag, which also forces it to wait.
This patch removes the unnecessary call to log_flush_wait.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


gfs2: Fix invalid block size message · bff2e532

Andrew Price authored Jan 12, 2021

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


22 Jan, 2021 1 commit

gfs2: amend SLAB_RECLAIM_ACCOUNT on gfs2 related slab cache · 00e8e9bc

Zhaoyang Huang authored Jan 05, 2021

Since gfs2_quotad_cachep and gfs2_glock_cachep have registered
shrinkers, add SLAB_RECLAIM_ACCOUNT when creating them,
which improves slab accounting.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


31 Dec, 2020 1 commit

gfs2: make gfs2_log_write_page static · 2a6fe26c

Bob Peterson authored Dec 22, 2020

Function gfs2_log_write_page is only used in lops.c, so make it static.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


22 Dec, 2020 2 commits

gfs2: move freeze glock outside the make_fs_rw and _ro functions · 96b1454f

Bob Peterson authored Dec 22, 2020

Before this patch, sister functions gfs2_make_fs_rw and gfs2_make_fs_ro locked
(held) the freeze glock by calling gfs2_freeze_lock and gfs2_freeze_unlock.
The problem is, not all the callers of gfs2_make_fs_ro should be doing this.
The three callers of gfs2_make_fs_ro are: remount (gfs2_reconfigure),
signal_our_withdraw, and unmount (gfs2_put_super). But when unmounting the
file system we can get into the following circular lock dependency:

deactivate_super
   down_write(&s->s_umount); <-------------------------------------- s_umount
   deactivate_locked_super
      gfs2_kill_sb
         kill_block_super
            generic_shutdown_super
               gfs2_put_super
                  gfs2_make_fs_ro
                     gfs2_glock_nq_init sd_freeze_gl
                        freeze_go_sync
                           if (freeze glock in SH)
                              freeze_super (vfs)
                                 down_write(&sb->s_umount); <------- s_umount

This patch moves the hold of the freeze glock outside the two sister rw/ro
functions to their callers, but it doesn't request the glock from
gfs2_put_super, thus eliminating the circular dependency.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


gfs2: Add common helper for holding and releasing the freeze glock · c77b52c0

Bob Peterson authored Dec 22, 2020

Many places in the gfs2 code queued and dequeued the freeze glock.
Almost all of them acquire it in SHARED mode, and need to specify the
same LM_FLAG_NOEXP and GL_EXACT flags.

This patch adds common helper functions gfs2_freeze_lock and gfs2_freeze_unlock
to make the code more readable, and to prepare for the next patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


20 Dec, 2020 2 commits

Merge tag 'gfs2-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 7703f46f

Linus Torvalds authored Dec 20, 2020

Pull gfs2 updates from Andreas Gruenbacher:

 - Don't wait for unfreeze of the wrong filesystems

 - Remove an obsolete delete_work_func hack and an incorrect
   sb_start_write

 - Minor documentation updates and cosmetic care

* tag 'gfs2-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: in signal_our_withdraw wait for unfreeze of _this_ fs only
  gfs2: Remove sb_start_write from gfs2_statfs_sync
  gfs2: remove trailing semicolons from macro definitions
  Revert "GFS2: Prevent delete work from occurring on glocks used for create"
  gfs2: Make inode operations static
  MAINTAINERS: Add gfs2 bug tracker link
  Documentation: Update filesystems/gfs2.rst


epoll: fix compat syscall wire up of epoll_pwait2 · 450f68e2

Heiko Carstens authored Dec 20, 2020

Commit b0a0c261 ("epoll: wire up syscall epoll_pwait2") wired up
the 64 bit syscall instead of the compat variant in a couple of places.

Fixes: b0a0c261 ("epoll: wire up syscall epoll_pwait2")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


19 Dec, 2020 27 commits

Merge tag 'close-range-cloexec-unshare-v5.11' of... · 467f8165

Linus Torvalds authored Dec 19, 2020

Merge tag 'close-range-cloexec-unshare-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull close_range fix from Christian Brauner:
"syzbot reported a bug when asking close_range() to unshare the file
descriptor table and make all fds close-on-exec.

If CLOSE_RANGE_UNSHARE is set, the caller will receive a private file
descriptor table in case their file descriptor table is currently
shared, before the kernel operates on the requested file descriptor range.

For the case where the caller has requested all file descriptors to be
actually closed via e.g. close_range(3, ~0U, CLOSE_RANGE_UNSHARE) the
kernel knows that the caller does not need any of the file descriptors
anymore and will optimize the close operation by only copying all
files in the range from 0 to 3 and no others.

However, if the caller requested CLOSE_RANGE_CLOEXEC together with
CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
descriptors so the kernel needs to copy all of them and can't
optimize.

The original patch didn't account for this and thus could cause oopses
as evidenced by the syzbot report because it assumed that all fds had
been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case and
copying all fds if the two flags are specified together.

This should've been caught in the selftests but the original patch
didn't cover this case and I didn't catch it during review. So in
addition to the bugfix I'm also adding selftests. They will reliably
reproduce the bug on a non-fixed kernel and allow us to catch
regressions and verify correct behavior.

Note, the kernel selftest tree contained a bunch of changes that made
the original selftest fail to compile so there are small fixups in
here to make them compile without warnings"

* tag 'close-range-cloexec-unshare-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/core: add regression test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
selftests/core: add test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
selftests/core: handle missing syscall number for close_range
selftests/core: fix close_range_test build after XFAIL removal
close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC

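The copy decision described in this pull request can be sketched as a pure function. This is a simplified model, not the kernel implementation: `unshare_copy_limit`, the `cur_max` parameter, and the `SKETCH_` macros are hypothetical, and the flag values are local stand-ins assumed to mirror the UAPI definitions.

```c
#include <limits.h>

/* Local stand-ins for the flag values, so the sketch needs no kernel headers. */
#define SKETCH_CLOSE_RANGE_UNSHARE (1U << 1)
#define SKETCH_CLOSE_RANGE_CLOEXEC (1U << 2)
#define SKETCH_COPY_ALL UINT_MAX

/*
 * For a close_range(first, last, flags) call with the unshare flag set,
 * return the exclusive upper bound of the fds that the private table
 * must contain. cur_max (hypothetical) is the current size of the table.
 */
static unsigned int unshare_copy_limit(unsigned int first, unsigned int last,
                                       unsigned int cur_max, unsigned int flags)
{
    /* Everything from 'first' upward is closed: only lower fds survive. */
    if (!(flags & SKETCH_CLOSE_RANGE_CLOEXEC) && last >= cur_max)
        return first;
    /* Otherwise, including the close-on-exec case, all fds stay in use. */
    return SKETCH_COPY_ALL;
}
```

For the example from the message, `close_range(3, ~0U, CLOSE_RANGE_UNSHARE)`, the model returns 3, so only fds 0 through 2 are copied; adding the close-on-exec flag makes it return `SKETCH_COPY_ALL`.
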

Merge tag 'for-linus-5.11-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 3872f516

Linus Torvalds authored Dec 19, 2020

Pull more xen updates from Juergen Gross:
 "Some minor cleanup patches and a small series disentangling some Xen
  related Kconfig options"

* tag 'for-linus-5.11-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: Kconfig: remove X86_64 depends from XEN_512GB
  xen/manage: Fix fall-through warnings for Clang
  xen-blkfront: Fix fall-through warnings for Clang
  xen: remove trailing semicolon in macro definition
  xen: Kconfig: nest Xen guest options
  xen: Remove Xen PVH/PVHVM dependency on PCI
  x86/xen: Convert to DEFINE_SHOW_ATTRIBUTE


Merge branch 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux · 651283d5

Linus Torvalds authored Dec 19, 2020

Pull pcmcia updates from Dominik Brodowski:
 "Besides a few PCMCIA odd fixes, the NEC VRC4173 CARDU driver is
  removed, as it has not compiled in ages"

* 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
  pcmcia: omap: Fix error return code in omap_cf_probe()
  pcmcia: Remove NEC VRC4173 CARDU
  pcmcia: db1xxx_ss: remove unneeded semicolon
  pcmcia/electra_cf: Fix some return values in 'electra_cf_probe()' in case of error


Merge tag 'i3c/for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · 190daf19

Linus Torvalds authored Dec 19, 2020

Pull i3c updates from Boris Brezillon:

 - Add the HCI driver

 - Add a missing destroy_workqueue() in an error path

 - Flag Alexandre Belloni as the new maintainer

* tag 'i3c/for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
  i3c/master/mipi-i3c-hci: quiet maybe-unused variable warning
  i3c: Resign from my maintainer role
  i3c/master: Fix uninitialized variable next_addr
  i3c/master: introduce the mipi-i3c-hci driver
  dt-bindings: i3c: MIPI I3C Host Controller Interface
  i3c master: fix missing destroy_workqueue() on error in i3c_master_register


Merge tag 'for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply · 11c33652

Linus Torvalds authored Dec 19, 2020

Pull power supply and reset updates from Sebastian Reichel:
 "Battery/charger driver changes:

   - collie_battery, generic-adc-battery, s3c-adc-battery: convert to
     GPIO descriptors (incl ARM board files)

   - misc cleanup and fixes

  Reset drivers:

   - new poweroff driver for force disabling a regulator

   - use printk format symbol resolver

   - ocelot: add support for Luton and Jaguar2"

* tag 'for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (31 commits)
  power: supply: Fix a typo in warning message
  Documentation: DT: binding documentation for regulator-poweroff
  power: reset: new driver regulator-poweroff
  power: supply: ab8500: Use dev_err_probe() for IIO channels
  power: supply: ab8500_fg: Request all IRQs as threaded
  power: supply: ab8500_charger: Oneshot threaded IRQs
  power: supply: ab8500: Convert to dev_pm_ops
  power: supply: ab8500: Use local helper
  power: supply: wm831x_power: remove unneeded break
  power: supply: bq24735: Drop unused include
  power: supply: bq24190_charger: Drop unused include
  power: supply: generic-adc-battery: Use GPIO descriptors
  power: supply: collie_battery: Convert to GPIO descriptors
  power: supply: bq24190_charger: fix reference leak
  power: supply: s3c-adc-battery: Convert to GPIO descriptors
  power: reset: Use printk format symbol resolver
  power: supply: axp20x_usb_power: Use power efficient workqueue for debounce
  power: supply: axp20x_usb_power: fix typo
  power: supply: max8997-charger: Improve getting charger status
  power: supply: max8997-charger: Fix platform data retrieval
  ...


Merge tag 'hsi-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi · c2703b66

Linus Torvalds authored Dec 19, 2020

Pull HSI updates from Sebastian Reichel:
 "Misc cleanups"

* tag 'hsi-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi:
  HSI: core: fix a kernel-doc markup
  HSI: omap_ssi: Don't jump to free ID in ssi_add_controller()


Merge tag 'pwm/for-5.11-rc1' of... · d56154c7

Linus Torvalds authored Dec 19, 2020

Merge tag 'pwm/for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm updates from Thierry Reding:
 "This is a fairly big release cycle from the PWM framework's point of
  view.

  There's a large patchset here which converts drivers to use the new
  devm_platform_ioremap_resource() helper and a bunch of minor fixes to
  existing drivers. Some of the existing drivers also add support for
  more hardware, such as Atmel SAMA 5D2 and Mediatek MT8183.

  Finally there's a couple of new drivers for Intel Keem Bay and LGM
  SoCs as well as the DesignWare PWM controller"

* tag 'pwm/for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (66 commits)
  pwm: sun4i: Remove erroneous else branch
  pwm: sl28cpld: Set driver data before registering the PWM chip
  pwm: Remove unused function pwmchip_add_inversed()
  pwm: imx27: Fix overflow for bigger periods
  pwm: bcm2835: Support apply function for atomic configuration
  pwm: keembay: Fix build failure with -Os
  pwm: core: Use octal permission
  pwm: lpss: Make compilable with COMPILE_TEST
  pwm: Fix dependencies on HAS_IOMEM
  pwm: Use -EINVAL for unsupported polarity
  pwm: sti: Remove unnecessary blank line
  pwm: sti: Avoid conditional gotos
  pwm: Add PWM fan controller driver for LGM SoC
  Add DT bindings YAML schema for PWM fan controller of LGM SoC
  pwm: Add DesignWare PWM Controller Driver
  dt-bindings: pwm: mtk-disp: add MT8167 SoC binding
  pwm: mediatek: Add MT8183 SoC support
  pwm: mediatek: Always use bus clock
  dt-bindings: pwm: pwm-mediatek: Add documentation for MT8183 SoC
  pwm: Add PWM driver for Intel Keem Bay
  ...


Merge branch 'akpm' (patches from Andrew) · 1db98bcf

Linus Torvalds authored Dec 19, 2020

Merge still more updates from Andrew Morton:
 "18 patches.

  Subsystems affected by this patch series: mm (memcg and cleanups) and
  epoll"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/Kconfig: fix spelling mistake "whats" -> "what's"
  selftests/filesystems: expand epoll with epoll_pwait2
  epoll: wire up syscall epoll_pwait2
  epoll: add syscall epoll_pwait2
  epoll: convert internal api to timespec64
  epoll: eliminate unnecessary lock for zero timeout
  epoll: replace gotos with a proper loop
  epoll: pull all code between fetch_events and send_event into the loop
  epoll: simplify and optimize busy loop logic
  epoll: move eavail next to the list_empty_careful check
  epoll: pull fatal signal checks into ep_send_events()
  epoll: simplify signal handling
  epoll: check for events when removing a timed out thread from the wait queue
  mm/memcontrol:rewrite mem_cgroup_page_lruvec()
  mm, kvm: account kvm_vcpu_mmap to kmemcg
  mm/memcg: remove unused definitions
  mm/memcg: warning on !memcg after readahead page charged
  mm/memcg: bail early from swap accounting if memcg disabled


mm/Kconfig: fix spelling mistake "whats" -> "what's" · 01ab1ede

Colin Ian King authored Dec 18, 2020

There is a spelling mistake in the Kconfig help text. Fix it.

Link: https://lkml.kernel.org/r/20201217172717.58203-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


selftests/filesystems: expand epoll with epoll_pwait2 · e9ce39b5

Willem de Bruijn authored Dec 18, 2020

Code coverage for the epoll_pwait2 syscall.

epoll62: Repeat basic test epoll1, but exercising the new syscall.
epoll63: Pass a timespec and exercise the timeout wakeup path.

Link: https://lkml.kernel.org/r/20201121144401.3727659-5-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: wire up syscall epoll_pwait2 · b0a0c261

Willem de Bruijn authored Dec 18, 2020

Split off from the previous patch in the series that implements the syscall.

Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: add syscall epoll_pwait2 · 58169a52

Willem de Bruijn authored Dec 18, 2020

Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that
replaces int timeout with struct timespec.  It is equivalent otherwise.

    int epoll_pwait2(int fd, struct epoll_event *events,
                     int maxevents,
                     const struct timespec *timeout,
                     const sigset_t *sigset);

The underlying hrtimer is already programmed with nsec resolution.
pselect and ppoll also set nsec resolution timeout with timespec.

The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs
the same.

For timespec, only support this new interface on 2038 aware platforms
that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.

Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

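A user-space caller could reach the new syscall through syscall(2), roughly as sketched below. Several things here are assumptions rather than part of the commit: the wrapper name `sketch_epoll_pwait2`, the locally defined `sketch_kernel_timespec` (two 64-bit fields, assumed to match the kernel's 64-bit timespec layout on a 64-bit platform), the fallback syscall number 441 (which may not hold on every architecture), and the signal-set size of 8 bytes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_epoll_pwait2
#define SYS_epoll_pwait2 441  /* assumption: fallback for older headers */
#endif

/* Local definition to avoid depending on newer kernel headers. */
struct sketch_kernel_timespec {
    long long tv_sec;
    long long tv_nsec;
};

/*
 * Wait on epfd with a relative timeout given in seconds and nanoseconds.
 * Returns the number of ready events, 0 on timeout, or -1 with errno set
 * (for example ENOSYS on kernels that lack the syscall).
 */
static int sketch_epoll_pwait2(int epfd, struct epoll_event *events,
                               int maxevents, long long sec, long long nsec)
{
    struct sketch_kernel_timespec ts = { sec, nsec };

    /* NULL sigmask: the signal mask is left unchanged. */
    return (int)syscall(SYS_epoll_pwait2, epfd, events, maxevents,
                        &ts, NULL, (size_t)8);
}
```

For example, `sketch_epoll_pwait2(epfd, &ev, 1, 0, 500000)` waits up to 500 microseconds, a timeout that the millisecond-based epoll_wait cannot express.
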

epoll: convert internal api to timespec64 · 7cdf7c20

Willem de Bruijn authored Dec 18, 2020

Patch series "add epoll_pwait2 syscall", v4.

Enable nanosecond timeouts for epoll.

Analogous to pselect and ppoll, introduce an epoll_wait syscall
variant that takes a struct timespec instead of int timeout.

This patch (of 4):

Make epoll more consistent with select/poll: pass along the timeout as
timespec64 pointer.

In anticipation of additional changes affecting all three polling
mechanisms:

- add epoll_pwait2 syscall with timespec semantics,
  and share poll_select_set_timeout implementation.
- compute slack before conversion to absolute time,
  to save one ktime_get_ts64 call.

Link: https://lkml.kernel.org/r/20201121144401.3727659-1-willemdebruijn.kernel@gmail.com
Link: https://lkml.kernel.org/r/20201121144401.3727659-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

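Moving from an int millisecond timeout to a timespec can be sketched from the caller's side as follows. This is an illustrative user-space helper, not the kernel's internal conversion: `ms_to_timespec` is a hypothetical name, and the mapping of a negative timeout to a NULL pointer reflects the convention that a NULL timespec means "block indefinitely".

```c
#include <stddef.h>
#include <time.h>

/*
 * Convert an epoll_wait-style millisecond timeout into a relative
 * timespec. Returns NULL for negative values (block indefinitely),
 * otherwise fills *out and returns it.
 */
static struct timespec *ms_to_timespec(long ms, struct timespec *out)
{
    if (ms < 0)
        return NULL;
    out->tv_sec = ms / 1000;                /* whole seconds */
    out->tv_nsec = (ms % 1000) * 1000000L;  /* remainder in nanoseconds */
    return out;
}
```
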

epoll: eliminate unnecessary lock for zero timeout · e59d3c64

Soheil Hassas Yeganeh authored Dec 18, 2020

We call ep_events_available() under lock when timeout is 0, and then call
it without locks in the loop for the other cases.

Instead, call ep_events_available() without lock for all cases.  For
non-zero timeouts, we will recheck after adding the thread to the wait
queue.  For zero timeout cases, by definition, user is opportunistically
polling and will have to call epoll_wait again in the future.

Note that this lock was kept in c5a282e9 because the whole loop was
historically under lock.

This patch results in a 1% CPU/RPC reduction in RPC benchmarks.

Link: https://lkml.kernel.org/r/20201106231635.3528496-9-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: replace gotos with a proper loop · 00b27634

Soheil Hassas Yeganeh authored Dec 18, 2020

The existing loop is pointless, and the labels make it really hard to
follow the structure.

Replace that control structure with a simple loop that returns when there
are new events, there is a signal, or the thread has timed out.

Link: https://lkml.kernel.org/r/20201106231635.3528496-8-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: pull all code between fetch_events and send_event into the loop · e8c85328

Soheil Hassas Yeganeh authored Dec 18, 2020

This is a no-op change which simplifies the follow up patches.

Link: https://lkml.kernel.org/r/20201106231635.3528496-7-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: simplify and optimize busy loop logic · 1493c47f

Soheil Hassas Yeganeh authored Dec 18, 2020

ep_events_available() is called multiple times around the busy loop logic,
even though the logic is generally not used. ep_reset_busy_poll_napi_id()
is similarly always called, even when busy loop is not used.

Eliminate ep_reset_busy_poll_napi_id() and inline it inside
ep_busy_loop(). Make ep_busy_loop() return whether there are any events
available after the busy loop. This will eliminate unnecessary loads and
branches, and simplifies the loop.

Link: https://lkml.kernel.org/r/20201106231635.3528496-6-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: move eavail next to the list_empty_careful check · e411596d

Soheil Hassas Yeganeh authored Dec 18, 2020

This is a no-op change and simply to make the code more coherent.

Link: https://lkml.kernel.org/r/20201106231635.3528496-5-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: pull fatal signal checks into ep_send_events() · cccd29bf

Soheil Hassas Yeganeh authored Dec 18, 2020

To simplify the code, pull in checking the fatal signals into
ep_send_events().  ep_send_events() is called only from ep_poll().

Note that, previously, we always checked for fatal signals, but now they
are checked only if eavail is true.  This should be fine because the goal of
that check is to quickly return from epoll_wait() when there is a pending
fatal signal.

Link: https://lkml.kernel.org/r/20201106231635.3528496-4-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: simplify signal handling · 2efdaf76

Soheil Hassas Yeganeh authored Dec 18, 2020

Check signals before locking ep->lock, and immediately return -EINTR if
there is any signal pending.

This saves a few loads, stores, and branches from the hot path and
simplifies the loop structure for follow up patches.

Link: https://lkml.kernel.org/r/20201106231635.3528496-3-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


epoll: check for events when removing a timed out thread from the wait queue · 289caf5d

Soheil Hassas Yeganeh authored Dec 18, 2020

Patch series "simplify ep_poll".

This patch series is a followup based on the suggestions and feedback by
Linus:
https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

The first patch in the series is a fix for the epoll race in presence of
timeouts, so that it can be cleanly backported to all affected stable
kernels.

The rest of the patch series simplifies the ep_poll() implementation.  Some
of these simplifications result in minor performance enhancements as well.
We have kept these changes under self tests and internal benchmarks for a
few days, and there are minor (1-2%) performance enhancements as a result.

This patch (of 8):

After abc610e0 ("fs/epoll: avoid barrier after an epoll_wait(2)
timeout"), we break out of the ep_poll loop upon timeout, without checking
whether any new events are available.  Prior to that patch series we
always called ep_events_available() after exiting the loop.

This can cause races and missed wakeups.  For example, consider the
following scenario reported by Guantao Liu:

Suppose we have an eventfd added using EPOLLET to an epollfd.

Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
          event of the eventfd, it will write back on that fd.
Thread 3: Calls epoll_wait with a negative timeout.

Prior to abc610e0, it is guaranteed that Thread 3 will be woken up by either
Thread 1 or Thread 2.  After abc610e0, Thread 3 can be blocked
indefinitely if Thread 2 sees a timeout right before the write to the
eventfd by Thread 1.  Thread 2 will be woken up from
schedule_hrtimeout_range and, with eavail 0, it will not call
ep_send_events().

To fix this issue:
1) Simplify the timed_out case as suggested by Linus.
2) While holding the lock, recheck whether the thread was woken up
   after its timeout was reached.

Note that (2) is different from Linus' original suggestion: it does not set
"eavail = ep_events_available(ep)" to avoid unnecessary contention (when
there are too many timed-out threads and a small number of events), as
well as races mentioned in the discussion thread.

This is the first patch in the series so that the backport to stable
releases is straightforward.

Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com
Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com
Fixes: abc610e0 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Guantao Liu <guantaol@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


mm/memcontrol:rewrite mem_cgroup_page_lruvec() · 9a1ac228

Hui Su authored Dec 18, 2020

mem_cgroup_page_lruvec() in memcontrol.c and mem_cgroup_lruvec() in
memcontrol.h are very similar except for their parameters (page and memcg),
which can also be converted to each other.

So rewrite mem_cgroup_page_lruvec() with mem_cgroup_lruvec().
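
The shape of such a rewrite can be sketched with a user-space model.  All
names here (model_page, model_memcg, model_lruvec and the two helpers) are
hypothetical stand-ins, not the kernel definitions; the per-node argument
is left out, and the fallback to a root memcg is an assumption of the
model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct model_lruvec { int id; };
struct model_memcg  { struct model_lruvec lruvec; };
struct model_page   { struct model_memcg *memcg; };

/* Assumption of this model: a root memcg used when none is given. */
static struct model_memcg model_root_memcg = { { 0 } };

/* memcg -> lruvec: the single place where the lookup is implemented. */
static struct model_lruvec *model_memcg_lruvec(struct model_memcg *memcg)
{
	if (!memcg)
		memcg = &model_root_memcg;
	return &memcg->lruvec;
}

/*
 * page -> lruvec, written as a thin wrapper: convert the page to its
 * memcg, warn if there is none, then reuse the memcg-based helper.
 */
static struct model_lruvec *model_page_lruvec(struct model_page *page)
{
	assert(page);
	if (!page->memcg)
		fprintf(stderr, "warning: page without memcg\n");
	return model_memcg_lruvec(page->memcg);
}
```

With this layout the page-based helper carries no lookup logic of its
own, so the two entry points cannot drift apart.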

[alex.shi@linux.alibaba.com: add missed warning in mem_cgroup_lruvec]
  Link: https://lkml.kernel.org/r/94f17bb7-ec61-5b72-3555-fabeb5a4d73b@linux.alibaba.com
[lstoakes@gmail.com: warn on missing memcg on mem_cgroup_page_lruvec()]
  Link: https://lkml.kernel.org/r/20201125112202.387009-1-lstoakes@gmail.com

Link: https://lkml.kernel.org/r/20201108143731.GA74138@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9a1ac228

mm, kvm: account kvm_vcpu_mmap to kmemcg · 93bb59ca

Shakeel Butt authored Dec 18, 2020

A VCPU of a VM can allocate a couple of pages which can be mmap'ed by the
user space application. At the moment this memory is not charged to the
memcg of the VMM. On a large machine running a large number of VMs, or a
small number of VMs with a large number of VCPUs, this unaccounted
memory can be very significant. So, charge this memory to the memcg of
the VMM. Please note that the lifetime of these allocations corresponds to
the lifetime of the VMM.

Link: https://lkml.kernel.org/r/20201106202923.2087414-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

93bb59ca

mm/memcg: remove unused definitions · bec78efd

Wei Yang authored Dec 18, 2020

Some definitions are left unused; just clean them up.

Link: https://lkml.kernel.org/r/20201108003834.12669-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bec78efd

mm/memcg: warning on !memcg after readahead page charged · a4055888

Alex Shi authored Dec 18, 2020

Add VM_WARN_ON_ONCE_PAGE() macro.

Since readahead pages are charged to the memcg too, in theory we don't
have to check for this exception any more.  Before safely removing all of
these checks, add a warning for the unexpected !memcg.

Link: https://lkml.kernel.org/r/1604283436-18880-3-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a4055888

mm/memcg: bail early from swap accounting if memcg disabled · 76358ab5

Alex Shi authored Dec 18, 2020

Patch series "bail out early for memcg disable".

These 2 patches are independent of the per-memcg lru lock work and may
encounter unexpected warnings, so let's move them out of the per-memcg
lru locking patchset.

This patch (of 2):

We could bail out early when memcg wasn't enabled.

Link: https://lkml.kernel.org/r/1604283436-18880-1-git-send-email-alex.shi@linux.alibaba.com
Link: https://lkml.kernel.org/r/1604283436-18880-2-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

76358ab5

selftests/core: add regression test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC · 6abc20f8

Christian Brauner authored Dec 18, 2020

This test is a minimalized version of the reproducer given by syzbot
(cf. [1]).

After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
CLOSE_RANGE_CLOEXEC is specified in conjunction with
CLOSE_RANGE_UNSHARE. When CLOSE_RANGE_UNSHARE is specified the caller
will receive a private file descriptor table in case their file
descriptor table is currently shared.
For the case where the caller has requested all file descriptors to be
actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
the caller does not need any of the file descriptors anymore and will
optimize the close operation by only copying all files in the range from
0 to 3 and no others.

However, if the caller requested CLOSE_RANGE_CLOEXEC together with
CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
descriptors so the kernel needs to copy all of them and can't optimize.

The original patch didn't account for this and thus could cause oopses
as evidenced by the syzbot report. Add tests for this regression.

We first create a huge gap in the fd table. When we now call
CLOSE_RANGE_UNSHARE with a shared fd table and with ~0U as upper
bound the kernel will only copy up to fd1 file descriptors into the new
fd table. If the kernel is buggy and doesn't handle CLOSE_RANGE_CLOEXEC
correctly it will not have copied all file descriptors and we will oops!

This test passes on a fixed kernel and will trigger an oops on a buggy
kernel.
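
For illustration only, the flag combination and the condition checked
afterwards can be sketched in user space as follows.  This is not the
selftest itself: the helper names are made up for this sketch, the fd
table is not shared here (the selftest shares it with another task, which
is what exercises the copy path), and the fallback definitions are only
used when the system headers lack them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers. */
#ifndef __NR_close_range
#define __NR_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif
#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)
#endif

/* Thin wrapper (hypothetical name); older C libraries lack close_range(). */
static int sys_close_range(unsigned int first, unsigned int last,
			   unsigned int flags)
{
	return (int)syscall(__NR_close_range, first, last, flags);
}

/*
 * Open a low fd, duplicate it to gap_fd to leave a gap in the fd table,
 * then mark everything from the low fd upwards close-on-exec.
 *
 * Returns 1 if the high fd is still open and has FD_CLOEXEC set, 0 if it
 * does not, and -1 if the check could not be run (for example because the
 * kernel does not support these flags).  Side effect: every other fd at
 * or above the low fd is marked close-on-exec as well.
 */
static int cloexec_unshare_keeps_fd(int gap_fd)
{
	int fd, high, ret;

	assert(gap_fd >= 0);

	fd = open("/dev/null", O_RDONLY);
	if (fd < 0)
		return -1;

	high = dup2(fd, gap_fd);
	if (high < 0) {
		close(fd);
		return -1;
	}

	ret = sys_close_range((unsigned int)fd, ~0U,
			      CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC);
	if (ret < 0) {
		close(high);
		close(fd);
		return -1;
	}

	/* With CLOSE_RANGE_CLOEXEC the fds stay open; only the flag changes. */
	ret = fcntl(high, F_GETFD);
	close(high);
	close(fd);

	return ret >= 0 && (ret & FD_CLOEXEC);
}
```

Calling cloexec_unshare_keeps_fd(100) on a kernel that supports both
flags is expected to return 1; a return of -1 means the call was not
available and nothing was checked.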

[1]: https://syzkaller.appspot.com/text?tag=KernelConfig&x=db720fe37a6a41d8

Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Link: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20201218145415.801063-4-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

6abc20f8