1. 06 Dec, 2020 2 commits
  2. 03 Dec, 2020 1 commit
  3. 01 Dec, 2020 1 commit
  4. 27 Nov, 2020 1 commit
  5. 25 Nov, 2020 1 commit
    • x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb · 33fc379d
      Anand K Mistry authored
      When spectre_v2_user={seccomp,prctl},ibpb is specified on the command
      line, IBPB is force-enabled and STIBP is conditionally-enabled (or not
      available).
      
      However, since
      
        21998a35 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
      
      the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP}
      instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour.
      Because the issuing of IBPB relies on the switch_mm_*_ibpb static
      branches, the mitigations behave as expected.
      
      Since
      
        1978b3a5 ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
      
      this discrepancy caused the misreporting of IB speculation via prctl().
      
      On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb,
      prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL |
      PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIBP are
      always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB
      speculation mode, even though the flag is ignored.
      
      Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should
      also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not
      available.
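
To make the return values discussed above concrete, the following user-space sketch decodes a PR_GET_SPECULATION_CTRL result into flag names. The bit values are the ones defined in include/uapi/linux/prctl.h; the helper describe_spec_ctrl() is hypothetical and is not part of the patch.

```python
# Bit values as defined in include/uapi/linux/prctl.h.
PR_SPEC_PRCTL = 1 << 0
PR_SPEC_ENABLE = 1 << 1
PR_SPEC_DISABLE = 1 << 2
PR_SPEC_FORCE_DISABLE = 1 << 3

_FLAGS = [
    ("PR_SPEC_PRCTL", PR_SPEC_PRCTL),
    ("PR_SPEC_ENABLE", PR_SPEC_ENABLE),
    ("PR_SPEC_DISABLE", PR_SPEC_DISABLE),
    ("PR_SPEC_FORCE_DISABLE", PR_SPEC_FORCE_DISABLE),
]


def describe_spec_ctrl(value):
    """Return the names of the flags set in a PR_GET_SPECULATION_CTRL result."""
    return [name for name, bit in _FLAGS if value & bit]


# What the commit message says was reported before the fix ...
misreported = PR_SPEC_PRCTL | PR_SPEC_ENABLE
# ... and what it says should be reported when IBPB and STIBP are always on.
expected = PR_SPEC_DISABLE

print(describe_spec_ctrl(misreported))
print(describe_spec_ctrl(expected))
```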
      
       [ bp: Massage commit message. ]
      
      Fixes: 21998a35 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
      Fixes: 1978b3a5 ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
      Signed-off-by: Anand K Mistry <amistry@google.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
      33fc379d
  6. 24 Nov, 2020 2 commits
    • x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak · 75899924
      Xiaochen Shen authored
      On resource group creation via a mkdir an extra kernfs_node reference is
      obtained by kernfs_get() to ensure that the rdtgroup structure remains
      accessible for the rdtgroup_kn_unlock() calls where it is removed on
      deletion. Currently the extra kernfs_node reference count is only
      dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup
      structure is removed in a few other locations that lack the matching
      reference drop.
      
      In call paths of rmdir and umount, when a control group is removed,
      kernfs_remove() is called to remove the whole kernfs nodes tree of the
      control group (including the kernfs nodes trees of all child monitoring
      groups), and then rdtgroup structure is freed by kfree(). The rdtgroup
      structures of all child monitoring groups under the control group are
      freed by kfree() in free_all_child_rdtgrp().
      
      Before calling kfree() to free the rdtgroup structures, the kernfs node
      of the control group itself as well as the kernfs nodes of all child
      monitoring groups still hold the extra references, which are never
      dropped to 0, so the kernfs nodes are never freed. This leads to a
      reference count leak and a kernfs_node_cache memory leak.
      
      For example, reference count leak is observed in these two cases:
        (1) mount -t resctrl resctrl /sys/fs/resctrl
            mkdir /sys/fs/resctrl/c1
            mkdir /sys/fs/resctrl/c1/mon_groups/m1
            umount /sys/fs/resctrl
      
        (2) mkdir /sys/fs/resctrl/c1
            mkdir /sys/fs/resctrl/c1/mon_groups/m1
            rmdir /sys/fs/resctrl/c1
      
      The same reference count leak issue also exists in the error exit paths
      of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon().
      
      Fix this issue with the following changes, which make sure the extra
      kernfs_node reference on rdtgroup is dropped before freeing the rdtgroup
      structure.
        (1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up
        kernfs_put() and kfree().
      
        (2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup
        structure is about to be freed by kfree().
      
        (3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error
        exit paths of mkdir where an extra reference is taken by kernfs_get().
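
The leak and the fix described above can be sketched with a toy reference-count model. This is a user-space illustration only: KernfsNode, mkdir_group(), remove_leaky() and remove_fixed() are hypothetical names, and the model assumes that removing the tree drops exactly one reference.

```python
class KernfsNode:
    """Toy stand-in for a kernfs_node: only a reference count."""

    def __init__(self):
        self.refcount = 1  # reference held by the kernfs tree
        self.freed = False

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


def mkdir_group():
    """Create a group and take the extra reference taken on mkdir."""
    kn = KernfsNode()
    kn.get()
    return kn


def remove_leaky(kn):
    """Removal path without the matching drop of the extra reference."""
    kn.put()  # models the tree removal dropping the tree's reference


def remove_fixed(kn):
    """Removal path that also drops the extra reference before freeing."""
    kn.put()  # models the tree removal dropping the tree's reference
    kn.put()  # models the kernfs_put() wrapped by rdtgroup_remove()


leaky = mkdir_group()
remove_leaky(leaky)
fixed = mkdir_group()
remove_fixed(fixed)
print(leaky.refcount, leaky.freed)
print(fixed.refcount, fixed.freed)
```

In the model, the leaky path leaves one reference behind, so the node is never marked freed, which corresponds to the kernfs_node_cache growth described in the commit message.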
      
      Fixes: f3cbeaca ("x86/intel_rdt/cqm: Add rmdir support")
      Fixes: e02737d5 ("x86/intel_rdt: Add tasks files")
      Fixes: 60cf5e10 ("x86/intel_rdt: Add mkdir to resctrl file system")
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
      75899924
    • x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak · fd8d9db3
      Xiaochen Shen authored
      Willem reported growth of kernfs_node_cache entries in slabtop when
      repeatedly creating and removing resctrl subdirectories as well as when
      repeatedly mounting and unmounting the resctrl filesystem.
      
      On resource group (control as well as monitoring) creation via a mkdir
      an extra kernfs_node reference is obtained to ensure that the rdtgroup
      structure remains accessible for the rdtgroup_kn_unlock() calls where it
      is removed on deletion. The kernfs_node reference count is dropped by
      kernfs_put() in rdtgroup_kn_unlock().
      
      That explains the need for one kernfs_get()/kernfs_put() pair in
      resctrl, but there are more places where a kernfs_node reference is
      obtained without a corresponding release. These surplus references on
      kernfs nodes are never dropped to 0, so the kernfs nodes are never
      freed in the call paths of rmdir and umount. This leads to a
      reference count leak and a kernfs_node_cache memory leak.
      
      Remove the superfluous kernfs_get() calls and expand the existing
      comments surrounding the kernfs_get()/kernfs_put() pair that remains
      in use.
      
      Superfluous kernfs_get() calls are removed from two areas:
      
        (1) In call paths of mount and mkdir, when kernfs nodes for "info",
        "mon_groups" and "mon_data" directories and sub-directories are
        created, the reference count of newly created kernfs node is set to 1.
        But after kernfs_create_dir() returns, superfluous kernfs_get() are
        called to take an additional reference.
      
        (2) kernfs_get() calls in rmdir call paths.
      
      Fixes: 17eafd07 ("x86/intel_rdt: Split resource group removal in two")
      Fixes: 4af4a88e ("x86/intel_rdt/cqm: Add mount,umount support")
      Fixes: f3cbeaca ("x86/intel_rdt/cqm: Add rmdir support")
      Fixes: d89b7379 ("x86/intel_rdt/cqm: Add mon_data")
      Fixes: c7d9aac6 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
      Fixes: 5dc1d5c6 ("x86/intel_rdt: Simplify info and base file lists")
      Fixes: 60cf5e10 ("x86/intel_rdt: Add mkdir to resctrl file system")
      Fixes: 4e978d06 ("x86/intel_rdt: Add "info" files to resctrl file system")
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Tested-by: Willem de Bruijn <willemb@google.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
      fd8d9db3
  7. 22 Nov, 2020 20 commits
    • Linux 5.10-rc5 · 418baf2c
      Linus Torvalds authored
      418baf2c
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · d5530d82
      Linus Torvalds authored
      Pull HID fixes from Jiri Kosina:
      
       - Various functionality / regression fixes for Logitech devices from
         Hans de Goede
      
       - Fix for (recently added) GPIO support in mcp2221 driver from Lars
         Povlsen
      
       - Power management handling fix/quirk in i2c-hid driver for certain
         BIOSes that have a strange approach to power-cycle from Hans de Goede
      
       - a few device ID additions and device-specific quirks
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        HID: logitech-dj: Fix Dinovo Mini when paired with a MX5x00 receiver
        HID: logitech-dj: Fix an error in mse_bluetooth_descriptor
        HID: Add Logitech Dinovo Edge battery quirk
        HID: logitech-hidpp: Add HIDPP_CONSUMER_VENDOR_KEYS quirk for the Dinovo Edge
        HID: logitech-dj: Handle quad/bluetooth keyboards with a builtin trackpad
        HID: add HID_QUIRK_INCREMENT_USAGE_ON_DUPLICATE for Gamevice devices
        HID: mcp2221: Fix GPIO output handling
        HID: hid-sensor-hub: Fix issue with devices with no report ID
        HID: i2c-hid: Put ACPI enumerated devices in D3 on shutdown
        HID: add support for Sega Saturn
        HID: cypress: Support Varmilo Keyboards' media hotkeys
        HID: ite: Replace ABS_MISC 120/121 events with touchpad on/off keypresses
        HID: logitech-hidpp: Add PID for MX Anywhere 2
        HID: uclogic: Add ID for Trust Flex Design Tablet
      d5530d82
    • Merge tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f4b936f5
      Linus Torvalds authored
      Pull scheduler fixes from Thomas Gleixner:
       "A couple of scheduler fixes:
      
         - Make the conditional update of the overutilized state work
           correctly by caching the relevant flags state before overwriting
           them and checking them afterwards.
      
         - Fix a data race in the wakeup path which caused loadavg on ARM64
           platforms to become a random number generator.
      
         - Fix the ordering of the iowaiter accounting operations so it can't
           be decremented before it is incremented.
      
         - Fix a bug in the deadline scheduler vs. priority inheritance when a
           non-deadline task A has inherited the parameters of a deadline task
           B and then blocks on a non-deadline task C.
      
           The second inheritance step used the static deadline parameters of
           task A, which are usually 0, instead of further propagating task
           B's parameters. The zero initialized parameters trigger a bug in
           the deadline scheduler"
      
      * tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/deadline: Fix priority inheritance with multiple scheduling classes
        sched: Fix rq->nr_iowait ordering
        sched: Fix data-race in wakeup
        sched/fair: Fix overutilized update in enqueue_task_fair()
      f4b936f5
    • Merge tag 'perf-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 48da3305
      Linus Torvalds authored
      Pull perf fix from Thomas Gleixner:
       "A single fix for the x86 perf sysfs interfaces which used kobject
        attributes instead of device attributes and therefore made clang's
        control flow integrity checker upset"
      
      * tag 'perf-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86: fix sysfs type mismatches
      48da3305
    • Merge tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 855cf1ee
      Linus Torvalds authored
      Pull locking fix from Thomas Gleixner:
       "A single fix for lockdep which makes the recursion protection cover
        graph lock/unlock"
      
      * tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        lockdep: Put graph lock/unlock under lock_recursion protection
      855cf1ee
    • Merge tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 68d3fa23
      Linus Torvalds authored
      Pull EFI fixes from Borislav Petkov:
       "Forwarded EFI fixes from Ard Biesheuvel:
      
         - fix memory leak in efivarfs driver
      
         - fix HYP mode issue in 32-bit ARM version of the EFI stub when built
           in Thumb2 mode
      
         - avoid leaking EFI pgd pages on allocation failure"
      
      * tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/x86: Free efi_pgd with free_pages()
        efivarfs: fix memory leak in efivarfs_create()
        efi/arm: set HSCTLR Thumb2 bit correctly for HVC calls from HYP
      68d3fa23
    • Merge tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7d53be55
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
         same because the proper one is going through the IOMMU tree (Thomas
         Gleixner)
      
       - An Intel microcode loader fix to save the correct microcode patch to
         apply during resume (Chen Yu)
      
       - A fix to not access user memory of other processes when dumping
         opcode bytes (Thomas Gleixner)
      
      * tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account"
        x86/dumpstack: Do not try to access user space code of other tasks
        x86/microcode/intel: Check patch signature before saving microcode for early loading
        iommu/vt-d: Take CONFIG_PCI_ATS into account
      7d53be55
    • Merge branch 'akpm' (patches from Andrew) · 4a51c60a
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "8 patches.
      
        Subsystems affected by this patch series: mm (madvise, pagemap,
        readahead, memcg, userfaultfd), kbuild, and vfs"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: fix madvise WILLNEED performance problem
        libfs: fix error cast of negative value in simple_attr_write()
        mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
        mm: memcg/slab: fix root memcg vmstats
        mm: fix readahead_page_batch for retry entries
        mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports
        compiler-clang: remove version check for BPF Tracing
        mm/madvise: fix memory leak from process_madvise
      4a51c60a
    • Merge tag 'staging-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · d27637ec
      Linus Torvalds authored
      Pull staging and IIO fixes from Greg KH:
       "Here are some small Staging and IIO driver fixes for 5.10-rc5. They
        include:
      
         - IIO fixes for reported regressions and problems
      
         - new device ids for IIO drivers
      
         - new device id for rtl8723bs driver
      
         - staging ralink driver Kconfig dependency fix
      
         - staging mt7621-pci bus resource fix
      
        All of these have been in linux-next all week with no reported issues"
      
      * tag 'staging-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: accel: kxcjk1013: Add support for KIOX010A ACPI DSM for setting tablet-mode
        iio: accel: kxcjk1013: Replace is_smo8500_device with an acpi_type enum
        docs: ABI: testing: iio: stm32: remove re-introduced unsupported ABI
        iio: light: fix kconfig dependency bug for VCNL4035
        iio/adc: ingenic: Fix AUX/VBAT readings when touchscreen is used
        iio/adc: ingenic: Fix battery VREF for JZ4770 SoC
        staging: rtl8723bs: Add 024c:0627 to the list of SDIO device-ids
        staging: ralink-gdma: fix kconfig dependency bug for DMA_RALINK
        staging: mt7621-pci: avoid to request pci bus resources
        iio: imu: st_lsm6dsx: set 10ms as min shub slave timeout
        counter/ti-eqep: Fix regmap max_register
        iio: adc: stm32-adc: fix a regression when using dma and irq
        iio: adc: mediatek: fix unset field
        iio: cros_ec: Use default frequencies when EC returns invalid information
      d27637ec
    • Merge tag 'tty-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · de758035
      Linus Torvalds authored
      Pull tty fixes from Greg KH:
       "Here are some small tty/serial fixes for 5.10-rc5 that resolve some
        reported issues:
      
         - speakup crash when telling the kernel to use a device that isn't
           really there
      
         - imx serial driver fixes for reported problems
      
         - ar933x_uart driver fix for probe error handling path
      
        All have been in linux-next for a while with no reported issues"
      
      * tag 'tty-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: ar933x_uart: disable clk on error handling path in probe
        tty: serial: imx: keep console clocks always on
        speakup: Do not let the line discipline be used several times
        tty: serial: imx: fix potential deadlock
      de758035
    • Merge tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · a7f07fc1
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "A final set of miscellaneous bug fixes for ext4"
      
      * tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix bogus warning in ext4_update_dx_flag()
        jbd2: fix kernel-doc markups
        ext4: drop fast_commit from /proc/mounts
      a7f07fc1
    • afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      David Howells authored
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
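
One way to read this rule is sketched below as a small user-space model. The function commit_status() and its parameter names are hypothetical, and the model assumes the "expected value" is the data version recorded when the fetch was issued; the real check lives in the afs status-commit code and may consider more state.

```python
def commit_status(cached_dv, dv_before, fetched_dv, speculative):
    """Return the data version the inode should hold after a fetch completes.

    cached_dv  -- data version currently recorded on the inode
    dv_before  -- data version recorded when the fetch was issued
    fetched_dv -- data version carried by the returned status
    """
    if speculative and cached_dv != dv_before:
        # The inode changed while the bulk fetch was in flight, so the
        # speculative result is stale and is ignored.
        return cached_dv
    return fetched_dv


# The dmesg case above: the fetch started at version 8, a local
# modification committed version 9, then the stale status (8) arrived.
raced = commit_status(cached_dv=9, dv_before=8, fetched_dv=8, speculative=True)

# No race: the inode is still at the version seen when the fetch began.
quiet = commit_status(cached_dv=8, dv_before=8, fetched_dv=8, speculative=True)

print(raced, quiet)
```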
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • mm: fix madvise WILLNEED performance problem · 66383800
      Matthew Wilcox (Oracle) authored
      The calculation of the end page index was incorrect, leading to a
      regression of 70% when running stress-ng.
      
      With this fix, we instead see a performance improvement of 3%.
      
      Fixes: e6e88712 ("mm: optimise madvise WILLNEED")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66383800
    • libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Yicong Yang authored
      The attr->set() callback receives a u64 value, but simple_strtoll() is
      used for the conversion.  This leads to an erroneous cast if the user
      inputs a negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string from
      the user to an unsigned value.  The former will return '-EINVAL' if it
      gets a negative value, but the latter can't handle the situation
      correctly.  Make 'val' unsigned long long, which is what kstrtoull()
      takes; this eliminates the compile warning on non-64-bit architectures.
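
The effect of parsing as signed and then treating the result as a u64 can be shown with a user-space analogy. The two helpers below are hypothetical stand-ins, not the kernel functions: lenient_parse() mimics a signed parse followed by a cast to 64 unsigned bits, and strict_parse() mimics a parser that rejects negative input, as the message says kstrtoull() does.

```python
U64_MASK = (1 << 64) - 1


def lenient_parse(text):
    """Parse as a signed integer, then reinterpret the value as a u64."""
    return int(text, 10) & U64_MASK


def strict_parse(text):
    """Parse as an unsigned 64-bit integer and reject anything else."""
    text = text.strip()
    if text.startswith("-"):
        raise ValueError("negative value is not a valid u64")
    value = int(text, 10)
    if value > U64_MASK:
        raise ValueError("value does not fit in 64 bits")
    return value


# A user writing "-1" silently becomes the largest possible u64.
wrapped = lenient_parse("-1")
print(hex(wrapped))

# The strict parser refuses the same input.
try:
    strict_parse("-1")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```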
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488dac0c
    • mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() · bfe8cc1d
      Gerald Schaefer authored
      Alexander reported a syzkaller / KASAN finding on s390, see below for
      complete output.
      
      In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
      freed in some cases.  In the case of userfaultfd_missing(), this will
      happen after calling handle_userfault(), which might have released the
      mmap_lock.  Therefore, the following pte_free(vma->vm_mm, pgtable) will
      access an unstable vma->vm_mm, which could have been freed or re-used
      already.
      
      For all architectures other than s390 this has no negative
      impact, because pte_free() simply frees the page and ignores the
      passed-in mm.  The implementation for SPARC32 would also access
      mm->page_table_lock for pte_free(), but there is no THP support in
      SPARC32, so the buggy code path will not be used there.
      
      For s390, the mm->context.pgtable_list is being used to maintain the 2K
      pagetable fragments, and operating on an already freed or even re-used
      mm could result in various more or less subtle bugs due to list /
      pagetable corruption.
      
      Fix this by calling pte_free() before handle_userfault(), similar to how
      it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
      non-huge_zero_page case.
      
      Commit 6b251fc9 ("userfaultfd: call handle_userfault() for
      userfaultfd_missing() faults") actually introduced both the
      do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
      changes wrt calling handle_userfault(), but only in the latter case
      did it put the pte_free() before calling handle_userfault().
      
        BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
        Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334
      
        CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
        Hardware name: IBM 3906 M04 701 (KVM/Linux)
        Call Trace:
          do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
          create_huge_pmd mm/memory.c:4256 [inline]
          __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
          handle_mm_fault+0x288/0x748 mm/memory.c:4607
          do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
          do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
          pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
          copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
          raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
          _copy_from_user+0x48/0xa8 lib/usercopy.c:16
          copy_from_user include/linux/uaccess.h:192 [inline]
          __do_sys_sigaltstack kernel/signal.c:4064 [inline]
          __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Allocated by task 9334:
          slab_alloc_node mm/slub.c:2891 [inline]
          slab_alloc mm/slub.c:2899 [inline]
          kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
          vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
          __split_vma+0xba/0x560 mm/mmap.c:2742
          split_vma+0xca/0x108 mm/mmap.c:2800
          mlock_fixup+0x4ae/0x600 mm/mlock.c:550
          apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
          do_mlock+0x1aa/0x718 mm/mlock.c:711
          __do_sys_mlock2 mm/mlock.c:738 [inline]
          __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Freed by task 9333:
          slab_free mm/slub.c:3142 [inline]
          kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
          __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
          vma_merge+0x87e/0xce0 mm/mmap.c:1209
          userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
          __fput+0x22c/0x7a8 fs/file_table.c:281
          task_work_run+0x200/0x320 kernel/task_work.c:151
          tracehook_notify_resume include/linux/tracehook.h:188 [inline]
          do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
          system_call+0xe6/0x28c arch/s390/kernel/entry.S:416
      
        The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
        The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
        The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
        raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
        raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
        page dumped because: kasan: bad access detected
        page->mem_cgroup:0000000096959501
      
        Memory state around the buggy address:
         00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
        >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
         00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
         00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
      Fixes: 6b251fc9 ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfe8cc1d
    • mm: memcg/slab: fix root memcg vmstats · 8faeb1ff
      Muchun Song authored
      If we reparent the slab objects to the root memcg, when we free the slab
      object, we need to update the per-memcg vmstats to keep it correct for
      the root memcg.  Now this at least affects the vmstat of
      NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
      smaller than the PAGE_SIZE.
      
      David said:
       "I assume that without this fix that the root memcg's vmstat would
        always be inflated if we reparented"
      
      Fixes: ec9f0238 ("mm: workingset: fix vmstat counters for shadow nodes")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8faeb1ff
    • mm: fix readahead_page_batch for retry entries · 4349a83a
      Matthew Wilcox (Oracle) authored
      Both btrfs and fuse have reported faults caused by seeing a retry entry
      instead of the page they were looking for.  This was caused by a missing
      check in the iterator.
      
      As can be seen in the panic log below, accessing 0x402 causes a
      panic.  In xarray.h, 0x402 means RETRY_ENTRY.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000402
        CPU: 14 PID: 306003 Comm: as Not tainted 5.9.0-1-amd64 #1 Debian 5.9.1-1
        Hardware name: Lenovo ThinkSystem SR665/7D2VCTO1WW, BIOS D8E106Q-1.01 05/30/2020
        RIP: 0010:fuse_readahead+0x152/0x470 [fuse]
        Code: 41 8b 57 18 4c 8d 54 10 ff 4c 89 d6 48 8d 7c 24 10 e8 d2 e3 28 f9 48 85 c0 0f 84 fe 00 00 00 44 89 f2 49 89 04 d4 44 8d 72 01 <48> 8b 10 41 8b 4f 1c 48 c1 ea 10 83 e2 01 80 fa 01 19 d2 81 e2 01
        RSP: 0018:ffffad99ceaebc50 EFLAGS: 00010246
        RAX: 0000000000000402 RBX: 0000000000000001 RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff94c5af90bd98 RDI: ffffad99ceaebc60
        RBP: ffff94ddc1749a00 R08: 0000000000000402 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000100 R12: ffff94de6c429ce0
        R13: ffff94de6c4d3700 R14: 0000000000000001 R15: ffffad99ceaebd68
        FS:  00007f228c5c7040(0000) GS:ffff94de8ed80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000402 CR3: 0000001dbd9b4000 CR4: 0000000000350ee0
        Call Trace:
          read_pages+0x83/0x270
          page_cache_readahead_unbounded+0x197/0x230
          generic_file_buffered_read+0x57a/0xa20
          new_sync_read+0x112/0x1a0
          vfs_read+0xf8/0x180
          ksys_read+0x5f/0xe0
          do_syscall_64+0x33/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
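
The faulting address 0x402 can be derived by hand. The following is a minimal userspace sketch, assuming the encoding used by include/linux/xarray.h, where an internal entry is built as (v << 2) | 2 and the retry entry is internal entry number 256. All demo_ names are hypothetical and exist only for this illustration; this is not the kernel's iterator code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace mirror of the internal-entry encoding: the value is shifted
 * left by two bits and the low two bits are set to 10b.
 */
static uintptr_t demo_mk_internal(uintptr_t v)
{
	return (v << 2) | 2;
}

/* Internal entries are recognised by their low two bits being 10b. */
static int demo_is_internal(uintptr_t entry)
{
	return (entry & 3) == 2;
}

/* The retry entry is internal entry number 256: (256 << 2) | 2 == 0x402. */
#define DEMO_RETRY_ENTRY demo_mk_internal(256)

/*
 * Hypothetical helper: the kind of test an iterator needs before it
 * treats an entry as a page pointer and dereferences it.
 */
static int demo_is_retry(uintptr_t entry)
{
	return entry == DEMO_RETRY_ENTRY;
}
```

Dereferencing such an entry as if it were a struct page pointer reads from address 0x402, which matches the RAX and CR2 values in the log above.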
      
      Fixes: 042124cc ("mm: add new readahead_control API")
      Reported-by: David Sterba <dsterba@suse.com>
      Reported-by: Wonhyuk Yang <vvghjk1234@gmail.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201103142852.8543-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20201103124349.16722-1-vvghjk1234@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Dan Williams's avatar
      mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports · a927bd6b
      Dan Williams authored
      The core-mm has a default __weak implementation of phys_to_target_node()
      to mirror the weak definition of memory_add_physaddr_to_nid().  That
      symbol is exported for modules.  However, while the export in
      mm/memory_hotplug.c exported the symbol in the configuration cases of:
      
      	CONFIG_NUMA_KEEP_MEMINFO=y
      	CONFIG_MEMORY_HOTPLUG=y
      
      ...and:
      
      	CONFIG_NUMA_KEEP_MEMINFO=n
      	CONFIG_MEMORY_HOTPLUG=y
      
      ...it failed to export the symbol in the case of:
      
      	CONFIG_NUMA_KEEP_MEMINFO=y
      	CONFIG_MEMORY_HOTPLUG=n
      
      Not only is that broken, but Christoph points out that the kernel should
      not be exporting any __weak symbol, which means that the
      memory_add_physaddr_to_nid() example that phys_to_target_node() copied
      is broken too.
      
      Rework the definition of phys_to_target_node() and
      memory_add_physaddr_to_nid() to not require weak symbols.  Move to the
      common arch override design-pattern of an asm header defining a symbol
      to replace the default implementation.
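
The override pattern described above can be sketched in plain C. This is an illustration under stated assumptions, not the patch itself: the demo_ prefix, the DEMO_ARCH_HAS_OVERRIDE switch, and the fallback returning node 0 are all hypothetical. The idea is that an arch header declares its own function and defines a macro of the same name, and the generic header supplies a static inline fallback only when that macro is absent, so no __weak symbol is needed.

```c
#include <assert.h>

/*
 * Stand-in for an arch header (the role asm/sparsemem.h plays in the
 * text). An architecture with its own implementation declares it and
 * defines a macro of the same name to announce the override.
 */
#ifdef DEMO_ARCH_HAS_OVERRIDE
int demo_phys_to_target_node(unsigned long long start);
#define demo_phys_to_target_node demo_phys_to_target_node
#endif

/*
 * Stand-in for the generic header: provide the default only when no
 * arch header claimed the name. The fallback value is an assumption
 * made for this sketch.
 */
#ifndef demo_phys_to_target_node
static inline int demo_phys_to_target_node(unsigned long long start)
{
	(void)start;
	return 0;	/* everything maps to node 0 in the fallback */
}
#endif
```

Built without DEMO_ARCH_HAS_OVERRIDE, callers get the inline fallback; built with it, they link against the architecture's definition. Either way exactly one definition is visible.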
      
      The only common header that all memory_add_physaddr_to_nid() producing
      architectures implement is asm/sparsemem.h.  In fact, powerpc already
      defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
      Double-down on that observation and define phys_to_target_node() where
      necessary in asm/sparsemem.h.  An alternate consideration that was
      discarded was to put this override in asm/numa.h, but that entangles
      with the definition of MAX_NUMNODES relative to the inclusion of
      linux/nodemask.h, and requires powerpc to grow a new header.
      
      The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
      now that the symbol is properly exported / stubbed in all combinations
      of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
      
      [dan.j.williams@intel.com: v4]
        Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
      [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
        Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
      
      Fixes: a035b6bf ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Tested-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Nick Desaulniers's avatar
      compiler-clang: remove version check for BPF Tracing · bc2dc440
      Nick Desaulniers authored
      bpftrace parses the kernel headers and uses Clang under the hood.
      
      Remove the version check when __BPF_TRACING__ is defined (as bpftrace
      does) so that this tool can continue to parse kernel headers, even with
      older clang sources.
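
The shape of such a guarded version check can be sketched as follows. This is a hedged illustration only: DEMO_CLANG_VERSION and DEMO_BPF_TRACING are hypothetical stand-ins for the real version macro and for __BPF_TRACING__, and the version threshold is taken from the "clang 10.0.1" check named in the Fixes tag.

```c
#include <assert.h>

/* Hypothetical stand-ins; a real header derives the version from the
 * compiler's own predefined macros. */
#define DEMO_CLANG_VERSION 90000	/* pretend: clang 9.0.0 */
#define DEMO_BPF_TRACING 1		/* stands in for __BPF_TRACING__ */

#if DEMO_CLANG_VERSION < 100001
# ifndef DEMO_BPF_TRACING
#  error "Sorry, your version of Clang is too old"
# else
   /* Old compiler, but a tracing tool is parsing headers: let it through. */
#  define DEMO_VERSION_CHECK_SKIPPED 1
# endif
#else
# define DEMO_VERSION_CHECK_SKIPPED 0
#endif

/* Reports whether the build-time check was bypassed for this translation unit. */
static int demo_version_check_skipped(void)
{
	return DEMO_VERSION_CHECK_SKIPPED;
}
```

With DEMO_BPF_TRACING removed, the same file stops at the #error line, which is the behaviour an ordinary kernel build with an old compiler would still see.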
      
      Fixes: 1f7a44f6 ("compiler-clang: add build check for clang 10.0.1")
      Reported-by: Chen Yu <yu.chen.surf@gmail.com>
      Reported-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
      Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Nathan Chancellor <natechancellor@gmail.com>
      Acked-by: Miguel Ojeda <ojeda@kernel.org>
      Link: https://lkml.kernel.org/r/20201104191052.390657-1-ndesaulniers@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Eric Dumazet's avatar
      mm/madvise: fix memory leak from process_madvise · 450677dc
      Eric Dumazet authored
      The early return in process_madvise() will produce a memory leak.
      
      Fix it.
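
The commit message does not show the diff, so the following is only a generic userspace sketch of this class of bug and its usual remedy: once a buffer has been allocated, every error path has to pass through a single cleanup label instead of returning directly. All demo_ names are hypothetical and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdlib.h>

static int demo_live_allocs;	/* outstanding buffers, tracked for the demo */

static void *demo_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		demo_live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		demo_live_allocs--;
		free(p);
	}
}

/* Hypothetical lookup that can fail after the buffer already exists. */
static int demo_lookup_target(int id)
{
	return id >= 0 ? 0 : -1;
}

static int demo_process(int id)
{
	int ret;
	void *iov = demo_alloc(64);

	if (!iov)
		return -1;

	ret = demo_lookup_target(id);
	if (ret < 0)
		goto free_iov;	/* a bare "return ret;" here would leak iov */

	/* ... the real work would happen here ... */
	ret = 0;

free_iov:
	demo_free(iov);
	return ret;
}
```

Both the success path and the failure path leave demo_live_allocs at zero, which is the property an early return breaks.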
      
      Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 21 Nov, 2020 4 commits
  9. 20 Nov, 2020 8 commits