1. 10 Dec, 2020 4 commits
  2. 09 Dec, 2020 10 commits
  3. 08 Dec, 2020 7 commits
  4. 06 Dec, 2020 19 commits
    • Linux 5.10-rc7 · 0477e928
      Linus Torvalds authored
      0477e928
    • Merge tag 'char-misc-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · ab91292c
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small driver fixes, and one "large" revert, for
        5.10-rc7.
      
        They include:
      
         - revert mei patch from 5.10-rc1 that was using a reserved userspace
           value. It will be resubmitted once the proper id has been assigned
           by the virtio people.
      
         - habanalabs fixes found by the fall-through audit from Gustavo
      
         - speakup driver fixes for reported issues
      
         - fpga config build fix for reported issue.
      
        All of these except the revert have been in linux-next with no
        reported issues. The revert is "clean" and just removes a
        previously-added driver, so no real issue there"
      
      * tag 'char-misc-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        Revert "mei: virtio: virtualization frontend driver"
        fpga: Specify HAS_IOMEM dependency for FPGA_DFL
        habanalabs: put devices before driver removal
        habanalabs: free host huge va_range if not used
        speakup: Reject setting the speakup line discipline outside of speakup
      ab91292c
    • Merge tag 'tty-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · d49248eb
      Linus Torvalds authored
      Pull tty fixes from Greg KH:
       "Here are two tty core fixes for 5.10-rc7.
      
        They resolve some reported locking issues in the tty core. While they
        have not been in a released linux-next yet, they have passed all of
        the 0-day bot testing as well as the submitter's testing"
      
      * tag 'tty-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: Fix ->session locking
        tty: Fix ->pgrp locking in tiocspgrp()
      d49248eb
    • Merge tag 'usb-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · f5226f1d
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes for 5.10-rc7 that resolve a number of
        reported issues, and add some new device ids.
      
        Nothing major here, but these solve some problems that people were
        having with the 5.10-rc tree:
      
         - reverts for USB storage dma settings that broke working devices
      
         - thunderbolt use-after-free fix
      
         - cdns3 driver fixes
      
         - gadget driver userspace copy fix
      
         - new device ids
      
        All of these except for the reverts have been in linux-next with no
        reported issues. The reverts are "clean" and were tested by Hans, as
        well as passing the 0-day tests"
      
      * tag 'usb-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: gadget: f_fs: Use local copy of descriptors for userspace copy
        usb: ohci-omap: Fix descriptor conversion
        Revert "usb-storage: fix sdev->host->dma_dev"
        Revert "uas: fix sdev->host->dma_dev"
        Revert "uas: bump hw_max_sectors to 2048 blocks for SS or faster drives"
        USB: serial: kl5kusb105: fix memleak on open
        USB: serial: ch341: sort device-id entries
        USB: serial: ch341: add new Product ID for CH341A
        USB: serial: option: fix Quectel BG96 matching
        usb: cdns3: core: fix goto label for error path
        usb: cdns3: gadget: clear trb->length as zero after preparing every trb
        usb: cdns3: Fix hardware based role switch
        USB: serial: option: add support for Thales Cinterion EXS82
        USB: serial: option: add Fibocom NL668 variants
        thunderbolt: Fix use-after-free in remove_unplugged_switch()
      f5226f1d
    • Merge tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8100a580
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A set of fixes for x86:
      
         - Make the AMD L3 QoS code and data prioritization enable/disable
           mechanism work correctly.
      
           The control bit was only set/cleared on one of the CPUs in a L3
           domain, but it has to be modified on all CPUs in the domain. The
           initial documentation was not clear about this, but the updated one
           from Oct 2020 spells it out.
      
         - Fix an off by one in the UV platform detection code which causes
           the UV hubs to be identified wrongly.
      
           The chip revisions start at 1 not at 0.
      
         - Fix a long standing bug in the evaluation of prefixes in the
           uprobes code which fails to handle repeated prefixes properly.
      
           The aggregate size of the prefixes can be larger than the bytes
           array but the code blindly iterated over the aggregate size beyond
           the array boundary. Add a macro to handle this case properly and
           use it at the affected places"
      
      * tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
        x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
        x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
        x86/platform/uv: Fix UV4 hub revision adjustment
        x86/resctrl: Fix AMD L3 QOS CDP enable/disable
      8100a580
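The prefix-iteration problem described in the x86 pull above can be illustrated with a small user-space sketch. The struct, the 4-slot array size, the zero-terminated convention and the name `count_prefix` are all assumptions made for this illustration and are not copied from the kernel sources. The point is only that the loop bound comes from the array capacity, not from the aggregate `nbytes` value, which may be larger.

```c
#include <assert.h>
#include <stddef.h>

#define PREFIX_SLOTS 4  /* assumed capacity of the bytes array in this sketch */

/* Hypothetical stand-in for a decoded-prefix record. */
struct prefix_field {
    unsigned char bytes[PREFIX_SLOTS]; /* stored prefix bytes, 0 = unused slot */
    unsigned char nbytes;              /* aggregate size, may exceed PREFIX_SLOTS */
};

/* Count stored prefix bytes equal to `wanted` without reading past the array. */
int count_prefix(const struct prefix_field *p, unsigned char wanted)
{
    int hits = 0;

    assert(p != NULL);
    /* Stop at the end of the array or at the first unused (zero) slot. */
    for (size_t i = 0; i < PREFIX_SLOTS && p->bytes[i] != 0; i++) {
        if (p->bytes[i] == wanted)
            hits++;
    }
    return hits;
}
```

A loop written as `for (i = 0; i < p->nbytes; i++)` would index beyond `bytes` whenever repeated prefixes push `nbytes` above the array capacity, which is the failure mode the pull message describes.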
    • Merge tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9f6b28d4
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "Two fixes for performance monitoring on X86:
      
         - Add recursion protection to another callchain invoked from
           x86_pmu_stop() which can recurse back into x86_pmu_stop(). The
           first attempt to fix this missed this extra code path.
      
         - Use the already filtered status variable to check for PEBS counter
           overflow bits and not the unfiltered full status read from
           IA32_PERF_GLOBAL_STATUS, which can have unrelated bits set that
           would be evaluated incorrectly"
      
      * tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Check PEBS status correctly
        perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS
      9f6b28d4
    • Merge tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 592d9a08
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "A set of updates for the interrupt subsystem:
      
         - Make multiqueue devices which use the managed interrupt affinity
           infrastructure work on PowerPC/Pseries. PowerPC does not use the
           generic infrastructure for setting up PCI/MSI interrupts and the
           multiqueue changes failed to update the legacy PCI/MSI
           infrastructure. Make this work by passing the affinity setup
           information down to the mapping and allocation functions.
      
         - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
           bouncing and he's not reachable. We hope all is well with him and
           say thanks for his work over the years"
      
      * tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        powerpc/pseries: Pass MSI affinity to irq_create_mapping()
        genirq/irqdomain: Add an irq_create_mapping_affinity() function
        MAINTAINERS: Move Jason Cooper to CREDITS
      592d9a08
    • Merge tag 'locking-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ff615c98
      Linus Torvalds authored
      Pull intel_idle build fix from Thomas Gleixner:
       "A tiny build fix for a recent change in the intel_idle driver which
        missed a CONFIG dependency and broke the build for certain
        configurations"
      
      * tag 'locking-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        intel_idle: Build fix
      ff615c98
    • Merge tag 'kbuild-fixes-v5.10-2' of... · e6585a49
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Move -Wcast-align to W=3, because it tends to produce false
         positives and there is no tree-wide solution.
      
       - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a
         preprocessor option and makes sense for .S files as well.
      
       - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.
      
       - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.
      
       - Fix undesirable line breaks in *.mod files.
      
      * tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: avoid split lines in .mod files
        kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1
        kbuild: Hoist '--orphan-handling' into Kconfig
        Kbuild: do not emit debug info for assembly with LLVM_IAS=1
        kbuild: use -fmacro-prefix-map for .S sources
        Makefile.extrawarn: move -Wcast-align to W=3
      e6585a49
    • Merge branch 'akpm' (patches from Andrew) · 12c0ab66
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "12 patches.
      
        Subsystems affected by this patch series: mm (memcg, zsmalloc, swap,
        mailmap, selftests, pagecache, hugetlb, pagemap), lib, and coredump"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/mmap.c: fix mmap return value when vma is merged after call_mmap()
        hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
        mm/filemap: add static for function __add_to_page_cache_locked
        userfaultfd: selftests: fix SIGSEGV if huge mmap fails
        tools/testing/selftests/vm: fix build error
        mailmap: add two more addresses of Uwe Kleine-König
        mm/swapfile: do not sleep with a spin lock held
        mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
        mm: list_lru: set shrinker map bit when child nr_items is not zero
        mm: memcg/slab: fix obj_cgroup_charge() return value handling
        coredump: fix core_pattern parse error
        zlib: export S390 symbols for zlib modules
      12c0ab66
    • mm/mmap.c: fix mmap return value when vma is merged after call_mmap() · 309d08d9
      Liu Zixian authored
      On success, mmap should return the start address of the newly mapped
      area, but patch "mm: mmap: merge vma after call_mmap() if possible"
      sets the return value addr to vm_start of the newly merged vma.  Users
      of mmap will get the wrong address if the vma is merged after
      call_mmap().  We fix this by moving the assignment to addr before the
      vma is merged.
      
      We have a driver which changes vm_flags, and this bug was found by our
      testcases.
      
      Fixes: d70cec89 ("mm: mmap: merge vma after call_mmap() if possible")
      Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hongxiang Lou <louhongxiang@huawei.com>
      Cc: Hu Shiyuan <hushiyuan@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      309d08d9
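The ordering issue in the mmap fix above can be sketched in user space. The `region` struct and the functions `merge_adjacent` and `map_region` are hypothetical names for this illustration, not kernel code. The sketch only shows why the return address has to be captured before a merge can change which region survives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mapped region; not the kernel's vm_area_struct. */
struct region {
    unsigned long start;
    unsigned long end;
};

/* Fold `next` into `prev` if they are adjacent; returns true on merge. */
bool merge_adjacent(struct region *prev, const struct region *next)
{
    assert(prev && next);
    if (prev->end != next->start)
        return false;
    prev->end = next->end;
    return true;
}

/*
 * Map `fresh` and return the address the caller should see.
 * The address is recorded before the merge, so a successful merge
 * (after which the surviving region starts at prev->start) cannot change it.
 */
unsigned long map_region(struct region *prev, const struct region *fresh)
{
    unsigned long addr = fresh->start;  /* capture first */

    merge_adjacent(prev, fresh);        /* may fold fresh into prev */
    return addr;
}
```

If `addr` were instead read from the surviving region after the merge, the caller would receive the start of the older, larger mapping rather than the start of the area it just mapped.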
    • hugetlb_cgroup: fix offline of hugetlb cgroup with reservations · 7a5bde37
      Mike Kravetz authored
      Adrian Moreno was running a Kubernetes 1.19 + containerd/docker workload
      using hugetlbfs.  In this environment the issue is reproduced by:
      
       - Start a simple pod that uses the recently added HugePages medium
         feature (pod yaml attached)
      
       - Start a DPDK app. It doesn't need to run successfully (as in transfer
         packets) nor interact with real hardware. It seems just initializing
         the EAL layer (which handles hugepage reservation and locking) is
         enough to trigger the issue
      
       - Delete the Pod (or let it "Complete").
      
      This would result in a kworker thread going into a tight loop (top output):
      
         1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy
      
      'perf top -g' reports:
      
        -   63.28%     0.01%  [kernel]                    [k] worker_thread
           - 49.97% worker_thread
              - 52.64% process_one_work
                 - 62.08% css_killed_work_fn
                    - hugetlb_cgroup_css_offline
                         41.52% _raw_spin_lock
                       - 2.82% _cond_resched
                            rcu_all_qs
                         2.66% PageHuge
              - 0.57% schedule
                 - 0.57% __schedule
      
      We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
      Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
      infinitely spinning.  Little else can be done on the system as the
      cgroup_mutex can not be acquired.
      
      Do note that the issue can be reproduced by simply offlining a hugetlb
      cgroup containing pages with reservation counts.
      
      The loop in hugetlb_cgroup_css_offline is moving page counts from the
      cgroup being offlined to the parent cgroup.  This is done for each
      hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
      The routine moving counts (hugetlb_cgroup_move_parent) is only moving
      'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
      both 'usage' and 'reservation' counts.  What to do with reservation
      counts when reparenting was discussed here:
      
      https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
      
      The decision was made to leave a zombie cgroup behind for cgroups with
      reservation counts.  Unfortunately, the code checking reservation
      counts was incorrectly added to hugetlb_cgroup_have_usage.
      
      To fix the issue, simply remove the check for reservation counts.  While
      fixing this issue, a related bug in hugetlb_cgroup_css_offline was
      noticed.  The hstate index is not reinitialized each time through the
      do-while loop.  Fix this as well.
      
      Fixes: 1adc4d41 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
      Reported-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Adrian Moreno <amorenoz@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a5bde37
    • mm/filemap: add static for function __add_to_page_cache_locked · 3351b16a
      Alex Shi authored
        mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3351b16a
    • userfaultfd: selftests: fix SIGSEGV if huge mmap fails · 573a2593
      Axel Rasmussen authored
      The error handling in hugetlb_allocate_area() was incorrect for the
      hugetlb_shared test case.
      
      Previously the behavior was:
      
      - mmap a hugetlb area
        - If this fails, set the pointer to NULL, and carry on
      - mmap an alias of the same hugetlb fd
        - If this fails, munmap the original area
      
      If the original mmap failed, it's likely the second one did too.  If
      both failed, we'd blindly try to munmap a NULL pointer, causing a
      SIGSEGV.  Instead, "goto fail" so we return before trying to mmap the
      alias.
      
      This issue can be hit "in real life" by forgetting to set
      /proc/sys/vm/nr_hugepages (leaving it at 0), and then trying to run the
      hugetlb_shared test.
      
      Another small improvement is, when the original mmap fails, don't just
      print "it failed": perror(), so we can see *why*.  :)
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Alan Gilbert <dgilbert@redhat.com>
      Link: https://lkml.kernel.org/r/20201204203443.2714693-1-axelrasmussen@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      573a2593
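The error-handling shape described in the selftest fix above ("goto fail" before attempting the alias mapping) can be sketched as follows. The callback type, the `areas` struct, and the helper mappers are hypothetical and exist only to make the control flow testable. The real test maps hugetlb memory, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapper callback: returns a region or NULL on failure. */
typedef void *(*map_fn)(void);

struct areas {
    void *area;
    void *alias;
};

/* Helper mappers for illustration only. */
static char backing[2];
int alias_calls;
void *map_ok(void)             { return &backing[0]; }
void *map_fail(void)           { return NULL; }
void *map_alias_counting(void) { alias_calls++; return &backing[1]; }

/* Returns 0 on success, -1 on failure; never operates on a NULL mapping. */
int allocate_areas(struct areas *out, map_fn map_area, map_fn map_alias)
{
    assert(out && map_area && map_alias);
    out->area = NULL;
    out->alias = NULL;

    out->area = map_area();
    if (out->area == NULL)
        goto fail;          /* return before trying to map the alias */

    out->alias = map_alias();
    if (out->alias == NULL) {
        out->area = NULL;   /* stands in for unmapping the first area */
        goto fail;
    }
    return 0;

fail:
    return -1;
}
```

With the early exit, a failed first mapping never reaches the alias step, so there is no path on which cleanup runs against a NULL pointer.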
    • tools/testing/selftests/vm: fix build error · d8cbe8bf
      Xingxing Su authored
      Only x86 and PowerPC implement pkey-xxx.h, so an error was reported
      when compiling protection_keys.c on other architectures.
      
      Add an architecture check to the Makefile so that "protection_keys"
      is compiled only where it is supported.
      
      If another architecture implements this, add its name to the Makefile,
      e.g.:
          ifneq (,$(findstring $(ARCH),powerpc mips ... ))
      
      The build errors were:
      
          pkey-helpers.h:93:2: error: #error Architecture not supported
           #error Architecture not supported
          pkey-helpers.h:96:20: error: `PKEY_DISABLE_ACCESS' undeclared
           #define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
                              ^
          protection_keys.c:218:45: error: `PKEY_DISABLE_WRITE' undeclared
           pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
                                                      ^
      Signed-off-by: Xingxing Su <suxingxing@loongson.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Link: https://lkml.kernel.org/r/1606826876-30656-1-git-send-email-suxingxing@loongson.cn
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8cbe8bf
    • mailmap: add two more addresses of Uwe Kleine-König · 4e60340c
      Uwe Kleine-König authored
      This fixes attribution for the commits (among others)
      
       - d4097456 ("video/framebuffer: move the probe func into
         .devinit.text in Blackfin LCD driver")
      
       - 0312e024 ("mfd: mc13xxx: Add support for mc34708")
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20201127213358.3440830-1-u.kleine-koenig@pengutronix.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e60340c
    • mm/swapfile: do not sleep with a spin lock held · b11a76b3
      Qian Cai authored
      We can't call kvfree() with a spin lock held, so defer it.  Fixes a
      might_sleep() runtime warning.
      
      Fixes: 873d7bcf ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
      Signed-off-by: Qian Cai <qcai@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b11a76b3
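The "defer the free until after unlock" pattern named in the swapfile fix above can be shown with a single-threaded toy. The `toy_lock` flag, `sleeping_free`, and `replace_buffer` are hypothetical stand-ins invented for this sketch. A real spin lock and kvfree() are not modelled, only the ordering constraint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy lock: a flag is enough to show the ordering in a single thread. */
struct toy_lock { bool held; };

void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

/* Stand-in for a free routine that may sleep: must not run under the lock. */
void sleeping_free(const struct toy_lock *l, void *p)
{
    assert(!l->held);
    free(p);
}

/* Replace *slot with `fresh`; the old buffer is freed only after unlocking. */
void replace_buffer(struct toy_lock *l, void **slot, void *fresh)
{
    void *defer = NULL;

    toy_lock_acquire(l);
    defer = *slot;          /* remember the old buffer, do not free it yet */
    *slot = fresh;
    toy_lock_release(l);

    sleeping_free(l, defer);    /* safe: the lock is no longer held */
}
```

Calling `sleeping_free` between acquire and release would trip the assertion, which plays the role of the might_sleep() warning mentioned in the commit message.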
    • mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING · e91d8d78
      Minchan Kim authored
      While I was doing zram testing, I found that decompression sometimes
      failed because the compression buffer was corrupted.  On
      investigation, I found that the commit below calls cond_resched()
      unconditionally, which can cause a problem in atomic context if the
      task is rescheduled.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:108
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
        3 locks held by memhog/946:
         #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
         #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
         #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
        CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
        Call Trace:
          unmap_kernel_range_noflush+0x2eb/0x350
          unmap_kernel_range+0x14/0x30
          zs_unmap_object+0xd5/0xe0
          zram_bvec_rw.isra.0+0x38c/0x8e0
          zram_rw_page+0x90/0x101
          bdev_write_page+0x92/0xe0
          __swap_writepage+0x94/0x4a0
          pageout+0xe3/0x3a0
          shrink_page_list+0xb94/0xd60
          shrink_inactive_list+0x158/0x460
      
      We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
      contains the offending calling code) from zsmalloc.
      
      Even though this option showed some improvement (e.g., 30%) on some
      arm32 platforms, it has been a headache to maintain because it abuses
      APIs[1] (e.g., unmap_kernel_range in atomic context).
      
      Since we are moving toward deprecating 32-bit machines, the config
      option has been available only for built-in builds since v5.8, and it
      is not the default option in zsmalloc, it's time to drop the option
      for better maintenance.
      
      [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
      
      Fixes: e47110e9 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Harish Sriram <harish@linux.ibm.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e91d8d78
    • mm: list_lru: set shrinker map bit when child nr_items is not zero · 8199be00
      Yang Shi authored
      When investigating a slab cache bloat problem, a significant amount of
      negative dentry cache was seen, but confusingly it was neither shrunk
      by the reclaimer (the host has very tight memory) nor shrunk by
      dropping caches.  The vmcore shows there are over 14M negative dentry
      objects on the lru, but tracing shows they were not scanned at all.
      
      Further investigation shows the memcg's vfs shrinker_map bit is not set,
      so the reclaimer or dropping caches just skips calling the vfs shrinker.
      We had to reboot the hosts to get the memory back.
      
      I didn't manage to come up with a reproducer in a test environment, and
      the problem can't be reproduced after rebooting.  But code inspection
      suggests there is a race between clearing the shrinker map bit and
      reparenting.  The hypothesis is elaborated below.
      
      The memcg hierarchy on our production environment looks like:
      
                      root
                     /    \
                system   user
      
      The main workloads run under the user slice's children, and memcgs are
      created and removed frequently.  So reparenting happens very often
      under the user slice, but no task is directly under the user slice.
      
      So with frequent reparenting and tight memory pressure, the
      hypothetical race condition below may happen:
      
             CPU A                            CPU B
      reparent
          dst->nr_items == 0
                                       shrinker:
                                           total_objects == 0
          add src->nr_items to dst
          set_bit
                                           return SHRINK_EMPTY
                                           clear_bit
      child memcg offline
          replace child's kmemcg_id with
          parent's (in memcg_offline_kmem())
                                        list_lru_del() between shrinker runs
                                           see parent's kmemcg_id
                                           dec dst->nr_items
      reparent again
          dst->nr_items may go negative
          due to concurrent list_lru_del()
      
                                       The second run of shrinker:
                                           read nr_items without any
                                           synchronization, so it may
                                           see intermediate negative
                                           nr_items then total_objects
                                           may return 0 coincidentally
      
                                           keep the bit cleared
          dst->nr_items != 0
          skip set_bit
          add src->nr_items to dst
      
      After this point dst->nr_items may never go to zero, so reparenting
      will not set the shrinker_map bit anymore.  And since there is no task
      directly under the user slice, no new object will be added to its lru
      to set the shrinker map bit either.  The bit stays cleared forever.
      
      How does list_lru_del() race with reparenting? Reparenting replaces the
      children's kmemcg_id with the parent's without holding nlru->lock, so
      list_lru_del() may see the parent's kmemcg_id while actually deleting
      items from the child's lru, and so decrement the parent's nr_items.
      The parent's nr_items may therefore go negative, as commit 2788cf0c
      ("memcg: reparent list_lrus and free kmemcg_id on css offline") says.
      
      Since it is impossible for dst->nr_items to go negative and
      src->nr_items to go to zero at the same time, it seems we could set
      the shrinker map bit iff src->nr_items != 0.  We could synchronize
      list_lru_count_one() and reparenting with nlru->lock, but checking
      src->nr_items in reparenting seems the simplest approach and avoids
      lock contention.
      
      Fixes: fae91d6d ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
      Suggested-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.19]
      Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8199be00
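The decision rule proposed in the list_lru fix above (set the shrinker map bit iff src->nr_items != 0) can be sketched with plain counters. The `lru` struct and the `reparent` function are hypothetical simplifications for this illustration. There is no locking, no bitmap, and no memcg hierarchy here, only the condition under which the bit would be set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-memcg LRU counter; not the kernel's list_lru_one. */
struct lru { long nr_items; };

/*
 * Move src's items to dst. Returns true when the caller should set the
 * shrinker map bit for dst: whenever src actually contributed items,
 * regardless of what dst->nr_items was beforehand.
 */
bool reparent(struct lru *dst, struct lru *src)
{
    bool set;

    assert(dst && src);
    set = src->nr_items != 0;
    dst->nr_items += src->nr_items;
    src->nr_items = 0;
    return set;
}
```

A rule keyed on `dst->nr_items == 0` instead would return false when dst holds a transiently negative count, which is the situation in the race diagram where the bit stays cleared.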