1. 06 Mar, 2019 40 commits
    • Yang Shi's avatar
      mm: swap: check if swap backing device is congested or not · 8fd2e0b5
      Yang Shi authored
      Swap readahead would read in a few pages regardless if the underlying
      device is busy or not.  It may incur long waiting time if the device is
      congested, and it may also exacerbate the congestion.
      
      Use inode_read_congested() to check if the underlying device is busy or
      not like what file page readahead does.  Get inode from
      swap_info_struct.
      
      Although we can add inode information in swap_address_space
      (address_space->host), it may lead some unexpected side effect, i.e.  it
      may break mapping_cap_account_dirty().  Using inode from
      swap_info_struct seems simple and good enough.
      
      Just does the check in vma_cluster_readahead() since
      swap_vma_readahead() is just used for non-rotational device which much
      less likely has congestion than traditional HDD.
      
      Although swap slots may be consecutive on swap partition, it still may
      be fragmented on swap file.  This check would help to reduce excessive
      stall for such case.
      
      The test with page_fault1 of will-it-scale (sometimes tracing may just
      show runtest.py that is the wrapper script of page_fault1), which
      basically launches NR_CPU threads to generate 128MB anonymous pages for
      each thread, on my virtual machine with congested HDD shows long tail
      latency is reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarTim Chen <tim.c.chen@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fd2e0b5
    • Matthew Wilcox's avatar
      mm/filemap.c: remove redundant test from find_get_pages_contig · 14ef1fc7
      Matthew Wilcox authored
      After we establish a reference on the page, we check the pointer
      continues to be in the correct position in i_pages.  Checking
      page->index afterwards is unnecessary; if it were to change, then the
      pointer to it from the page cache would also move.  The check used to be
      done before grabbing a reference on the page which was racy (see commit
      9cbb4cb2 ("mm: find_get_pages_contig fixlet")), but nobody noticed
      that moving the check after grabbing the reference was redundant.
      
      Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      14ef1fc7
    • Gustavo A. R. Silva's avatar
      mm/memcontrol.c: use struct_size() in kmalloc() · 67b8046f
      Gustavo A. R. Silva authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array.  For example:
      
        struct foo {
            int stuff;
            void *entry[];
        };
      
        instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
        instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      
      Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedorSigned-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67b8046f
    • Wei Yang's avatar
      mm: remove extra drain pages on pcp list · c52e7593
      Wei Yang authored
      In the current implementation, there are two places to isolate a range
      of page: __offline_pages() and alloc_contig_range().  During this
      procedure, it will drain pages on pcp list.
      
      Below is a brief call flow:
      
        __offline_pages()/alloc_contig_range()
            start_isolate_page_range()
                set_migratetype_isolate()
                    drain_all_pages()
            drain_all_pages()                 <--- A
      
      This snippet shows the current logic is isolate and drain pcp list for
      each pageblock and drain pcp list again for the whole range.
      
      start_isolate_page_range is responsible for isolating the given pfn
      range.  One part of that job is to make sure that also pages that are on
      the allocator pcp lists are properly isolated.  Otherwise they could be
      reused and the range wouldn't be completely isolated until the memory is
      freed back.  While there is no strict guarantee here because pages might
      get allocated at any time before drain_all_pages is called there doesn't
      seem to be any strong demand for such a guarantee.
      
      In any case, draining is already done at the isolation level and there
      is no need to do it again later by start_isolate_page_range callers
      (memory hotplug and CMA allocator currently).  Therefore remove
      pointless draining in existing callers to make the code more clear and
      functionally correct.
      
      [mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
      Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.comSigned-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c52e7593
    • Anshuman Khandual's avatar
      arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages · 5480280d
      Anshuman Khandual authored
      Let arm64 subscribe to the previously added framework in which
      architecture can inform whether a given huge page size is supported for
      migration.  This just overrides the default function
      arch_hugetlb_migration_supported() and enables migration for all
      possible HugeTLB page sizes on arm64.
      
      With this, HugeTLB migration support on arm64 now covers all possible
      HugeTLB options.
      
                CONT PTE    PMD    CONT PMD    PUD
                --------    ---    --------    ---
        4K:        64K      2M        32M      1G
        16K:        2M     32M         1G
        64K:        2M    512M        16G
      
      Link: http://lkml.kernel.org/r/1545121450-1663-6-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5480280d
    • Anshuman Khandual's avatar
      arm64/mm: enable HugeTLB migration · 4a03a058
      Anshuman Khandual authored
      Let arm64 subscribe to generic HugeTLB page migration framework.  Right
      now this only works on the following PMD and PUD level HugeTLB page
      sizes with various kernel base page size combinations.
      
               CONT PTE    PMD    CONT PMD    PUD
               --------    ---    --------    ---
        4K:         NA     2M         NA      1G
        16K:        NA    32M         NA
        64K:        NA   512M         NA
      
      Link: http://lkml.kernel.org/r/1545121450-1663-5-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a03a058
    • Anshuman Khandual's avatar
      mm/hugetlb: enable arch specific huge page size support for migration · e693de18
      Anshuman Khandual authored
      Architectures like arm64 have HugeTLB page sizes which are different
      than generic sizes at PMD, PUD, PGD level and implemented via contiguous
      bits.  At present these special size HugeTLB pages cannot be identified
      through macros like (PMD|PUD|PGDIR)_SHIFT and hence chosen not be
      migrated.
      
      Enabling migration support for these special HugeTLB page sizes along
      with the generic ones (PMD|PUD|PGD) would require identifying all of
      them on a given platform.  A platform specific hook can precisely
      enumerate all huge page sizes supported for migration.  Instead of
      comparing against standard huge page orders let
      hugetlb_migration_support() function call a platform hook
      arch_hugetlb_migration_support().  Default definition for the platform
      hook maintains existing semantics which checks standard huge page order.
      But an architecture can choose to override the default and provide
      support for a comprehensive set of huge page sizes.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e693de18
    • Anshuman Khandual's avatar
      mm/hugetlb: enable PUD level huge page migration · 9b553bf5
      Anshuman Khandual authored
      Architectures like arm64 have PUD level HugeTLB pages for certain configs
      (1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can
      be enabled for migration.  It can be achieved through checking for
      PUD_SHIFT order based HugeTLB pages during migration.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b553bf5
    • Anshuman Khandual's avatar
      mm/hugetlb: distinguish between migratability and movability · 7ed2c31d
      Anshuman Khandual authored
      Patch series "arm64/mm: Enable HugeTLB migration", v4.
      
      This patch series enables HugeTLB migration support for all supported
      huge page sizes at all levels including contiguous bit implementation.
      Following HugeTLB migration support matrix has been enabled with this
      patch series.  All permutations have been tested except for the 16GB.
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
        4K:         64K     2M         32M     1G
        16K:         2M    32M          1G
        64K:         2M   512M         16G
      
      First the series adds migration support for PUD based huge pages.  It
      then adds a platform specific hook to query an architecture if a given
      huge page size is supported for migration while also providing a default
      fallback option preserving the existing semantics which just checks for
      (PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enables HugeTLB
      migration on arm64 and subscribe to this new platform specific hook by
      defining an override.
      
      The second patch differentiates between movability and migratability
      aspects of huge pages and implements hugepage_movable_supported() which
      can then be used during allocation to decide whether to place the huge
      page in movable zone or not.
      
      This patch (of 5):
      
      During huge page allocation it's migratability is checked to determine
      if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
      But the movability aspect of the huge page could depend on other factors
      than just migratability.  Movability in itself is a distinct property
      which should not be tied with migratability alone.
      
      This differentiates these two and implements an enhanced movability check
      which also considers huge page size to determine if it is feasible to be
      placed under a movable zone.  At present it just checks for gigantic pages
      but going forward it can incorporate other enhanced checks.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarSteve Capper <steve.capper@arm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: default avatarMichal Hocko <mhocko@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ed2c31d
    • Matthew Wilcox's avatar
      mm: remove sysctl_extfrag_handler() · 6b7e5cad
      Matthew Wilcox authored
      sysctl_extfrag_handler() neglects to propagate the return value from
      proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
      to exist, so just use proc_dointvec_minmax() directly.
      
      Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reported-by: default avatarAditya Pakki <pakki001@umn.edu>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b7e5cad
    • Uladzislau Rezki (Sony)'s avatar
      selftests/vm: add script helper for CONFIG_TEST_VMALLOC_MODULE · a05ef00c
      Uladzislau Rezki (Sony) authored
      Add the test script for the kernel test driver to analyse vmalloc
      allocator for benchmarking and stressing purposes.  It is just a kernel
      module loader.  You can specify and pass different parameters in order
      to investigate allocations behaviour.  See "usage" output for more
      details.
      
      Also add basic vmalloc smoke test to the "run_vmtests" suite.
      
      Link: http://lkml.kernel.org/r/20190103142108.20744-4-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarShuah Khan <shuah@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a05ef00c
    • Uladzislau Rezki (Sony)'s avatar
      vmalloc: add test driver to analyse vmalloc allocator · 3f21a6b7
      Uladzislau Rezki (Sony) authored
      This adds a new kernel module for analysis of vmalloc allocator.  It is
      only enabled as a module.  There are two main reasons this module should
      be used for: performance evaluation and stressing of vmalloc subsystem.
      
      It consists of several test cases.  As of now there are 8.  The module
      has five parameters we can specify to change its the behaviour.
      
      1) run_test_mask - set of tests to be run
      
      id: 1,   name: fix_size_alloc_test
      id: 2,   name: full_fit_alloc_test
      id: 4,   name: long_busy_list_alloc_test
      id: 8,   name: random_size_alloc_test
      id: 16,  name: fix_align_alloc_test
      id: 32,  name: random_size_align_alloc_test
      id: 64,  name: align_shift_alloc_test
      id: 128, name: pcpu_alloc_test
      
      By default all tests are in run test mask.  If you want to select some
      specific tests it is possible to pass the mask.  For example for first,
      second and fourth tests we go 11 value.
      
      2) test_repeat_count - how many times each test should be repeated
      By default it is one time per test. It is possible to pass any number.
      As high the value is the test duration gets increased.
      
      3) test_loop_count - internal test loop counter. By default it is set
      to 1000000.
      
      4) single_cpu_test - use one CPU to run the tests
      By default this parameter is set to false. It means that all online
      CPUs execute tests. By setting it to 1, the tests are executed by
      first online CPU only.
      
      5) sequential_test_order - run tests in sequential order
      By default this parameter is set to false. It means that before running
      tests the order is shuffled. It is possible to make it sequential, just
      set it to 1.
      
      Performance analysis:
      In order to evaluate performance of vmalloc allocations, usually it
      makes sense to use only one CPU that runs tests, use sequential order,
      number of repeat tests can be different as well as set of test mask.
      
      For example if we want to run all tests, to use one CPU and repeat each
      test 3 times. Insert the module passing following parameters:
      
      single_cpu_test=1 sequential_test_order=1 test_repeat_count=3
      
      with following output:
      
      <snip>
      Summary: fix_size_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 901177 usec
      Summary: full_fit_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 1039341 usec
      Summary: long_busy_list_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 11775763 usec
      Summary: random_size_alloc_test passed 3: failed: 0 repeat: 3 loops: 1000000 avg: 6081992 usec
      Summary: fix_align_alloc_test passed: 3 failed: 0 repeat: 3, loops: 1000000 avg: 2003712 usec
      Summary: random_size_align_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 2895689 usec
      Summary: align_shift_alloc_test passed: 0 failed: 3 repeat: 3 loops: 1000000 avg: 573 usec
      Summary: pcpu_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 95802 usec
      All test took CPU0=192945605995 cycles
      <snip>
      
      The align_shift_alloc_test is expected to be failed.
      
      Stressing:
      In order to stress the vmalloc subsystem we run all available test cases
      on all available CPUs simultaneously. In order to prevent constant behaviour
      pattern, the test cases array is shuffled by default to randomize the order
      of test execution.
      
      For example if we want to run all tests(default), use all online CPUs(default)
      with shuffled order(default) and to repeat each test 30 times. The command
      would be like:
      
      modprobe vmalloc_test test_repeat_count=30
      
      Expected results are the system is alive, there are no any BUG_ONs or Kernel
      Panics the tests are completed, no memory leaks.
      
      [urezki@gmail.com: fix 32-bit builds]
        Link: http://lkml.kernel.org/r/20190106214839.ffvjvmrn52uqog7k@pc636
      [urezki@gmail.com: make CONFIG_TEST_VMALLOC depend on CONFIG_MMU]
        Link: http://lkml.kernel.org/r/20190219085441.s6bg2gpy4esny5vw@pc636
      Link: http://lkml.kernel.org/r/20190103142108.20744-3-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f21a6b7
    • Uladzislau Rezki (Sony)'s avatar
      vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE · 153178ed
      Uladzislau Rezki (Sony) authored
      Export __vmaloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is
      enabled.  Some test cases in vmalloc test suite module require and make
      use of that function.  Please note, that it is not supposed to be used
      for other purposes.
      
      We need it only for performance analysis, stressing and stability check
      of vmalloc allocator.
      
      Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      153178ed
    • Roman Penyaev's avatar
      mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range() · bc84c535
      Roman Penyaev authored
      vmalloc_user*() calls differ from normal vmalloc() only in that they set
      VM_USERMAP flags for the area.  During the whole history of vmalloc.c
      changes now it is possible simply to pass VM_USERMAP flags directly to
      __vmalloc_node_range() call instead of finding the area (which obviously
      takes time) after the allocation.
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.deSigned-off-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc84c535
    • Roman Penyaev's avatar
      mm/vmalloc: do not call kmemleak_free() on not yet accounted memory · c67dc624
      Roman Penyaev authored
      __vmalloc_area_node() calls vfree() on error path, which in turn calls
      kmemleak_free(), but area is not yet accounted by kmemleak_vmalloc().
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.deSigned-off-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c67dc624
    • Roman Penyaev's avatar
      mm/vmalloc: fix size check for remap_vmalloc_range_partial() · 401592d2
      Roman Penyaev authored
      When VM_NO_GUARD is not set area->size includes adjacent guard page,
      thus for correct size checking get_vm_area_size() should be used, but
      not area->size.
      
      This fixes possible kernel oops when userspace tries to mmap an area on
      1 page bigger than was allocated by vmalloc_user() call: the size check
      inside remap_vmalloc_range_partial() accounts non-existing guard page
      also, so check successfully passes but vmalloc_to_page() returns NULL
      (guard page does not physically exist).
      
      The following code pattern example should trigger an oops:
      
        static int oops_mmap(struct file *file, struct vm_area_struct *vma)
        {
              void *mem;
      
              mem = vmalloc_user(4096);
              BUG_ON(!mem);
              /* Do not care about mem leak */
      
              return remap_vmalloc_range(vma, mem, 0);
        }
      
      And userspace simply mmaps size + PAGE_SIZE:
      
        mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);
      
      Possible candidates for oops which do not have any explicit size
      checks:
      
         *** drivers/media/usb/stkwebcam/stk-webcam.c:
         v4l_stk_mmap[789]   ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
      
      Or the following one:
      
         *** drivers/video/fbdev/core/fbmem.c
         static int
         fb_mmap(struct file *file, struct vm_area_struct * vma)
              ...
              res = fb->fb_mmap(info, vma);
      
      Where fb_mmap callback calls remap_vmalloc_range() directly without any
      explicit checks:
      
         *** drivers/video/fbdev/vfb.c
         static int vfb_mmap(struct fb_info *info,
                   struct vm_area_struct *vma)
         {
             return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
         }
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.deSigned-off-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      401592d2
    • Roman Penyaev's avatar
      mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA · 5a82ac71
      Roman Penyaev authored
      This patch repeats the original one from David S Miller:
      
        2dca6999 ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA")
      
      but for missed vmalloc_32_user() case, which also requires correct
      alignment of virtual address on kernel side to avoid D-caches aliases.
      A bit of copy-paste from original patch to recover in memory of what is
      all about:
      
        When a vmalloc'd area is mmap'd into userspace, some kind of
        co-ordination is necessary for this to work on platforms with cpu
        D-caches which can have aliases.
      
        Otherwise kernel side writes won't be seen properly in userspace and
        vice versa.
      
        If the kernel side mapping and the user side one have the same
        alignment, modulo SHMLBA, this can work as long as VM_SHARED is shared
        of VMA and for all current users this is true. VM_SHARED will force
        SHMLBA alignment of the user side mmap on platforms with D-cache
        aliasing matters.
      
        David S. Miller
      
      > What are the user-visible runtime effects of this change?
      
      In simple words: proper alignment avoids possible difference in data,
      seen by different virtual mapings: userspace and kernel in our case.
      I.e. userspace reads cache line A, kernel writes to cache line B.  Both
      cache lines correspond to the same physical memory (thus aliases).
      
      So this should fix data corruption for archs with vivt and vipt caches,
      e.g. armv6.  Personally I've never worked with this archs, I just
      spotted the strange difference in code: for one case we do alignment,
      for another - not.  I have a strong feeling that David simply missed
      vmalloc_32_user() case.
      
      >
      > Is a -stable backport needed?
      
      No, I do not think so.  The only one user of vmalloc_32_user() is
      virtual frame buffer device drivers/video/fbdev/vfb.c, which has in the
      description "The main use of this frame buffer device is testing and
      debugging the frame buffer subsystem.  Do NOT enable it for normal
      systems!".
      
      And it seems to me that this vfb.c does not need 32bit addressable pages
      (vmalloc_32_user() case), because it is virtual device and should not
      care about things like dma32 zones, etc.  Probably is better to clean
      the code and switch vfb.c from vmalloc_32_user() to vmalloc_user() case
      and wipe out vmalloc_32_user() from vmalloc.c completely.  But I'm not
      very much sure that this is worth to do, that's so minor, so we can
      leave it as is.
      
      Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.deSigned-off-by: default avatarRoman Penyaev <rpenyaev@suse.de>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5a82ac71
    • Shakeel Butt's avatar
      memcg: localize memcg_kmem_enabled() check · 60cd4bcd
      Shakeel Butt authored
      Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
      functions, so, the users don't have to explicitly check that condition.
      
      This is purely code cleanup patch without any functional change.  Only
      the order of checks in memcg_charge_slab() can potentially be changed
      but the functionally it will be same.  This should not matter as
      memcg_charge_slab() is not in the hot path.
      
      Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60cd4bcd
    • Wei Yang's avatar
      mm, slub: make the comment of put_cpu_partial() complete · 9234bae9
      Wei Yang authored
      There are two cases when put_cpu_partial() is invoked.
      
          * __slab_free
          * get_partial_node
      
      This patch just makes it cover these two cases.
      
      Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.comSigned-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9234bae9
    • Kirill Tkhai's avatar
      mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Kirill Tkhai authored
      Add an optimization for KSM pages almost in the same way that we have
      for ordinary anonymous pages.  If there is a write fault in a page,
      which is mapped to an only pte, and it is not related to swap cache; the
      page may be reused without copying its content.
      
      [ Note that we do not consider PageSwapCache() pages at least for now,
        since we don't want to complicate __get_ksm_page(), which has nice
        optimization based on this (for the migration case). Currenly it is
        spinning on PageSwapCache() pages, waiting for when they have
        unfreezed counters (i.e., for the migration finish). But we don't want
        to make it also spinning on swap cache pages, which we try to reuse,
        since there is not a very high probability to reuse them. So, for now
        we do not consider PageSwapCache() pages at all. ]
      
      So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
      page_stable_node(), to skip a page, which KSM is currently trying to
      link to stable tree.  Then we do page_ref_freeze() to prohibit KSM to
      merge one more page into the page, we are reusing.  After that, nobody
      can refer to the reusing page: KSM skips !PageSwapCache() pages with
      zero refcount; and the protection against of all other participants is
      the same as for reused ordinary anon pages pte lock, page lock and
      mmap_sem.
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52d1e606
    • Stephen Rothwell's avatar
      tools/: replace open encodings for NUMA_NO_NODE · 7c9eefe8
      Stephen Rothwell authored
      This replaces all open encodings in tools with NUMA_NO_NODE.  Also
      linux/numa.h is now needed for the perf build.
      
      [sfr@canb.auug.org.au: fix for replace open encodings for NUMA_NO_NODE]
        Link: http://lkml.kernel.org/r/20190108131141.730e9c4f@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1545127933-10711-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Cc: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Cc: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c9eefe8
    • Anshuman Khandual's avatar
      mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual authored
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where invalid node number is
      encoded as -1.  Even though implicitly understood it is always better to
      have macros in there.  Replace these open encodings for an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places redirecting
      them to a common definition.
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
    • Liviu Dudau's avatar
      mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap() · 6ade2032
      Liviu Dudau authored
      find_vmap_area() can return a NULL pointer and we're going to
      dereference it without checking it first.  Use the existing
      find_vm_area() function which does exactly what we want and checks for
      the NULL pointer.
      
      Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
      Fixes: f3c01d2f ("mm: vmalloc: avoid racy handling of debugobjects in vunmap")
      Signed-off-by: default avatarLiviu Dudau <liviu@dudau.co.uk>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ade2032
    • David Hildenbrand's avatar
      PM/Hibernate: exclude all PageOffline() pages · abd02ac6
      David Hildenbrand authored
      The content of pages that are marked PG_offline is not of interest (e.g.
      inflated by a balloon driver), let's skip these pages.
      
      In saveable_highmem_page(), move the PageReserved() check to a new check
      along with the PageOffline() check to separate it from the swsusp
      checks.
      
      [david@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20181122100627.5189-9-david@redhat.com
      Link: http://lkml.kernel.org/r/20181119101616.8901-9-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarPavel Machek <pavel@ucw.cz>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abd02ac6
    • David Hildenbrand's avatar
      PM/Hibernate: use pfn_to_online_page() · 5b56db37
      David Hildenbrand authored
      Let's use pfn_to_online_page() instead of pfn_to_page() when checking
      for saveable pages to not save/restore offline memory sections.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-8-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Suggested-by: default avatarMichal Hocko <mhocko@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarPavel Machek <pavel@ucw.cz>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b56db37
    • David Hildenbrand's avatar
      vmw_balloon: mark inflated pages PG_offline · 8165540c
      David Hildenbrand authored
      Mark inflated and never onlined pages PG_offline, to tell the world that
      the content is stale and should not be dumped.
      
      [david@redhat.com: use vmballoon_page_in_frames more widely]
        Link: http://lkml.kernel.org/r/20181122100627.5189-7-david@redhat.com
      Link: http://lkml.kernel.org/r/20181119101616.8901-7-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarNadav Amit <namit@vmware.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8165540c
    • David Hildenbrand's avatar
      hv_balloon: mark inflated pages PG_offline · fae42c4d
      David Hildenbrand authored
      Mark inflated and never onlined pages PG_offline, to tell the world that
      the content is stale and should not be dumped.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-6-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarPankaj gupta <pagupta@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fae42c4d
    • David Hildenbrand's avatar
      xen/balloon: mark inflated pages PG_offline · 77c4adf6
      David Hildenbrand authored
      Mark inflated and never onlined pages PG_offline, to tell the world that
      the content is stale and should not be dumped.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-5-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77c4adf6
    • David Hildenbrand's avatar
      kexec: export PG_offline to VMCOREINFO · e04b742f
      David Hildenbrand authored
      Right now, pages inflated as part of a balloon driver will be dumped by
      dump tools like makedumpfile.  While XEN is able to check in the crash
      kernel whether a certain pfn is actuall backed by memory in the
      hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
      other balloon inflated memory will essentially result in zero pages
      getting allocated by the hypervisor and the dump getting filled with
      this data.
      
      The allocation and reading of zero pages can directly be avoided if a
      dumping tool could know which pages only contain stale information not
      to be dumped.
      
      We now have PG_offline which can be (and already is by virtio-balloon)
      used for marking pages as logically offline.  Follow up patches will
      make use of this flag also in other balloon implementations.
      
      Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile
      can directly skip pages that are logically offline and the content
      therefore stale.
      
      Please note that this is also helpful for a problem we were seeing under
      Hyper-V: Dumping logically offline memory (pages kept fake offline while
      onlining a section via online_page_callback) would under some condicions
      result in a kernel panic when dumping them.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-4-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarDave Young <dyoung@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e04b742f
    • David Hildenbrand's avatar
      mm: convert PG_balloon to PG_offline · ca215086
      David Hildenbrand authored
      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarPankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca215086
    • David Hildenbrand's avatar
      mm: balloon: update comment about isolation/migration/compaction · 4d3467e1
      David Hildenbrand authored
      Patch series "mm/kdump: allow to exclude pages that are logically
      offline"
      
      Right now, pages inflated as part of a balloon driver will be dumped by
      dump tools like makedumpfile.  While XEN is able to check in the crash
      kernel whether a certain pfn is actuall backed by memory in the
      hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
      virtio-balloon, hv-balloon and VMWare balloon inflated memory will
      essentially result in zero pages getting allocated by the hypervisor and
      the dump getting filled with this data.
      
      The allocation and reading of zero pages can directly be avoided if a
      dumping tool could know which pages only contain stale information not
      to be dumped.
      
      Also for XEN, calling into the kernel and asking the hypervisor if a pfn
      is backed can be avoided if the duming tool would skip such pages right
      from the beginning.
      
      Dumping tools have no idea whether a given page is part of a balloon
      driver and shall not be dumped.  Esp.  PG_reserved cannot be used for
      that purpose as all memory allocated during early boot is also
      PG_reserved, see discussion at [1].  So some other way of indication is
      required and a new page flag is frowned upon.
      
      We have PG_balloon (MAPCOUNT value), which is essentially unused now.  I
      suggest renaming it to something more generic (PG_offline) to mark pages
      as logically offline.  This flag can than e.g.  also be used by
      virtio-mem in the future to mark subsections as offline.  Or by other
      code that wants to put pages logically offline (e.g.  later maybe
      poisoned pages that shall no longer be used).
      
      This series converts PG_balloon to PG_offline, allows dumping tools to
      query the value to detect such pages and marks pages in the hv-balloon
      and XEN balloon properly as PG_offline.  Note that virtio-balloon
      already set pages to PG_balloon (and now PG_offline).
      
      Please note that this is also helpful for a problem we were seeing under
      Hyper-V: Dumping logically offline memory (pages kept fake offline while
      onlining a section via online_page_callback) would under some condicions
      result in a kernel panic when dumping them.
      
      As I don't have access to neither XEN nor Hyper-V nor VMWare
      installations, this was only tested with the virtio-balloon and pages
      were properly skipped when dumping.  I'll also attach the makedumpfile
      patch to this series.
      
      [1] https://lkml.org/lkml/2018/7/20/566
      
      This patch (of 8):
      
      Commit b1123ea6 ("mm: balloon: use general non-lru movable page
      feature") reworked balloon handling to make use of the general non-lru
      movable page feature.  The big comment block in balloon_compaction.h
      contains quite some outdated information.  Let's fix this.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-2-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d3467e1
    • Arun KS's avatar
      mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Arun KS authored
      When freeing pages are done with higher order, time spent on coalescing
      pages by buddy allocator can be reduced.  With section size of 256MB,
      hot add latency of a single section shows improvement from 50-60 ms to
      less than 1 ms, hence improving the hot add latency by 60 times.  Modify
      external providers of online callback to align with the change.
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.orgSigned-off-by: default avatarArun KS <arunks@codeaurora.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9cd410a
    • Qian Cai's avatar
      mm/slub.c: remove an unused addr argument · 278d7756
      Qian Cai authored
      "addr" function argument is not used in alloc_consistency_checks() at
      all, so remove it.
      
      Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
      Fixes: becfda68 ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      278d7756
    • Tobin C. Harding's avatar
      include/linux/slub_def.h: comment fixes · de810f49
      Tobin C. Harding authored
      Capitialize comment string, use C89 comment style, correct
      grammar/punctuation in comments.
      
      Link: http://lkml.kernel.org/r/20190204005713.9463-2-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-3-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-4-tobin@kernel.orgSigned-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarWilliam Kucharski <william.kucharski@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de810f49
    • Qian Cai's avatar
      mm/slab.c: kmemleak no scan alien caches · 92d1d07d
      Qian Cai authored
      Kmemleak throws endless warnings during boot due to in
      __alloc_alien_cache(),
      
          alc = kmalloc_node(memsize, gfp, node);
          init_arraycache(&alc->ac, entries, batch);
          kmemleak_no_scan(ac);
      
      Kmemleak does not track the array cache (alc->ac) but the alien cache
      (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
      out of init_arraycache().
      
      There is another place that calls init_arraycache(), but
      alloc_kmem_cache_cpus() uses the percpu allocation where will never be
      considered as a leak.
      
        kmemleak: Found object by alias at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         lookup_object+0x84/0xac
         find_and_get_object+0x84/0xe4
         kmemleak_no_scan+0x74/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
        kmemleak: Object 0xffff8007b9aa7e00 (size 256):
        kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
        kmemleak:   min_count = 1
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             kmemleak_alloc+0x84/0xb8
             kmem_cache_alloc_node_trace+0x31c/0x3a0
             __kmalloc_node+0x58/0x78
             setup_kmem_cache_node+0x26c/0x35c
             __do_tune_cpucache+0x250/0x2d4
             do_tune_cpucache+0x4c/0xe4
             enable_cpucache+0xc8/0x110
             setup_cpu_cache+0x40/0x1b8
             __kmem_cache_create+0x240/0x358
             create_cache+0xc0/0x198
             kmem_cache_create_usercopy+0x158/0x20c
             kmem_cache_create+0x50/0x64
             fsnotify_init+0x58/0x6c
             do_one_initcall+0x194/0x388
             kernel_init_freeable+0x668/0x688
             kernel_init+0x18/0x124
        kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         kmemleak_no_scan+0x90/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
      
      Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
      Fixes: 1fe00d50 ("slab: factor out initialization of array cache")
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      92d1d07d
    • Peng Wang's avatar
      mm/slub.c: freelist is ensured to be NULL when new_slab() fails · edde82b6
      Peng Wang authored
      new_slab_objects() will return immediately if freelist is not NULL.
      
               if (freelist)
                       return freelist;
      
      One more assignment operation could be avoided.
      
      Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cnSigned-off-by: default avatarPeng Wang <rocking@whu.edu.cn>
      Reviewed-by: default avatarPekka Enberg <penberg@kernel.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      edde82b6
    • Shuriyc Chu's avatar
      fs/file.c: initialize init_files.resize_wait · 5704a068
      Shuriyc Chu authored
      (Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)
      
      'get_unused_fd_flags' in kthread cause kernel crash.  It works fine on
      4.1, but causes crash after get 64 fds.  It also cause crash on
      ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the
      same.
      
      The crash message on centos7.5 shows below:
      
        start fd 61
        start fd 62
        start fd 63
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: __wake_up_common+0x2e/0x90
        PGD 0
        Oops: 0000 [#1] SMP
        Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
         virtio floppy dm_mirror dm_region_hash dm_log dm_mod
        CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.3.3.el7.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
        RIP: 0010:__wake_up_common+0x2e/0x90
        RSP: 0018:ffff8e94247a2d18  EFLAGS: 00010086
        RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
        RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
        FS:  0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
        Call Trace:
          __wake_up+0x39/0x50
          expand_files+0x131/0x250
          __alloc_fd+0x47/0x170
          get_unused_fd_flags+0x30/0x40
          test_fd+0x12a/0x1c0 [test]
          kthread+0xd1/0xe0
          ret_from_fork_nospec_begin+0x21/0x21
        Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 <48> 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
        RIP   __wake_up_common+0x2e/0x90
         RSP <ffff8e94247a2d18>
        CR2: 0000000000000000
      
      This issue exists since CentOS 7.5 3.10.0-862 and CentOS 7.4
      (3.10.0-693.21.1 ) is ok.  Root cause: the item 'resize_wait' is not
      initialized before being used.
      Reported-by: default avatarRichard Zhang <zhang.zijian@h3c.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5704a068
    • Vineet Gupta's avatar
      fs/inode.c: inode_set_flags(): replace opencoded set_mask_bits() · a905737f
      Vineet Gupta authored
      It seems that commits 5f16f322 and 00a1a053, both with same
      commitlog ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()")
      introduced the set_mask_bits API, but somehow missed not using it in ext4
      in the end.
      
      Also, set_mask_bits() is used in fs quite a bit and we can possibly come
      up with a generic llsc based implementation (w/o the cmpxchg loop)
      
      Link: http://lkml.kernel.org/r/1548275584-18096-3-git-send-email-vgupta@synopsys.comSigned-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Reviewed-by: default avatarAnthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a905737f
    • Gustavo A. R. Silva's avatar
      ocfs2: Use zero-sized array and struct_size() in kzalloc() · f402cf03
      Gustavo A. R. Silva authored
      Update the code to use a zero-sized array instead of a pointer in
      structure ocfs2_slot_info and use struct_size() in kzalloc().
      
      Notice that one of the more common cases of allocation size calculations
      is finding the size of a structure that has a zero-sized array at the
      end, along with memory for some number of elements for that array.  For
      example:
      
        struct foo {
            int stuff;
            void *entry[];
        };
      
        instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
        instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      
      Link: http://lkml.kernel.org/r/20190108191903.GA22056@embeddedorSigned-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f402cf03
    • Gang He's avatar
      ocfs2: fix the application IO timeout when fstrim is running · 5500ab4e
      Gang He authored
      The user reported this problem, the upper application IO was timeout
      when fstrim was running on this ocfs2 partition.  the application
      monitoring resource agent considered that this application did not work,
      then this node was fenced by the cluster brain (e.g.  pacemaker).
      
      The root cause is that fstrim thread always holds main_bm meta-file
      related locks until all the cluster groups are trimmed.  This patch will
      make fstrim thread release main_bm meta-file related locks when each
      cluster group is trimmed, this will let the current application IO has a
      chance to claim the clusters from main_bm meta-file.
      
      Link: http://lkml.kernel.org/r/20190111090014.31645-1-ghe@suse.comSigned-off-by: default avatarGang He <ghe@suse.com>
      Reviewed-by: default avatarChangwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5500ab4e