07 Sep, 2017 (40 commits)
    • mm, page_owner: don't grab zone->lock for init_pages_in_zone() · 10903027
      Vlastimil Babka authored
      init_pages_in_zone() is run under zone->lock, which means a long lock
      time and disabled interrupts on large machines.  This is currently not
      an issue since it runs early in boot, but a later patch will change
      that.
      
      However, like other pfn scanners, we don't actually need zone->lock even
      when other cpus are running.  The only potentially dangerous operation
      here is reading a bogus buddy page owner due to a race, and we already know
      how to handle that.  The worst that can happen is that we skip some
      early allocated pages, which should not affect the debugging power of
      page_owner noticeably.
      
      Link: http://lkml.kernel.org/r/20170720134029.25268-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10903027
    • mm, page_ext: periodically reschedule during page_ext_init() · 0fc542b7
      Vlastimil Babka authored
      page_ext_init() can take a long time on large machines, so add a cond_resched()
      point after each section is processed.  This will allow moving the init
      to a later point at boot without triggering lockup reports.
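
      A minimal sketch of the shape of the change, assuming the existing
      per-section loop in page_ext_init() (loop details abbreviated; this is
      illustrative rather than the exact diff):

        for (pfn = start_pfn; pfn < end_pfn;
             pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {

                if (!pfn_present(pfn))
                        continue;
                if (init_section_page_ext(pfn, nid))
                        goto oom;
                cond_resched();   /* new: yield once per initialized section */
        }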
      
      Link: http://lkml.kernel.org/r/20170720134029.25268-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fc542b7
    • mm, page_owner: make init_pages_in_zone() faster · dab4ead1
      Vlastimil Babka authored
      In init_pages_in_zone() we currently use the generic set_page_owner()
      function to initialize page_owner info for early allocated pages.  This
      means we needlessly do lookup_page_ext() twice for each page and, more
      importantly, call save_stack(), which has to unwind the stack and find the
      corresponding stack depot handle.  Because the stack is always the same
      for the initialization, unwind it once in init_pages_in_zone() and reuse
      the handle.  Also avoid the repeated lookup_page_ext().
      
      This can significantly reduce boot times with page_owner=on on large
      machines, especially for kernels built without frame pointer, where the
      stack unwinding is noticeably slower.
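
      As a rough illustration of the idea (the wrapper below is hypothetical
      and the __set_page_owner_handle() signature is an assumption, not the
      literal patch), the stack is unwound once and the resulting depot handle
      is stamped on every early allocated page found by the pfn walk:

        static void __init set_early_page_owner(struct page *page,
                                                depot_stack_handle_t handle)
        {
                struct page_ext *page_ext = lookup_page_ext(page);

                if (unlikely(!page_ext))
                        return;
                /* reuse the handle unwound once outside the loop; order and
                 * gfp details are omitted here */
                __set_page_owner_handle(page_ext, handle, 0, 0);
        }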
      
      [vbabka@suse.cz: don't duplicate code of __set_page_owner(), per Michal Hocko]
      [akpm@linux-foundation.org: coding-style fixes]
      [vbabka@suse.cz: create statically allocated fake stack trace for early allocated pages, per Michal]
        Link: http://lkml.kernel.org/r/45813564-2342-fc8d-d31a-f4b68a724325@suse.cz
      Link: http://lkml.kernel.org/r/20170720134029.25268-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dab4ead1
    • mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations · b95046b0
      Michal Hocko authored
      Commit f52407ce ("memory hotplug: alloc page from other node in
      memory online") has introduced N_HIGH_MEMORY checks to only use NUMA
      aware allocations when there is some memory present because the
      respective node might not have any memory yet at the time and so it
      could fail or even OOM.
      
      Things have changed since then though.  Zonelists are now always
      initialized before we do any allocations even for hotplug (see
      959ecc48 ("mm/memory_hotplug.c: fix building of node hotplug
      zonelist")).
      
      Therefore these checks are not really needed.  In fact, the caller of the
      allocator should never care about whether the node is populated because
      that might change at any time.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b95046b0
    • mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Michal Hocko authored
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  However,
      new users of the lock have grown since then.
      
      Notably, setup_per_zone_wmarks wants to prevent races between memory
      hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes")).  Let's
      add a private lock for that purpose.  This will not prevent us from seeing
      a halfway-through memory hotplug operation, but that shouldn't be a big
      deal because memory hotplug will update watermarks explicitly, so we will
      eventually get a full picture.  The lock just makes sure we won't race
      when updating watermarks, which could lead to weird results.
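
      A minimal sketch of what such a private lock can look like, assuming the
      watermark update is funneled through a single helper (general shape
      only, not necessarily the literal patch):

        void setup_per_zone_wmarks(void)
        {
                static DEFINE_SPINLOCK(lock);

                /* serialize concurrent watermark updates without taking the
                 * global zonelists_mutex */
                spin_lock(&lock);
                __setup_per_zone_wmarks();
                spin_unlock(&lock);
        }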
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b93e0f32
    • mm, page_alloc: remove stop_machine from build_all_zonelists · 11cd8638
      Michal Hocko authored
      build_all_zonelists has been (ab)using stop_machine to make sure that
      zonelists do not change while somebody is looking at them.  This is
      just a gross hack because a) it complicates the context from which we
      can call build_all_zonelists (see 3f906ba2 ("mm/memory-hotplug:
      switch locking to a percpu rwsem")), b) it is not really necessary
      especially after "mm, page_alloc: simplify zonelist initialization" and
      c) it doesn't really provide the protection it claims (see below).
      
      Updates of the zonelists happen very seldom, basically only when a zone
      becomes populated during memory online or when it loses all the memory
      during offline.  A racing iteration over zonelists could either miss a
      zone or try to work on one zone twice.  Both of these are something we
      can live with occasionally because there will always be at least one
      zone visible so we are not likely to fail allocation too easily for
      example.
      
      Please note that the original stop_machine approach doesn't really
      provide a better exclusion because the iteration might be interrupted
      halfway (unless the whole iteration runs with preemption disabled, which
      is usually not the case), so some zones could still be seen twice or a
      zone could be missed.
      
      I have run the pathological online/offline of the single memblock in the
      movable zone while stressing the same small node with some memory
      pressure.
      
      Node 1, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 943, 943, 943)
      Node 1, zone    DMA32
        pages free     227310
              min      8294
              low      10367
              high     12440
              spanned  262112
              present  262112
              managed  241436
              protection: (0, 0, 0, 0)
      Node 1, zone   Normal
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 0, 1024)
      Node 1, zone  Movable
        pages free     32722
              min      85
              low      117
              high     149
              spanned  32768
              present  32768
              managed  32768
              protection: (0, 0, 0, 0)
      
      root@test1:/sys/devices/system/node/node1# while true
      do
      	echo offline > memory34/state
      	echo online_movable > memory34/state
      done
      
      root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4
      
      and it survived without any unexpected behavior.  While this is not
      really great test coverage, it should exercise the allocation path
      quite a lot.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11cd8638
    • mm, page_alloc: simplify zonelist initialization · 9d3be21b
      Michal Hocko authored
      build_zonelists gradually builds zonelists from the nearest to the most
      distant node.  As we do not know how many populated zones we will have
      in each node, we rely on a NULL zone in the _zoneref to terminate the
      initialized part of the zonelist.  While this is functionally correct, it is
      quite suboptimal because we cannot allow updaters to race with zonelists
      users because they could see an empty zonelist and fail the allocation
      or hit the OOM killer in the worst case.
      
      We can do much better, though.  We can store the node ordering into an
      already existing node_order array and then give this array to
      build_zonelists_in_node_order and do the whole initialization at once.
      zonelist consumers still might see a halfway-initialized state, but that
      should be much more tolerable because the list will not be empty and
      they would either see some zone twice or skip over some zone(s) in the
      worst case, which shouldn't lead to immediate failures.
      
      While at it let's simplify build_zonelists_node which is rather
      confusing now.  It gets an index into the zoneref array and returns the
      updated index for the next iteration.  Let's rename the function to
      build_zonerefs_node to better reflect its purpose and give it zoneref
      array to update.  The function doesn't return the index anymore.  It just
      returns the number of added zones so that the caller can advance the
      zoneref array start for the next update.
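
      The new helper then takes roughly this shape (simplified; the caller's
      zoneref cursor handling is omitted):

        static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
        {
                struct zone *zone;
                enum zone_type zone_type = MAX_NR_ZONES;
                int nr_zones = 0;

                do {
                        zone_type--;
                        zone = pgdat->node_zones + zone_type;
                        if (managed_zone(zone)) {
                                zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                                check_highest_zone(zone_type);
                        }
                } while (zone_type);

                /* caller advances its zoneref array by this many entries */
                return nr_zones;
        }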
      
      This patch alone doesn't introduce any functional change yet; it is
      merely preparatory work for later changes.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d3be21b
    • mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node · 34ad1296
      Michal Hocko authored
      try_online_node calls hotadd_new_pgdat which already calls
      build_all_zonelists.  So the additional call is redundant.  Even though
      hotadd_new_pgdat will only initialize the zonelists of the new node, this
      is the right thing to do because such a node doesn't have any memory, so
      other zonelists would ignore all the zones from this node anyway.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34ad1296
    • mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Michal Hocko authored
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call it
      from its only user directly.  This also removes a pointless zonelists
      rebuild, which is always good.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72675e13
    • mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization · d9c9a0b9
      Michal Hocko authored
      __build_all_zonelists reinitializes each online cpu local node for
      CONFIG_HAVE_MEMORYLESS_NODES.  This makes sense because previously
      memoryless nodes could gain some memory during memory hotplug and so
      the local node should be changed for CPUs close to such a node.  It
      makes less sense to do that unconditionally for a newly created NUMA
      node which is still offline and without any memory.
      
      Let's also simplify the cpu loop and use for_each_online_cpu instead of
      an explicit cpu_online check for all possible cpus.
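
      A sketch of the loop simplification described above (surrounding
      context trimmed):

        /* before: walk all possible cpus and filter on cpu_online() */
        for_each_possible_cpu(cpu) {
                if (cpu_online(cpu))
                        set_cpu_numa_mem(cpu,
                                local_memory_node(cpu_to_node(cpu)));
        }

        /* after: iterate online cpus directly */
        for_each_online_cpu(cpu)
                set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));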
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9c9a0b9
    • mm, page_alloc: remove boot pageset initialization from memory hotplug · afb6ebb3
      Michal Hocko authored
      boot_pageset is a boot time hack which gets superseded by normal
      pagesets later in the boot process.  It makes zero sense to reinitialize
      it again and again during memory hotplug.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afb6ebb3
    • mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko authored
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we have.
      The primary motivation was bug report [2], which got resolved, but the
      usage of stop_machine is just too ugly to live.  Most patches are
      straightforward but 3 of them need special consideration.
      
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept sysctl in place and warn into the log if somebody tries to
      configure zone lists ordering.  If somebody has a real usecase for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS, but care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone ordered zonelists costs us a lot of code while the
      usefulness is arguable, if existent at all.  Mel has already made node
      ordering default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fallback to a
      different NUMA node rather than consume precious lowmem zones.
      
      This argument is, however, weakened by the fact that the memory reclaim
      has been reworked to be node rather than zone oriented.  This means that
      lowmem requests have to skip over all highmem pages on LRUs already and
      so zone ordering doesn't save the reclaim time much.  So the only
      advantage of the zone ordering is under a light memory pressure when
      highmem requests do not ever hit into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use a 64b kernel on such HW, I believe we
      should rather care about code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep the sysctl in place and warn if
      somebody tries to set zone ordering either from the kernel command line
      or the sysctl.
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • zram: add config and doc file for writeback feature · 5a47074f
      Minchan Kim authored
      This patch adds documentation and a Kconfig entry for the writeback feature.
      
      Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a47074f
    • zram: read page from backing device · 8e654f8f
      Minchan Kim authored
      This patch enables read IO from the backing device.  For this feature, it
      implements two IO read functions to transfer data from backing storage.

      One is an asynchronous IO function and the other is a synchronous one.

      The reason synchronous IO is needed is partial writes, which have to
      complete the read IO before overwriting the partial data.

      We could make the partial IO case asynchronous too, but at the moment it
      doesn't seem worth adding more complexity to support such a rare case,
      so keep it simple.
      
      [xieyisheng1@huawei.com: read_from_bdev_async(): return 1 to avoid call page_endio() in zram_rw_page()]
        Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
      Link: http://lkml.kernel.org/r/1498459987-24562-9-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e654f8f
    • zram: write incompressible pages to backing device · db8ffbd4
      Minchan Kim authored
      This patch enables write IO to transfer data to the backing device.  For
      that, it implements a write_to_bdev function which creates a new bio and
      chains it with the parent bio to make the parent bio asynchronous.

      For rw_page, which doesn't have a parent bio, it submits its own bio and
      handles IO completion via zram_page_end_io.

      Also, this patch defines a new flag, ZRAM_WB, to mark written pages for
      later read IO.
      
      [xieyisheng1@huawei.com: fix typo in comment]
        Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com
      Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db8ffbd4
    • zram: identify asynchronous IO's return value · ae85a807
      Minchan Kim authored
      For upcoming asynchronous IO like writeback, zram_rw_page should be
      aware of whether the requested IO was completed synchronously, was
      submitted successfully, or failed with an error.

      For that goal, zram_bvec_rw has three return values:

      -errno: an error occurred
           0: the IO request was completed synchronously
           1: the IO request was submitted successfully (asynchronous IO)
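
      For illustration, a caller such as zram_rw_page() can then dispatch on
      the three cases roughly like this (assumed caller shape, not the
      literal code):

        ret = zram_bvec_rw(zram, &bv, index, offset, is_write);
        if (unlikely(ret < 0)) {
                err = ret;                      /* -errno: the request failed */
        } else if (ret == 0) {
                page_endio(page, is_write, 0);  /* completed synchronously */
        }
        /* ret == 1: submitted; the bio end_io callback completes the page */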
      
      Link: http://lkml.kernel.org/r/1498459987-24562-7-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae85a807
    • zram: add free space management in backing device · 1363d466
      Minchan Kim authored
      With a backing device, zram needs to manage the free space of the
      backing device.

      This patch adds very naive bitmap logic to manage that free space.
      However, it should be good enough considering how infrequent
      incompressible pages are in zram.
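
      A sketch of the naive bitmap scheme (helper and field names are
      assumptions used for illustration):

        /* grab a free page-sized slot in the backing device */
        static unsigned long get_entry_bdev(struct zram *zram)
        {
                unsigned long blk_idx;

        retry:
                /* bit 0 is reserved so that "no entry" can stay 0 */
                blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
                if (blk_idx == zram->nr_pages)
                        return 0;               /* backing device is full */
                if (test_and_set_bit(blk_idx, zram->bitmap))
                        goto retry;             /* lost a race, pick another bit */
                return blk_idx;
        }

        static void put_entry_bdev(struct zram *zram, unsigned long entry)
        {
                clear_bit(entry, zram->bitmap);
        }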
      
      Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1363d466
    • zram: add interface to specify backing device · 013bf95a
      Minchan Kim authored
      For the writeback feature, the user should set up a backing device
      before zram starts working.

      This patch adds the interface via /sys/block/zramX/backing_dev.

      Currently, it supports a block device only, but it could be enhanced to
      support a file as well.
      
      Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      013bf95a
    • zram: rename zram_decompress_page to __zram_bvec_read · 693dc1ce
      Minchan Kim authored
      The zram_decompress_page name is not proper because the function doesn't
      decompress if the page was a dedup hit or was stored without compression.

      Use a more abstract term that is consistent with the write-path function
      __zram_bvec_write.
      
      Link: http://lkml.kernel.org/r/1498459987-24562-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      693dc1ce
    • zram: inline zram_compress · 97ec7c8b
      Minchan Kim authored
      zram_compress does several things: compression, entry allocation and
      limit checking.  It was split out just for readability, but it hurts
      modularization. :(

      So this patch removes the zram_compress function and inlines it in
      __zram_bvec_write for upcoming patches.
      
      Link: http://lkml.kernel.org/r/1498459987-24562-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97ec7c8b
    • zram: clean up duplicated codes in __zram_bvec_write · 4ebbe7f7
      Minchan Kim authored
      Patch series "writeback incompressible pages to storage", v1.
      
      zram is useful for saving memory with compressible pages, but sometimes
      the workload changes and the system ends up with lots of incompressible
      pages, which is very harmful for zram.

      This patch series adds a writeback feature to zram: an admin can set up
      a block device and, with it, zram can save memory by writing out
      incompressible pages (1/4 compression ratio) once it finds them, instead
      of keeping them in memory.

      Patches 1-3 are just cleanups and patches 4-8 enable the feature step by
      step.  Patches 4-8 are split into logical units rather than being fully
      bisectable, although I tried to keep every step compiling without
      breakage; I think this makes the series easier to review.
      
      This patch (of 9):
      
      __zram_bvec_write has some duplicated logic for zram metadata handling
      of same_page|compressed_page.  This patch aims to clean it up without
      any behavior change.
      
      [xieyisheng1@huawei.com: fix compr_data_size stat]
        Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
      Link: http://lkml.kernel.org/r/1496019048-27016-1-git-send-email-minchan@kernel.org
      Link: http://lkml.kernel.org/r/1498459987-24562-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Juneho Choi <juno.choi@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ebbe7f7
    • mm, memory_hotplug: remove zone restrictions · c6f03e29
      Michal Hocko authored
      Historically we have enforced that any kernel zone (e.g. ZONE_NORMAL) has
      to precede the Movable zone in the physical memory range.  The purpose
      of the movable zone is, however, not bound to any physical memory
      restriction.  It merely defines a class of migratable and reclaimable
      memory.
      
      There are users (e.g.  CMA) who might want to reserve specific physical
      memory ranges for their own purpose.  Moreover our pfn walkers have to
      be prepared for zones overlapping in the physical range already because
      we do support interleaving NUMA nodes and therefore zones can interleave
      as well.  This means we can allow each memory block to be associated
      with a different zone.
      
      Loosen the current onlining semantic and allow explicit onlining type on
      any memblock.  That means that online_{kernel,movable} will be allowed
      regardless of the physical address of the memblock as long as it is
      offline, of course.  This might result in the movable zone overlapping with
      other kernel zones.  Default onlining then becomes a bit tricky but
      still sensible.  echo online > memoryXY/state will online the given
      block to
      
      	1) the default zone if the given range is outside of any zone
      	2) the enclosing zone if such a zone doesn't interleave with
      	   any other zone
      	3) the default zone if more zones interleave for this range
      
      where default zone is movable zone only if movable_node is enabled
      otherwise it is a kernel zone.
      
      Here is an example of the semantics (movable_node is not present, but it
      works in an analogous way).  We start with the following memblocks, all
      of them offline:
      
        memory34/valid_zones:Normal Movable
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Normal Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal Movable
        memory40/valid_zones:Normal Movable
        memory41/valid_zones:Normal Movable
      
      Now, we online block 34 in default mode and block 37 as movable
      
        root@test1:/sys/devices/system/node/node1# echo online > memory34/state
        root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal Movable
        memory40/valid_zones:Normal Movable
        memory41/valid_zones:Normal Movable
      
      As we can see all other blocks can still be onlined both into Normal and
      Movable zones and the Normal is default because the Movable zone spans
      only block37 now.
      
        root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Movable Normal
        memory39/valid_zones:Movable Normal
        memory40/valid_zones:Movable Normal
        memory41/valid_zones:Movable
      
      Now the default zone for blocks 37-41 has changed because movable zone
      spans that range.
      
        root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal
        memory40/valid_zones:Movable Normal
        memory41/valid_zones:Movable
      
      Note that block 39 now belongs to the Normal zone, so block 38 falls
      into Normal by default as well.

      For completeness:
      
        root@test1:/sys/devices/system/node/node1# for i in memory[34]?
        do
      	echo online > $i/state 2>/dev/null
        done
      
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal
        memory36/valid_zones:Normal
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal
        memory39/valid_zones:Normal
        memory40/valid_zones:Movable
        memory41/valid_zones:Movable
      
      Implementation wise the change is quite straightforward.  We can get rid
      of allow_online_pfn_range altogether.  online_pages allows only offline
      nodes already.  The original default_zone_for_pfn will become
      default_kernel_zone_for_pfn.  New default_zone_for_pfn implements the
      above semantic.  zone_for_pfn_range is slightly reorganized to implement
      kernel and movable online type explicitly and MMOP_ONLINE_KEEP becomes a
      catch all default behavior.
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6f03e29
    • mm, memory_hotplug: display allowed zones in the preferred ordering · e5e68930
      Michal Hocko authored
      Prior to commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") we used to allow changing the
      valid zone types of a memory block if it is adjacent to a different zone
      type.
      
      This fact was reflected in memoryNN/valid_zones by the ordering of
      printed zones.  The first one was default (echo online > memoryNN/state)
      and the other one could be onlined explicitly by online_{movable,kernel}.
      
      This behavior was removed by the said patch and as such the ordering was
      not all that important.  In most cases a kernel zone would be default
      anyway.  The only exception is movable_node handled by "mm,
      memory_hotplug: support movable_node for hotpluggable nodes".
      
      Let's reintroduce this behavior because a later patch will remove the
      zone overlap restriction, and so the user will be allowed to online a
      kernel resp.  movable block regardless of its placement.  The original
      behavior will then become significant again because it would be
      non-trivial for users to see what the default zone to online into is.
      
      The implementation is really simple.  Pull the zone selection out of
      move_pfn_range into a zone_for_pfn_range helper and use it in
      show_valid_zones to display the zone for default onlining and then both
      kernel and movable if they are allowed.  Default online zone is not
      duplicated.
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5e68930
    • mm/memory_hotplug: just build zonelist for newly added node · c1152583
      Wei Yang authored
      Commit 9adb62a5 ("mm/hotplug: correctly setup fallback zonelists
      when creating new pgdat") tries to build the correct zonelist for a
      newly added node, while it is not necessary to rebuild it for already
      existing nodes.
      
      In build_zonelists(), it will iterate on nodes with memory.  For a newly
      added node, it will not have memory until node_states_set_node() is
      called in online_pages().
      
      This patch avoids rebuilding the zonelists for already existing nodes.
      
      build_zonelists_node() uses managed_zone(zone) checks, so it should not
      include empty zones anyway.  So effectively we avoid some pointless work
      under stop_machine().
      
      [akpm@linux-foundation.org: tweak comment text]
      [akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
      Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1152583
    • drm/i915: wire up shrinkctl->nr_scanned · 912d572d
      Chris Wilson authored
      shrink_slab() allows us to report back the number of objects we
      successfully scanned (out of the target shrinkctl->nr_to_scan).  As we
      report the number of pages owned by each GEM object as a separate item
      to the shrinker, we cannot precisely control the number of shrinker
      objects we scan on each pass; and indeed may free more than requested.
      If we fail to tell the shrinker about the number of objects we process,
      it will continue to hold a grudge against us as any objects left
      unscanned are added to the next reclaim -- and so we will keep on
      "unfairly" shrinking our own slab in comparison to other slabs.
      
      Link: http://lkml.kernel.org/r/20170822135325.9191-2-chris@chris-wilson.co.uk
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      912d572d
    • mm: track actual nr_scanned during shrink_slab() · d460acb5
      Chris Wilson authored
      Some shrinkers may only be able to free a bunch of objects at a time,
      and so free more than the requested nr_to_scan in one pass, whilst
      other shrinkers may find themselves unable to scan as many objects as
      they counted, and so underreport.

      Account for the extra
      freed/scanned objects against the total number of objects we intend to
      scan, otherwise we may end up penalising the slab far more than
      intended.  Similarly, we want to add the underperforming scan to the
      deferred pass so that we try harder and harder in future passes.
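
      For a shrinker author the new accounting looks roughly like this (the
      callback body is hypothetical; nr_scanned is the field this patch adds
      to struct shrink_control):

        static unsigned long my_cache_scan(struct shrinker *shrink,
                                           struct shrink_control *sc)
        {
                unsigned long freed = 0, scanned = 0;

                /* ... walk the cache, freeing objects and counting both ... */

                /* report what was actually scanned; it may exceed or fall
                 * short of sc->nr_to_scan, and vmscan now accounts for that
                 * instead of assuming exactly nr_to_scan were processed */
                sc->nr_scanned = scanned;
                return freed;
        }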
      
      Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d460acb5
    • mm/slub.c: add a naive detection of double free or corruption · ce6fa91b
      Alexander Popov authored
      Add an assertion similar to the "fasttop" check in the GNU C Library
      allocator as part of the SLAB_FREELIST_HARDENED feature.  An object added to a
      singly linked freelist should not point to itself.  That helps to detect
      some double free errors (e.g. CVE-2017-2636) without slub_debug and
      KASAN.
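
      The check itself is tiny: when an object is pushed onto the freelist,
      its free pointer must not end up pointing back at the object itself.
      Roughly:

        static inline void set_freepointer(struct kmem_cache *s, void *object,
                                           void *fp)
        {
        #ifdef CONFIG_SLAB_FREELIST_HARDENED
                BUG_ON(object == fp);   /* naive detection of double free */
        #endif
                *(void **)(object + s->offset) = fp;
        }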
      
      Link: http://lkml.kernel.org/r/1502468246-1262-1-git-send-email-alex.popov@linux.com
      Signed-off-by: Alexander Popov <alex.popov@linux.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Paul E McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tycho Andersen <tycho@docker.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce6fa91b
    • mm: add SLUB free list pointer obfuscation · 2482ddec
      Kees Cook authored
      This SLUB free list pointer obfuscation code is modified from Brad
      Spengler/PaX Team's code in the last public patch of grsecurity/PaX
      based on my understanding of the code.  Changes or omissions from the
      original code are mine and don't reflect the original grsecurity/PaX
      code.
      
      This adds a per-cache random value to SLUB caches that is XORed with
      their freelist pointer address and value.  This adds nearly zero
      overhead and frustrates the very common heap overflow exploitation
      method of overwriting freelist pointers.
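
      The obfuscation boils down to storing an encoded value instead of the
      raw freelist pointer, roughly as below (s->random being the per-cache
      secret):

        static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                         unsigned long ptr_addr)
        {
        #ifdef CONFIG_SLAB_FREELIST_HARDENED
                /* XOR with a per-cache secret and the location of the pointer,
                 * so an overwritten or leaked freelist entry is useless as-is */
                return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
        #else
                return ptr;
        #endif
        }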
      
      A recent example of the attack is written up here:
      
        http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit
      
      and there is a section dedicated to the technique in the book "A Guide
      to Kernel Exploitation: Attacking the Core".
      
      This is based on patches by Daniel Micay, and refactored to minimize the
      use of #ifdef.
      
      With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
      run times:
      
       before:
       	mean 10.11882499999999999995
      	variance .03320378329145728642
      	stdev .18221905304181911048
      
        after:
      	mean 10.12654000000000000014
      	variance .04700556623115577889
      	stdev .21680767106160192064
      
      The difference gets lost in the noise, but if the above is to be taken
      literally, using CONFIG_SLAB_FREELIST_HARDENED is 0.07% slower.
      
      Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Suggested-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tycho Andersen <tycho@docker.com>
      Cc: Alexander Popov <alex.popov@linux.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2482ddec
    • slub: tidy up initialization ordering · ea37df54
      Alexander Potapenko authored
       - free_kmem_cache_nodes() frees the cache node before nulling out a
         reference to it
      
       - init_kmem_cache_nodes() publishes the cache node before initializing
         it
      
      Neither of these matter at runtime because the cache nodes cannot be
      looked up by any other thread.  But it's neater and more consistent to
      reorder these.
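
      The reordering amounts to unpublishing a pointer before freeing what it
      points to, e.g. in free_kmem_cache_nodes() (sketch, loop context
      omitted):

        struct kmem_cache_node *n = s->node[node];

        s->node[node] = NULL;                   /* drop the reference first... */
        kmem_cache_free(kmem_cache_node, n);    /* ...then free the node */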
      
      Link: http://lkml.kernel.org/r/20170707083408.40410-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea37df54
    • ocfs2: clean up some dead code · 964f14a0
      Jun Piao authored
      Clean up some unused functions and parameters.
      
      Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964f14a0
    • ocfs2: make ocfs2_set_acl() static · 01ffb56b
      Jan Kara authored
      The function is never called outside of fs/ocfs2/acl.c.
      
      Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01ffb56b
    • modpost: simplify sec_name() · 6124c04c
      Masahiro Yamada authored
      There is code duplication between sec_name() and sech_name().  Simplify
      sec_name() by re-using sech_name().  Also, move them up to remove the
      forward declaration of sec_name().
      
      Link: http://lkml.kernel.org/r/1502248721-22009-1-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6124c04c
    • dax: initialize variable pfn before using it · 2f52074d
      Nicolas Iooss authored
      dax_pmd_insert_mapping() contains the following code:
      
              pfn_t pfn;
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  goto fallback;
              /* ... */
          fallback:
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
      
      When the condition in the if statement fails, the function calls
      trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.
      
      This issue has been found while building the kernel with clang.  The
      compiler reported:
      
          fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
          whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          fs/dax.c:1310:60: note: uninitialized use occurs here
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
                                                                           ^~~
      
      Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f52074d
    • dax: use PG_PMD_COLOUR instead of open coding · 917f3452
      Ross Zwisler authored
      Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
      equivalent page offset mask.
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      917f3452
    • dax: explain how read(2)/write(2) addresses are validated · a2e050f5
      Ross Zwisler authored
      Add a comment explaining how the user addresses provided to read(2) and
      write(2) are validated in the DAX I/O path.
      
      We call dax_copy_from_iter() or copy_to_iter() on these without calling
      access_ok() first in the DAX code, and there was a concern that the user
      might be able to read/write to arbitrary kernel addresses with this
      path.
      
      Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2e050f5
    • dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Ross Zwisler authored
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      527b19d0
    • Ross Zwisler's avatar
      dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Ross Zwisler authored
       Now that we no longer insert struct page pointers in DAX radix trees, we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
       Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
       Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
       Suggested-by: Jan Kara <jack@suse.cz>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d01ad197
    • Ross Zwisler's avatar
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler authored
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
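
       A condensed sketch of the resulting hole-fault path (error handling
       and tracing are dropped, and the helper signatures and the
       RADIX_DAX_ZERO_PAGE flag name are as this series is assumed to define
       them; this is a sketch, not the exact kernel code):

       	static int dax_load_hole(struct address_space *mapping, void *entry,
       				 struct vm_fault *vmf)
       	{
       		struct page *zero_page = ZERO_PAGE(0);

       		/* Record a DAX exceptional entry tagged as a zero page
       		 * instead of inserting the struct page pointer itself. */
       		dax_insert_mapping_entry(mapping, vmf, entry, 0,
       					 RADIX_DAX_ZERO_PAGE);

       		/* Map the single shared, read-only zero page into the VMA. */
       		vm_insert_mixed(vmf->vma, vmf->address,
       				page_to_pfn_t(zero_page));
       		return VM_FAULT_NOPAGE;
       	}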
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
           This distinction is made via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
       Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
       Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
    • Ross Zwisler's avatar
      dax: relocate some dax functions · e30331ff
      Ross Zwisler authored
      dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
      needs to be moved lower in dax.c so the definition exists.
      
      dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
      made static to dax.c, so we need to move its definition above all its
      callers.
      
       Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
       Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e30331ff
    • Ross Zwisler's avatar
      mm: add vm_insert_mixed_mkwrite() · b2770da6
      Ross Zwisler authored
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.  This has three major
      drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory.
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it.
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      This series solves these issues by following the lead of the DAX PMD
      code and using a common 4k zero page instead.  This reduces memory usage
      and decreases latencies for some workloads, and it simplifies the DAX
      code, removing over 100 lines in total.
      
      This patch (of 5):
      
      To be able to use the common 4k zero page in DAX we need to have our PTE
      fault path look more like our PMD fault path where a PTE entry can be
      marked as dirty and writeable as it is first inserted rather than
      waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
      call.
      
      Right now we can rely on having a dax_pfn_mkwrite() call because we can
      distinguish between these two cases in do_wp_page():
      
      	case 1: 4k zero page => writable DAX storage
      	case 2: read-only DAX storage => writeable DAX storage
      
       This distinction is made via vm_normal_page().  vm_normal_page()
      returns false for the common 4k zero page, though, just as it does for
      DAX ptes.  Instead of special casing the DAX + 4k zero page case we will
      simplify our DAX PTE page fault sequence so that it matches our DAX PMD
      sequence, and get rid of the dax_pfn_mkwrite() helper.  We will instead
      use dax_iomap_fault() to handle write-protection faults.
      
      This means that insert_pfn() needs to follow the lead of
      insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag.  If 'mkwrite'
      is set insert_pfn() will do the work that was previously done by
      wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
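
       A condensed sketch of what that looks like in mm/memory.c (the devmap
       case, the existing-pfn sanity checks and some error paths are elided;
       this shows the shape of the change, not the exact diff):

       	static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
       			      pfn_t pfn, pgprot_t prot, bool mkwrite)
       	{
       		struct mm_struct *mm = vma->vm_mm;
       		spinlock_t *ptl;
       		pte_t *pte, entry;
       		int ret = -ENOMEM;

       		pte = get_locked_pte(mm, addr, &ptl);
       		if (!pte)
       			return ret;

       		ret = -EBUSY;
       		if (!pte_none(*pte)) {
       			if (!mkwrite)
       				goto out_unlock;	/* unchanged: already mapped */
       			entry = *pte;			/* mkwrite: reuse existing PTE */
       		} else {
       			entry = pte_mkspecial(pfn_t_pte(pfn, prot));
       		}

       		if (mkwrite) {
       			/* the work wp_page_reuse() did on the old mkwrite path */
       			entry = pte_mkyoung(entry);
       			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
       		}

       		set_pte_at(mm, addr, pte, entry);
       		update_mmu_cache(vma, addr, pte);
       		ret = 0;
       	out_unlock:
       		pte_unmap_unlock(pte, ptl);
       		return ret;
       	}

       vm_insert_mixed_mkwrite() is then, roughly, vm_insert_mixed() with
       mkwrite passed as true down to this helper.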
      
       Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
       Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
       Reviewed-by: Jan Kara <jack@suse.cz>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2770da6