    Revert "percpu: free percpu allocation info for uniprocessor system" · 38dc4ffb
    Honggang Li authored
    This reverts commit 3189eddb ("percpu: free percpu allocation info for
    uniprocessor system").
    
    The commit causes a hang with a crisv32 image. This may be an architecture
    problem, but at least for now the revert is necessary to boot a
    crisv32 image.
    
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Honggang Li <enjoymindful@gmail.com>
    Signed-off-by: Guenter Roeck <linux@roeck-us.net>
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Fixes: 3189eddb ("percpu: free percpu allocation info for uniprocessor system")
    Cc: stable@vger.kernel.org # Please don't apply 3189eddb
    
    percpu-refcount: make percpu_ref based on longs instead of ints
    
    percpu_ref is currently based on ints and the number of refs it can
    cover is (1 << 31).  This makes it impossible to use a percpu_ref to
    count memory objects or pages on 64bit machines as it may overflow.
    This forces those users to somehow aggregate the references before
    contributing to the percpu_ref, which is often cumbersome and makes
    it challenging to match the performance of using the percpu_ref
    directly.
    
    While using ints for the percpu counters makes them pack tighter on
    64bit machines, the possible gain from using ints instead of longs is
    extremely small compared to the overall gain from per-cpu operation.
    This patch makes percpu_ref based on longs so that it can be used to
    directly count memory objects or pages.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
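
    Roughly, the switch amounts to widening the counter types; a sketch
    of the before and after (field names abbreviated from the actual
    struct):

        /* before: 32-bit count; counting 4KB pages overflows at 8TB */
        atomic_t count;
        unsigned __percpu *pcpu_count;

        /* after: machine-word-sized count */
        atomic_long_t count;
        unsigned long __percpu *pcpu_count;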
    
    percpu-refcount: improve WARN messages
    
    percpu_ref's WARN messages can be a lot more helpful by indicating
    who's the culprit.  Make them report the release function that the
    offending percpu-refcount is associated with.  This should make it a
    lot easier to track down the reported invalid refcounting operations.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
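
    The shape of the improved message, as a sketch (exact format string
    approximate; %ps prints the symbol name of a function pointer):

        WARN_ONCE(count <= 0,
                  "percpu ref (%ps) <= 0 (%ld) after killed",
                  ref->release, count);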
    
    percpu: fix locking regression in the failure path of pcpu_alloc()
    
    While updating locking, b38d08f3 ("percpu: restructure locking")
    broke the pcpu_create_chunk() creation path in pcpu_alloc(), which
    returns without releasing pcpu_alloc_mutex.  Fix it.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Julia Lawall <julia.lawall@lip6.fr>
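
    The fix is the usual error-path unlock; a sketch of the shape (the
    fail label is illustrative):

        chunk = pcpu_create_chunk();
        if (!chunk) {
                mutex_unlock(&pcpu_alloc_mutex);        /* was missing */
                err = "failed to allocate new chunk";
                goto fail;
        }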
    
    percpu-refcount: add @gfp to percpu_ref_init()
    
    Percpu allocator now supports allocation mask.  Add @gfp to
    percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_refs too.
    
    This patch doesn't make any functional difference.
    
    v2: blk-mq conversion was missing.  Updated.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <koverstreet@google.com>
    Cc: Benjamin LaHaise <bcrl@kvack.org>
    Cc: Li Zefan <lizefan@huawei.com>
    Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
    Cc: Jens Axboe <axboe@kernel.dk>
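
    A minimal usage sketch with the new parameter (struct foo and its
    release callback are hypothetical):

        static void foo_release(struct percpu_ref *ref)
        {
                struct foo *foo = container_of(ref, struct foo, ref);

                kfree(foo);
        }

        /* GFP_KERNEL keeps the old behaviour; pass e.g. GFP_NOWAIT
           from contexts which can't sleep */
        ret = percpu_ref_init(&foo->ref, foo_release, GFP_KERNEL);
        if (ret)
                return ret;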
    
    proportions: add @gfp to init functions
    
    Percpu allocator now supports allocation mask.  Add @gfp to
    [flex_]proportions init functions so that !GFP_KERNEL allocation masks
    can be used with them too.
    
    This patch doesn't make any functional difference.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Peter Zijlstra <peterz@infradead.org>
    
    percpu_counter: add @gfp to percpu_counter_init()
    
    Percpu allocator now supports allocation mask.  Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.
    
    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion.  This is the one with
    the most users.  Other percpu data structures are a lot easier to
    convert.
    
    This patch doesn't make any functional difference.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Jan Kara <jack@suse.cz>
    Acked-by: "David S. Miller" <davem@davemloft.net>
    Cc: x86@kernel.org
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andrew Morton <akpm@linux-foundation.org>
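
    A minimal usage sketch of the converted interface (the counter name
    is hypothetical):

        struct percpu_counter nr_things;
        int err;

        /* the @gfp argument is new; GFP_KERNEL preserves old behaviour */
        err = percpu_counter_init(&nr_things, 0, GFP_KERNEL);
        if (err)
                return err;

        percpu_counter_inc(&nr_things);
        percpu_counter_destroy(&nr_things);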
    
    percpu_counter: make percpu_counters_lock irq-safe
    
    percpu_counter is scheduled to grow @gfp support to allow atomic
    initialization.  This patch makes percpu_counters_lock irq-safe so
    that it can be safely used from atomic contexts.
    Signed-off-by: Tejun Heo <tj@kernel.org>
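
    The conversion is the standard irqsave pattern; a sketch of the
    registration path under the lock:

        unsigned long flags;

        spin_lock_irqsave(&percpu_counters_lock, flags);
        list_add(&fbc->list, &percpu_counters);
        spin_unlock_irqrestore(&percpu_counters_lock, flags);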
    
    percpu: implement asynchronous chunk population
    
    The percpu allocator now supports atomic allocations by only
    allocating from already populated areas, but the mechanism to ensure
    that an adequate amount of populated area is available was missing.
    
    This patch expands pcpu_balance_work so that in addition to freeing
    excess free chunks it also populates chunks to maintain an adequate
    level of populated areas.  pcpu_alloc() schedules pcpu_balance_work if
    the amount of free populated areas is too low or after an atomic
    allocation failure.
    
    * PERCPU_DYNAMIC_RESERVE is increased by two pages to account for
      PCPU_EMPTY_POP_PAGES_LOW.
    
    * pcpu_async_enabled is added to gate both async jobs -
      chunk->map_extend_work and pcpu_balance_work - so that we don't end
      up scheduling them while the needed subsystems aren't up yet.
    Signed-off-by: Tejun Heo <tj@kernel.org>
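
    The trigger is a simple low-water-mark check; a sketch (names from
    this series, exact placement approximate):

        static void pcpu_schedule_balance_work(void)
        {
                if (pcpu_async_enabled)
                        schedule_work(&pcpu_balance_work);
        }

        /* in pcpu_alloc(): keep populated pages stocked */
        if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
                pcpu_schedule_balance_work();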
    
    percpu: rename pcpu_reclaim_work to pcpu_balance_work
    
    pcpu_reclaim_work will also be used to populate chunks asynchronously.
    Rename it to pcpu_balance_work in preparation.  pcpu_reclaim() is
    renamed to pcpu_balance_workfn() and some of its local variables are
    renamed too.
    
    This is a pure rename.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    
    percpu: implement pcpu_nr_empty_pop_pages and chunk->nr_populated
    
    pcpu_nr_empty_pop_pages counts the number of empty populated pages
    across all chunks and chunk->nr_populated counts the number of
    populated pages in a chunk.  Both will be used to implement pre/async
    population for atomic allocations.
    
    pcpu_chunk_[de]populated() are added to update chunk->populated,
    chunk->nr_populated and pcpu_nr_empty_pop_pages together.  All
    successful chunk [de]populations should be followed by the
    corresponding pcpu_chunk_[de]populated() calls.
    Signed-off-by: Tejun Heo <tj@kernel.org>
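
    A sketch of the population-side helper (the depopulation side is
    symmetric, subtracting instead of adding):

        static void pcpu_chunk_populated(struct pcpu_chunk *chunk,
                                         int page_start, int page_end)
        {
                int nr = page_end - page_start;

                lockdep_assert_held(&pcpu_lock);

                bitmap_set(chunk->populated, page_start, nr);
                chunk->nr_populated += nr;
                pcpu_nr_empty_pop_pages += nr;
        }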
    
    percpu: make sure chunk->map array has available space
    
    An allocation attempt may require extending chunk->map array which
    requires GFP_KERNEL context which isn't available for atomic
    allocations.  This patch ensures that the chunk->map array usually keeps
    some amount of available space by directly allocating buffer space
    during GFP_KERNEL allocations and scheduling async extension during
    atomic ones.  This should make atomic allocation failures from map
    space exhaustion rare.
    Signed-off-by: Tejun Heo <tj@kernel.org>
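
    In rough terms (a sketch; the margin and helper shapes are
    approximate): sleepable allocations extend the map inline while
    atomic ones punt to a work item:

        /* keep headroom in chunk->map before committing to an area */
        if (chunk->map_alloc < chunk->map_used + margin) {
                if (is_atomic)
                        schedule_work(&chunk->map_extend_work);
                else if (pcpu_extend_area_map(chunk, new_alloc))
                        goto fail;      /* couldn't extend the map */
        }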
    
    percpu: implement [__]alloc_percpu_gfp()
    
    Now that pcpu_alloc_area() can allocate only from populated areas,
    it's easy to add atomic allocation support to [__]alloc_percpu().
    Update pcpu_alloc() so that it accepts @gfp and skips all the blocking
    operations and allocates only from the populated areas if @gfp doesn't
    contain GFP_KERNEL.  New interface functions [__]alloc_percpu_gfp()
    are added.
    
    While this means that atomic allocations are possible, this isn't
    complete yet as there's no mechanism to ensure that a certain amount
    of populated area is kept available and atomic allocations may keep
    failing under certain conditions.
    Signed-off-by: Tejun Heo <tj@kernel.org>
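
    A usage sketch of the new interface (struct foo is hypothetical):

        /* sleepable context: equivalent to alloc_percpu() */
        struct foo __percpu *p = alloc_percpu_gfp(struct foo, GFP_KERNEL);

        /* atomic context: served only from already-populated areas and
           thus more likely to fail */
        struct foo __percpu *q = alloc_percpu_gfp(struct foo, GFP_NOWAIT);

        free_percpu(q);
        free_percpu(p);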
    
    percpu: indent the population block in pcpu_alloc()
    
    The next patch will conditionalize the population block in
    pcpu_alloc() which will end up making a rather large indentation
    change obfuscating the actual logic change.  This patch puts the block
    under "if (true)" so that the next patch can avoid indentation
    changes.  The definitions of the local variables which are used only
    in the block are moved into the block.
    
    This patch is purely cosmetic.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    
    percpu: make pcpu_alloc_area() capable of allocating only from populated areas
    
    Update pcpu_alloc_area() so that it can skip unpopulated areas if the
    new parameter @pop_only is true.  This is implemented by a new
    function, pcpu_fit_in_area(), which determines the amount of head
    padding considering the alignment and populated state.
    
    @pop_only is currently always false but this will be used to implement
    atomic allocation.
    Signed-off-by: Tejun Heo <tj@kernel.org>
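
    Conceptually, the fit test adds a populated-pages check on top of
    the usual size and alignment check; a simplified sketch:

        /* every page backing [off, off + size) must be populated */
        page_start = PFN_DOWN(off);
        page_end = PFN_UP(off + size);
        for (page = page_start; page < page_end; page++)
                if (!test_bit(page, chunk->populated))
                        return -1;      /* doesn't fit here */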
    
    percpu: restructure locking
    
    At first, the percpu allocator required a sleepable context for both
    alloc and free paths and used pcpu_alloc_mutex to protect everything.
    Later, pcpu_lock was introduced to protect the index data structure so
    that the free path can be invoked from atomic contexts.  The
    conversion only updated what's necessary and left most of the
    allocation path under pcpu_alloc_mutex.
    
    The percpu allocator is planned to gain atomic allocation support
    and this patch restructures locking so that the coverage of
    pcpu_alloc_mutex is further reduced.
    
    * pcpu_alloc() now grabs pcpu_alloc_mutex only while creating a new
      chunk and populating the allocated area.  Everything else is now
      protected solely by pcpu_lock.
    
      After this change, multiple instances of pcpu_extend_area_map() may
      race but the function already implements sufficient synchronization
      using pcpu_lock.
    
      This also allows multiple allocators to arrive at new chunk
      creation.  To avoid creating multiple empty chunks back-to-back, a
      new chunk is created iff there is no other empty chunk after
      grabbing pcpu_alloc_mutex.
    
    * pcpu_lock is now held while modifying chunk->populated bitmap.
      After this, all data structures are protected by pcpu_lock.
    Signed-off-by: Tejun Heo <tj@kernel.org>
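
    The resulting coverage, sketched rather than literal:

        spin_lock_irqsave(&pcpu_lock, flags);
        /* index data structures, area alloc/free, populated bitmap */
        spin_unlock_irqrestore(&pcpu_lock, flags);

        mutex_lock(&pcpu_alloc_mutex);
        /* only chunk creation and area population */
        mutex_unlock(&pcpu_alloc_mutex);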
    
    percpu: make percpu-km set chunk->populated bitmap properly
    
    percpu-km instantiates the whole chunk on creation, doesn't make use
    of the chunk->populated bitmap, and leaves it as zero.  While this
    currently doesn't cause any problem, the inconsistency makes it
    difficult to build further logic on top of chunk->populated.  This
    patch makes percpu-km fill chunk->populated on creation so that the
    bitmap is always consistent.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Christoph Lameter <cl@linux.com>
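
    A sketch of the fix in percpu-km's chunk creation (the page count
    variable is approximate):

        /* the whole chunk is backed at creation; say so in the bitmap */
        bitmap_fill(chunk->populated, nr_pages);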
    
    percpu: move region iterations out of pcpu_[de]populate_chunk()
    
    Previously, pcpu_[de]populate_chunk() were called with a range which
    may contain multiple target regions, and they iterated over those
    regions themselves.  This has the
    benefit of batching up cache flushes for all the regions; however,
    we're planning to add more bookkeeping logic around [de]population to
    support atomic allocations and this delegation of iterations gets in
    the way.
    
    This patch moves the region iterations out of
    pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
    pcpu_reclaim() - so that we can later add logic to track more states
    around them.  This change may make cache and tlb flushes more frequent
    but multi-region [de]populations are rare anyway and if this actually
    becomes a problem, it's not difficult to factor out cache flushes as
    separate callbacks which are directly invoked from percpu.c.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    
    percpu: move common parts out of pcpu_[de]populate_chunk()
    
    percpu-vm and percpu-km implement separate versions of
    pcpu_[de]populate_chunk(), and some parts which are or should be
    common currently live in the specific implementations.  Make the
    following changes.
    
    * Allocated area clearing is moved from the pcpu_populate_chunk()
      implementations to pcpu_alloc().  This makes percpu-km's version a
      noop.
    
    * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
      to their respective callers so that they are applied to percpu-km
      too.  This doesn't make any meaningful difference as both functions
    are noops for percpu-km; however, this is more consistent and will
      help implementing atomic allocation support.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    
    percpu: remove @may_alloc from pcpu_get_pages()
    
    pcpu_get_pages() creates the temp pages array if not already allocated
    and returns the pointer to it.  As the function is called from both
    [de]population paths and depopulation can only happen after at least
    one successful population, the param doesn't make any difference - the
    allocation will always happen on the population path anyway.
    
    Remove @may_alloc from pcpu_get_pages().  Also, add a lockdep
    assertion on pcpu_alloc_mutex instead of vaguely stating that the
    exclusion is the caller's responsibility.
    Signed-off-by: Tejun Heo <tj@kernel.org>
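
    A sketch of the simplified helper (allocation via pcpu_mem_zalloc()
    as used elsewhere in percpu.c):

        static struct page **pcpu_get_pages(void)
        {
                static struct page **pages;
                size_t pages_size = pcpu_nr_units * pcpu_unit_pages *
                                    sizeof(pages[0]);

                lockdep_assert_held(&pcpu_alloc_mutex);

                if (!pages)
                        pages = pcpu_mem_zalloc(pages_size);
                return pages;
        }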
    
    percpu: remove the usage of separate populated bitmap in percpu-vm
    
    percpu-vm uses pcpu_get_pages_and_bitmap() to acquire temp pages array
    and populated bitmap and uses the two during [de]population.  The temp
    bitmap is used only to build the new bitmap that is copied to
    chunk->populated after the operation succeeds; however, the new bitmap
    can be trivially set after success without using the temp bitmap.
    
    This patch removes the temp populated bitmap usage from percpu-vm.c.
    
    * pcpu_get_pages_and_bitmap() is renamed to pcpu_get_pages() and no
      longer hands out the temp bitmap.
    
    * @populated argument is dropped from all the related functions.
      @populated updates in pcpu_[un]map_pages() are dropped.
    
    * Two loops in pcpu_map_pages() are merged.
    
    * pcpu_[de]populate_chunk() modify chunk->populated bitmap directly
      from @page_start and @page_end after success.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: free percpu allocation info for uniprocessor system
    
    Currently, only SMP systems free the percpu allocation info;
    uniprocessor systems should free it too.  For example, on an x86 UML
    virtual machine with 256MB of memory, the UML kernel wastes one page
    of memory.
    Signed-off-by: Honggang Li <enjoymindful@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: stable@vger.kernel.org
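
    The fix amounts to one call in the UP setup path; a sketch:

        /* !SMP setup_per_cpu_areas(): @ai is only needed during setup */
        if (pcpu_setup_first_chunk(ai, fc) < 0)
                panic("Failed to initialize percpu areas.");
        pcpu_free_alloc_info(ai);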
    
    (cherry picked from commits bb2e226b and 3189eddb)
    Signed-off-by: Sasha Levin <sasha.levin@oracle.com>