Commit ffcb5f52 authored by Nhat Pham's avatar Nhat Pham Committed by Andrew Morton

workingset: refactor LRU refault to expose refault recency check

Patch series "cachestat: a new syscall for page cache state of files",
v13.

There is currently no good way to query the page cache statistics of large
files and directory trees.  There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case.  The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.

Some use cases where this information could come in handy:
  * Allowing database to decide whether to perform an index scan or direct
    table queries based on the in-memory cache state of the index.
  * Visibility into the writeback algorithm, for performance issues
    diagnostic.
  * Workload-aware writeback pacing: estimating IO fulfilled by page cache
    (and IO to be done) within a range of a file, allowing for more
    frequent syncing when and where there is IO capacity, and batching
    when there is not.
  * Computing memory usage of large files/directory trees, analogous to
    the du tool for disk usage.

More information about these use cases could be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes.  It also include a selftest suite that tests some typical
usage.  Currently, the syscall is only wired in for x86 architecture.

This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore.  Relevant
links:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html


I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility.  User can choose between
mincore or cachestat (with cachestat exporting more information than
mincore).  To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta's server machine, each for five
runs:

Using cachestat
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689

Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046

I also ran both syscalls on a 2TB sparse file:

Using cachestat:
real    0m0.009s
user    0m0.000s
sys     0m0.009s

Using mincore:
real    0m37.510s
user    0m2.934s
sys     0m34.558s

Very large files like this are the pathological case for mincore.  In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree!  This could
easily happen inadvertently when we run it on subdirectories.  Mincore is
clearly not suitable for a general-purpose command line tool.

Regarding security concerns, cachestat() should not pose any additional
issues.  The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat).  This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.

The latest API change (in v13 of the patch series) is suggested by Jens
Axboe.  It allows for 64-bit length argument, even on 32-bit architecture
(which is previously not possible due to the limit on the number of
syscall arguments).  Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.


This patch (of 4):

In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
would test if an evicted page is recently evicted.

[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
  Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.comSigned-off-by: default avatarNhat Pham <nphamcs@gmail.com>
Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
parent 18b1d18b
......@@ -368,6 +368,7 @@ static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry)
}
/* linux/mm/workingset.c */
bool workingset_test_recent(void *shadow, bool file, bool *workingset);
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
void workingset_refault(struct folio *folio, void *shadow);
......
......@@ -255,6 +255,29 @@ static void *lru_gen_eviction(struct folio *folio)
return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
}
/*
* Tests if the shadow entry is for a folio that was recently evicted.
* Fills in @memcgid, @pglist_data, @token, @workingset with the values
* unpacked from shadow.
*/
static bool lru_gen_test_recent(void *shadow, bool file, int *memcgid,
struct pglist_data **pgdat, unsigned long *token, bool *workingset)
{
struct mem_cgroup *eviction_memcg;
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
unsigned long min_seq;
unpack_shadow(shadow, memcgid, pgdat, token, workingset);
eviction_memcg = mem_cgroup_from_id(*memcgid);
lruvec = mem_cgroup_lruvec(eviction_memcg, *pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[file]);
return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
}
static void lru_gen_refault(struct folio *folio, void *shadow)
{
int hist, tier, refs;
......@@ -269,23 +292,22 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
int type = folio_is_file_lru(folio);
int delta = folio_nr_pages(folio);
unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
if (pgdat != folio_pgdat(folio))
return;
rcu_read_lock();
if (!lru_gen_test_recent(shadow, type, &memcg_id, &pgdat, &token,
&workingset))
goto unlock;
memcg = folio_memcg_rcu(folio);
if (memcg_id != mem_cgroup_id(memcg))
goto unlock;
if (pgdat != folio_pgdat(folio))
goto unlock;
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[type]);
if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
goto unlock;
hist = lru_hist_from_seq(min_seq);
/* see the comment in folio_lru_refs() */
......@@ -317,6 +339,12 @@ static void *lru_gen_eviction(struct folio *folio)
return NULL;
}
static bool lru_gen_test_recent(void *shadow, bool file, int *memcgid,
struct pglist_data **pgdat, unsigned long *token, bool *workingset)
{
return false;
}
static void lru_gen_refault(struct folio *folio, void *shadow)
{
}
......@@ -385,42 +413,34 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
}
/**
* workingset_refault - Evaluate the refault of a previously evicted folio.
* @folio: The freshly allocated replacement folio.
* @shadow: Shadow entry of the evicted folio.
*
* Calculates and evaluates the refault distance of the previously
* evicted folio in the context of the node and the memcg whose memory
* pressure caused the eviction.
* workingset_test_recent - tests if the shadow entry is for a folio that was
* recently evicted. Also fills in @workingset with the value unpacked from
* shadow.
* @shadow: the shadow entry to be tested.
* @file: whether the corresponding folio is from the file lru.
* @workingset: where the workingset value unpacked from shadow should
* be stored.
*
* Return: true if the shadow is for a recently evicted folio; false otherwise.
*/
void workingset_refault(struct folio *folio, void *shadow)
bool workingset_test_recent(void *shadow, bool file, bool *workingset)
{
bool file = folio_is_file_lru(folio);
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
unsigned long workingset_size;
struct pglist_data *pgdat;
struct mem_cgroup *memcg;
unsigned long eviction;
struct lruvec *lruvec;
unsigned long refault;
bool workingset;
int memcgid;
long nr;
struct pglist_data *pgdat;
unsigned long eviction;
if (lru_gen_enabled()) {
lru_gen_refault(folio, shadow);
return;
}
if (lru_gen_enabled())
return lru_gen_test_recent(shadow, file, &memcgid, &pgdat, &eviction,
workingset);
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
eviction <<= bucket_order;
/* Flush stats (and potentially sleep) before holding RCU read lock */
mem_cgroup_flush_stats_ratelimited();
rcu_read_lock();
/*
* Look up the memcg associated with the stored ID. It might
* have been deleted since the folio's eviction.
......@@ -439,7 +459,8 @@ void workingset_refault(struct folio *folio, void *shadow)
*/
eviction_memcg = mem_cgroup_from_id(memcgid);
if (!mem_cgroup_disabled() && !eviction_memcg)
goto out;
return false;
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
refault = atomic_long_read(&eviction_lruvec->nonresident_age);
......@@ -461,20 +482,6 @@ void workingset_refault(struct folio *folio, void *shadow)
*/
refault_distance = (refault - eviction) & EVICTION_MASK;
/*
* The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
* during folio reclaim is being determined.
*
* However, the cgroup that will own the folio is the one that
* is actually experiencing the refault event.
*/
nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
pgdat = folio_pgdat(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
......@@ -495,7 +502,54 @@ void workingset_refault(struct folio *folio, void *shadow)
NR_INACTIVE_ANON);
}
}
if (refault_distance > workingset_size)
return refault_distance <= workingset_size;
}
/**
* workingset_refault - Evaluate the refault of a previously evicted folio.
* @folio: The freshly allocated replacement folio.
* @shadow: Shadow entry of the evicted folio.
*
* Calculates and evaluates the refault distance of the previously
* evicted folio in the context of the node and the memcg whose memory
* pressure caused the eviction.
*/
void workingset_refault(struct folio *folio, void *shadow)
{
bool file = folio_is_file_lru(folio);
struct pglist_data *pgdat;
struct mem_cgroup *memcg;
struct lruvec *lruvec;
bool workingset;
long nr;
if (lru_gen_enabled()) {
lru_gen_refault(folio, shadow);
return;
}
/* Flush stats (and potentially sleep) before holding RCU read lock */
mem_cgroup_flush_stats_ratelimited();
rcu_read_lock();
/*
* The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
* during folio reclaim is being determined.
*
* However, the cgroup that will own the folio is the one that
* is actually experiencing the refault event.
*/
nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
pgdat = folio_pgdat(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset))
goto out;
folio_set_active(folio);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment