More adaptive ARC eviction.
Traditionally, ARC adaptation was limited to the MRU/MFU distribution.
But for years people with metadata-centric workloads have demanded
mechanisms to also manage the data/metadata distribution, which in
original ZFS was just a FIFO.  As a result, ZFS effectively got
separate states for data and metadata, minimum and maximum metadata
limits, etc., but it all required manual tuning, was not adaptive, and
at its heart remained a bad FIFO.

This change removes most of the existing eviction logic, rewriting it
from scratch.  This makes MRU/MFU adaptation individual for data and
metadata, and makes the distribution between data and metadata
themselves adaptive in the same way.  Since most of the required state
separation was already done, it only required making the arcs_size
state field specific per data/metadata.

The adaptation logic is still based on the previous concept of ghost
hits, but now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata.  To simplify arc_c changes, instead
of arc_p measured in bytes, this code uses 3 variables, arc_meta, arc_pd
and arc_pm, representing the ARC balance between metadata and data, MRU
and MFU for data, and MRU and MFU for metadata respectively, as 32-bit
fixed point fractions.  Since we care about the math result only when
we need to evict, this moves all the logic from arc_adapt() to
arc_evict(), which reduces per-block overhead, since per-block
operations are limited to stats collection, now moved from arc_adapt()
to arc_access() and using cheaper wmsums.  This also allows removing the
ugly ARC_HDR_DO_ADAPT flag from many places.
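
As an illustration of the three fractions described above, here is a
minimal Python sketch that splits a cache size into the four state
targets.  It mirrors the target formulas this commit adds to
cmd/arc_summary (where 4294967296 stands for 1.0); the function name
arc_targets() and the sample values are hypothetical and are not part
of the kernel code.

```python
FRAC_ONE = 1 << 32  # 1.0 as a 32-bit fixed point fraction (4294967296)


def arc_targets(meta, pd, pm, total):
    """Split `total` bytes into four per-state targets.

    meta: fraction of the cache targeted at metadata
    pd:   fraction of the data share targeted at MRU
    pm:   fraction of the metadata share targeted at MRU
    All three are integers in the range [0, FRAC_ONE].
    """
    meta_bytes = total * meta // FRAC_ONE
    data_bytes = total - meta_bytes
    mru_data = data_bytes * pd // FRAC_ONE
    mru_metadata = meta_bytes * pm // FRAC_ONE
    return {
        'mru_data': mru_data,
        'mfu_data': data_bytes - mru_data,
        'mru_metadata': mru_metadata,
        'mfu_metadata': meta_bytes - mru_metadata,
    }


if __name__ == '__main__':
    # Hypothetical example: 1 GiB cache, 25% metadata, both MRU/MFU
    # splits at 50%.
    targets = arc_targets(FRAC_ONE // 4, FRAC_ONE // 2, FRAC_ONE // 2,
                          1 << 30)
    for name, size in targets.items():
        print(name, size)
```

Because the balance is stored as fractions rather than bytes, a change
of the total size rescales all four targets without touching arc_meta,
arc_pd or arc_pm, which is the simplification of arc_c changes that the
paragraph above refers to.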

This change also removes a number of metadata-specific tunables, some
of which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduces a single opaque knob, zfs_arc_meta_balance, tuning
ARC's reaction to ghost hits, allowing an administrator to give more or
less preference to metadata without setting strict limits.
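
To make the effect of such a knob concrete, the following is a
deliberately simplified Python sketch, not the kernel implementation
(module/zfs/arc.c is not shown on this page).  It assumes only what the
zfs.4 text in this commit states: values above 100 proportionally
reduce the effect of ghost data hits.  The function
adjust_meta_fraction(), its parameters and the clamping are
hypothetical.

```python
FRAC_ONE = 1 << 32  # 1.0 as a 32-bit fixed point fraction


def adjust_meta_fraction(frac, total, meta_ghost_hits, data_ghost_hits,
                         balance=500):
    """Return a new metadata fraction after a round of ghost hits.

    Assumption: metadata ghost hits push the fraction up, data ghost
    hits push it down, and the downward push is scaled by 100/balance.
    """
    if total == 0:
        return frac
    up = meta_ghost_hits * FRAC_ONE // total
    down = data_ghost_hits * FRAC_ONE // total
    down = down * 100 // balance  # balance > 100 weakens data hits
    # Keep the result a valid fraction.
    return max(0, min(FRAC_ONE, frac + up - down))


if __name__ == '__main__':
    start = FRAC_ONE // 2
    # Equal ghost hits on both sides: neutral at 100, metadata-leaning
    # at the default of 500.
    print(adjust_meta_fraction(start, 1000, 10, 10, balance=100) == start)
    print(adjust_meta_fraction(start, 1000, 10, 10, balance=500) > start)
```

Under this model the knob sets no hard limit: a large enough run of
ghost data hits still moves the balance toward data, just more slowly.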

Some old code parts, like arc_evict_meta(), are simply removed, because
since the introduction of ABD ARC they really make no sense: only
headers referenced by a small number of buffers are not evictable, and
they are really not evictable no matter what this code does.  Instead
just call arc_prune_async() if too much metadata appears not evictable.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
amotin committed Jan 8, 2023
1 parent a0105f6 commit 8e24992
Showing 10 changed files with 460 additions and 776 deletions.
101 changes: 71 additions & 30 deletions cmd/arc_summary
@@ -270,16 +270,14 @@ def draw_graph(kstats_dict):
arc_perc = f_perc(arc_stats['size'], arc_stats['c_max'])
mfu_size = f_bytes(arc_stats['mfu_size'])
mru_size = f_bytes(arc_stats['mru_size'])
meta_limit = f_bytes(arc_stats['arc_meta_limit'])
meta_size = f_bytes(arc_stats['arc_meta_used'])
dnode_limit = f_bytes(arc_stats['arc_dnode_limit'])
dnode_size = f_bytes(arc_stats['dnode_size'])

info_form = ('ARC: {0} ({1}) MFU: {2} MRU: {3} META: {4} ({5}) '
'DNODE {6} ({7})')
info_form = ('ARC: {0} ({1}) MFU: {2} MRU: {3} META: {4} '
'DNODE {5} ({6})')
info_line = info_form.format(arc_size, arc_perc, mfu_size, mru_size,
meta_size, meta_limit, dnode_size,
dnode_limit)
meta_size, dnode_size, dnode_limit)
info_spc = ' '*int((GRAPH_WIDTH-len(info_line))/2)
info_line = GRAPH_INDENT+info_spc+info_line

@@ -558,16 +556,28 @@ def section_arc(kstats_dict):
arc_target_size = arc_stats['c']
arc_max = arc_stats['c_max']
arc_min = arc_stats['c_min']
anon_size = arc_stats['anon_size']
mfu_size = arc_stats['mfu_size']
mru_size = arc_stats['mru_size']
mfug_size = arc_stats['mfu_ghost_size']
mrug_size = arc_stats['mru_ghost_size']
unc_size = arc_stats['uncached_size']
meta_limit = arc_stats['arc_meta_limit']
meta_size = arc_stats['arc_meta_used']
meta = arc_stats['meta']
pd = arc_stats['pd']
pm = arc_stats['pm']
anon_data = arc_stats['anon_data']
anon_metadata = arc_stats['anon_metadata']
mfu_data = arc_stats['mfu_data']
mfu_metadata = arc_stats['mfu_metadata']
mru_data = arc_stats['mru_data']
mru_metadata = arc_stats['mru_metadata']
mfug_data = arc_stats['mfu_ghost_data']
mfug_metadata = arc_stats['mfu_ghost_metadata']
mrug_data = arc_stats['mru_ghost_data']
mrug_metadata = arc_stats['mru_ghost_metadata']
unc_data = arc_stats['uncached_data']
unc_metadata = arc_stats['uncached_metadata']
bonus_size = arc_stats['bonus_size']
dnode_limit = arc_stats['arc_dnode_limit']
dnode_size = arc_stats['dnode_size']
dbuf_size = arc_stats['dbuf_size']
hdr_size = arc_stats['hdr_size']
l2_hdr_size = arc_stats['l2_hdr_size']
abd_chunk_waste_size = arc_stats['abd_chunk_waste_size']
target_size_ratio = '{0}:1'.format(int(arc_max) // int(arc_min))

prt_2('ARC size (current):',
@@ -578,25 +588,56 @@ def section_arc(kstats_dict):
f_perc(arc_min, arc_max), f_bytes(arc_min))
prt_i2('Max size (high water):',
target_size_ratio, f_bytes(arc_max))
caches_size = int(anon_size)+int(mfu_size)+int(mru_size)+int(unc_size)
prt_i2('Anonymouns data size:',
f_perc(anon_size, caches_size), f_bytes(anon_size))
prt_i2('Most Frequently Used (MFU) cache size:',
f_perc(mfu_size, caches_size), f_bytes(mfu_size))
prt_i2('Most Recently Used (MRU) cache size:',
f_perc(mru_size, caches_size), f_bytes(mru_size))
prt_i1('Most Frequently Used (MFU) ghost size:', f_bytes(mfug_size))
prt_i1('Most Recently Used (MRU) ghost size:', f_bytes(mrug_size))
caches_size = int(anon_data)+int(anon_metadata)+\
int(mfu_data)+int(mfu_metadata)+int(mru_data)+int(mru_metadata)+\
int(unc_data)+int(unc_metadata)
prt_i2('Anonymous data size:',
f_perc(anon_data, caches_size), f_bytes(anon_data))
prt_i2('Anonymous metadata size:',
f_perc(anon_metadata, caches_size), f_bytes(anon_metadata))
s = 4294967296
v = (s-int(pd))*(s-int(meta))/s
prt_i2('MFU data target:', f_perc(v, s),
f_bytes(v / 65536 * caches_size / 65536))
prt_i2('MFU data size:',
f_perc(mfu_data, caches_size), f_bytes(mfu_data))
prt_i1('MFU ghost data size:', f_bytes(mfug_data))
v = (s-int(pm))*int(meta)/s
prt_i2('MFU metadata target:', f_perc(v, s),
f_bytes(v / 65536 * caches_size / 65536))
prt_i2('MFU metadata size:',
f_perc(mfu_metadata, caches_size), f_bytes(mfu_metadata))
prt_i1('MFU ghost metadata size:', f_bytes(mfug_metadata))
v = int(pd)*(s-int(meta))/s
prt_i2('MRU data target:', f_perc(v, s),
f_bytes(v / 65536 * caches_size / 65536))
prt_i2('MRU data size:',
f_perc(mru_data, caches_size), f_bytes(mru_data))
prt_i1('MRU ghost data size:', f_bytes(mrug_data))
v = int(pm)*int(meta)/s
prt_i2('MRU metadata target:', f_perc(v, s),
f_bytes(v / 65536 * caches_size / 65536))
prt_i2('MRU metadata size:',
f_perc(mru_metadata, caches_size), f_bytes(mru_metadata))
prt_i1('MRU ghost metadata size:', f_bytes(mrug_metadata))
prt_i2('Uncached data size:',
f_perc(unc_size, caches_size), f_bytes(unc_size))
prt_i2('Metadata cache size (hard limit):',
f_perc(meta_limit, arc_max), f_bytes(meta_limit))
prt_i2('Metadata cache size (current):',
f_perc(meta_size, meta_limit), f_bytes(meta_size))
prt_i2('Dnode cache size (hard limit):',
f_perc(dnode_limit, meta_limit), f_bytes(dnode_limit))
prt_i2('Dnode cache size (current):',
f_perc(unc_data, caches_size), f_bytes(unc_data))
prt_i2('Uncached metadata size:',
f_perc(unc_metadata, caches_size), f_bytes(unc_metadata))
prt_i2('Bonus size:',
f_perc(bonus_size, arc_size), f_bytes(bonus_size))
prt_i2('Dnode cache target:',
f_perc(dnode_limit, arc_max), f_bytes(dnode_limit))
prt_i2('Dnode cache size:',
f_perc(dnode_size, dnode_limit), f_bytes(dnode_size))
prt_i2('Dbuf size:',
f_perc(dbuf_size, arc_size), f_bytes(dbuf_size))
prt_i2('Header size:',
f_perc(hdr_size, arc_size), f_bytes(hdr_size))
prt_i2('L2 header size:',
f_perc(l2_hdr_size, arc_size), f_bytes(l2_hdr_size))
prt_i2('ABD chunk waste size:',
f_perc(abd_chunk_waste_size, arc_size), f_bytes(abd_chunk_waste_size))
print()

print('ARC hash breakdown:')
5 changes: 2 additions & 3 deletions cmd/zdb/zdb.c
@@ -116,7 +116,6 @@ zdb_ot_name(dmu_object_type_t type)

extern int reference_tracking_enable;
extern int zfs_recover;
extern unsigned long zfs_arc_meta_min, zfs_arc_meta_limit;
extern uint_t zfs_vdev_async_read_max_active;
extern boolean_t spa_load_verify_dryrun;
extern boolean_t spa_mode_readable_spacemaps;
@@ -8634,8 +8633,8 @@ main(int argc, char **argv)
* ZDB does not typically re-read blocks; therefore limit the ARC
* to 256 MB, which can be used entirely for metadata.
*/
zfs_arc_min = zfs_arc_meta_min = 2ULL << SPA_MAXBLOCKSHIFT;
zfs_arc_max = zfs_arc_meta_limit = 256 * 1024 * 1024;
zfs_arc_min = 2ULL << SPA_MAXBLOCKSHIFT;
zfs_arc_max = 256 * 1024 * 1024;
#endif

/*
1 change: 0 additions & 1 deletion include/sys/arc.h
@@ -201,7 +201,6 @@ struct arc_buf {
};

typedef enum arc_buf_contents {
ARC_BUFC_INVALID, /* invalid type */
ARC_BUFC_DATA, /* buffer contains data */
ARC_BUFC_METADATA, /* buffer contains metadata */
ARC_BUFC_NUMTYPES
36 changes: 26 additions & 10 deletions include/sys/arc_impl.h
@@ -81,15 +81,18 @@ typedef struct arc_state {
* supports the "dbufs" kstat
*/
arc_state_type_t arcs_state;
/*
* total amount of data in this state.
*/
zfs_refcount_t arcs_size[ARC_BUFC_NUMTYPES] ____cacheline_aligned;
/*
* total amount of evictable data in this state
*/
zfs_refcount_t arcs_esize[ARC_BUFC_NUMTYPES] ____cacheline_aligned;
zfs_refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
/*
* total amount of data in this state; this includes: evictable,
* non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
* amount of hit bytes for this state (ghost only)
*/
zfs_refcount_t arcs_size;
wmsum_t arcs_hits[ARC_BUFC_NUMTYPES];
} arc_state_t;

typedef struct arc_callback arc_callback_t;
@@ -581,7 +584,9 @@ typedef struct arc_stats {
kstat_named_t arcstat_hash_collisions;
kstat_named_t arcstat_hash_chains;
kstat_named_t arcstat_hash_chain_max;
kstat_named_t arcstat_p;
kstat_named_t arcstat_meta;
kstat_named_t arcstat_pd;
kstat_named_t arcstat_pm;
kstat_named_t arcstat_c;
kstat_named_t arcstat_c_min;
kstat_named_t arcstat_c_max;
@@ -654,6 +659,8 @@ typedef struct arc_stats {
* are all included in this value.
*/
kstat_named_t arcstat_anon_size;
kstat_named_t arcstat_anon_data;
kstat_named_t arcstat_anon_metadata;
/*
* Number of bytes consumed by ARC buffers that meet the
* following criteria: backing buffers of type ARC_BUFC_DATA,
@@ -675,6 +682,8 @@ typedef struct arc_stats {
* are all included in this value.
*/
kstat_named_t arcstat_mru_size;
kstat_named_t arcstat_mru_data;
kstat_named_t arcstat_mru_metadata;
/*
* Number of bytes consumed by ARC buffers that meet the
* following criteria: backing buffers of type ARC_BUFC_DATA,
@@ -699,6 +708,8 @@ typedef struct arc_stats {
* buffers *would have* consumed this number of bytes.
*/
kstat_named_t arcstat_mru_ghost_size;
kstat_named_t arcstat_mru_ghost_data;
kstat_named_t arcstat_mru_ghost_metadata;
/*
* Number of bytes that *would have been* consumed by ARC
* buffers that are eligible for eviction, of type
@@ -718,6 +729,8 @@ typedef struct arc_stats {
* are all included in this value.
*/
kstat_named_t arcstat_mfu_size;
kstat_named_t arcstat_mfu_data;
kstat_named_t arcstat_mfu_metadata;
/*
* Number of bytes consumed by ARC buffers that are eligible for
* eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
@@ -736,6 +749,8 @@ typedef struct arc_stats {
* arcstat_mru_ghost_size for more details.
*/
kstat_named_t arcstat_mfu_ghost_size;
kstat_named_t arcstat_mfu_ghost_data;
kstat_named_t arcstat_mfu_ghost_metadata;
/*
* Number of bytes that *would have been* consumed by ARC
* buffers that are eligible for eviction, of type
@@ -753,6 +768,8 @@ typedef struct arc_stats {
* ARC_FLAG_UNCACHED being set.
*/
kstat_named_t arcstat_uncached_size;
kstat_named_t arcstat_uncached_data;
kstat_named_t arcstat_uncached_metadata;
/*
* Number of data bytes that are going to be evicted from ARC due to
* ARC_FLAG_UNCACHED being set.
@@ -875,10 +892,7 @@ typedef struct arc_stats {
kstat_named_t arcstat_loaned_bytes;
kstat_named_t arcstat_prune;
kstat_named_t arcstat_meta_used;
kstat_named_t arcstat_meta_limit;
kstat_named_t arcstat_dnode_limit;
kstat_named_t arcstat_meta_max;
kstat_named_t arcstat_meta_min;
kstat_named_t arcstat_async_upgrade_sync;
/* Number of predictive prefetch requests. */
kstat_named_t arcstat_predictive_prefetch;
@@ -986,7 +1000,7 @@ typedef struct arc_sums {
wmsum_t arcstat_memory_direct_count;
wmsum_t arcstat_memory_indirect_count;
wmsum_t arcstat_prune;
aggsum_t arcstat_meta_used;
wmsum_t arcstat_meta_used;
wmsum_t arcstat_async_upgrade_sync;
wmsum_t arcstat_predictive_prefetch;
wmsum_t arcstat_demand_hit_predictive_prefetch;
@@ -1014,7 +1028,9 @@ typedef struct arc_evict_waiter {
#define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)

#define arc_no_grow ARCSTAT(arcstat_no_grow) /* do not grow cache size */
#define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
#define arc_meta ARCSTAT(arcstat_meta) /* target frac of metadata */
#define arc_pd ARCSTAT(arcstat_pd) /* target frac of data MRU */
#define arc_pm ARCSTAT(arcstat_pm) /* target frac of meta MRU */
#define arc_c ARCSTAT(arcstat_c) /* target size of cache */
#define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
#define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */
82 changes: 4 additions & 78 deletions man/man4/zfs.4
@@ -548,14 +548,6 @@ This value acts as a ceiling to the amount of dnode metadata, and defaults to
which indicates that a percent which is based on
.Sy zfs_arc_dnode_limit_percent
of the ARC meta buffers that may be used for dnodes.
.Pp
Also see
.Sy zfs_arc_meta_prune
which serves a similar purpose but is used
when the amount of metadata in the ARC exceeds
.Sy zfs_arc_meta_limit
rather than in response to overall demand for non-metadata.
.
.It Sy zfs_arc_dnode_limit_percent Ns = Ns Sy 10 Ns % Pq u64
Percentage that can be consumed by dnodes of ARC meta buffers.
.Pp
@@ -638,62 +630,10 @@ It cannot be set back to
while running, and reducing it below the current ARC size will not cause
the ARC to shrink without memory pressure to induce shrinking.
.
.It Sy zfs_arc_meta_adjust_restarts Ns = Ns Sy 4096 Pq uint
The number of restart passes to make while scanning the ARC attempting
the free buffers in order to stay below the
.Sy fs_arc_meta_limit .
This value should not need to be tuned but is available to facilitate
performance analysis.
.
.It Sy zfs_arc_meta_limit Ns = Ns Sy 0 Ns B Pq u64
The maximum allowed size in bytes that metadata buffers are allowed to
consume in the ARC.
When this limit is reached, metadata buffers will be reclaimed,
even if the overall
.Sy arc_c_max
has not been reached.
It defaults to
.Sy 0 ,
which indicates that a percentage based on
.Sy zfs_arc_meta_limit_percent
of the ARC may be used for metadata.
.Pp
This value my be changed dynamically, except that must be set to an explicit
value
.Pq cannot be set back to Sy 0 .
.
.It Sy zfs_arc_meta_limit_percent Ns = Ns Sy 75 Ns % Pq u64
Percentage of ARC buffers that can be used for metadata.
.Pp
See also
.Sy zfs_arc_meta_limit ,
which serves a similar purpose but has a higher priority if nonzero.
.
.It Sy zfs_arc_meta_min Ns = Ns Sy 0 Ns B Pq u64
The minimum allowed size in bytes that metadata buffers may consume in
the ARC.
.
.It Sy zfs_arc_meta_prune Ns = Ns Sy 10000 Pq int
The number of dentries and inodes to be scanned looking for entries
which can be dropped.
This may be required when the ARC reaches the
.Sy zfs_arc_meta_limit
because dentries and inodes can pin buffers in the ARC.
Increasing this value will cause to dentry and inode caches
to be pruned more aggressively.
Setting this value to
.Sy 0
will disable pruning the inode and dentry caches.
.
.It Sy zfs_arc_meta_strategy Ns = Ns Sy 1 Ns | Ns 0 Pq uint
Define the strategy for ARC metadata buffer eviction (meta reclaim strategy):
.Bl -tag -compact -offset 4n -width "0 (META_ONLY)"
.It Sy 0 Pq META_ONLY
evict only the ARC metadata buffers
.It Sy 1 Pq BALANCED
additional data buffers may be evicted if required
to evict the required number of metadata buffers.
.El
.It Sy zfs_arc_meta_balance Ns = Ns Sy 500 Pq uint
Balance between metadata and data on ghost hits.
Values above 100 increase metadata caching by proportionally reducing effect
of ghost data hits on target data/metadata rate.
.
.It Sy zfs_arc_min Ns = Ns Sy 0 Ns B Pq u64
Min size of ARC in bytes.
@@ -776,20 +716,6 @@ causes the ARC to start reclamation if it exceeds the target size by
of the target size, and block allocations by
.Em 0.6% .
.
.It Sy zfs_arc_p_min_shift Ns = Ns Sy 0 Pq uint
If nonzero, this will update
.Sy arc_p_min_shift Pq default Sy 4
with the new value.
.Sy arc_p_min_shift No is used as a shift of Sy arc_c
when calculating the minumum
.Sy arc_p No size .
.
.It Sy zfs_arc_p_dampener_disable Ns = Ns Sy 1 Ns | Ns 0 Pq int
Disable
.Sy arc_p
adapt dampener, which reduces the maximum single adjustment to
.Sy arc_p .
.
.It Sy zfs_arc_shrink_shift Ns = Ns Sy 0 Pq uint
If nonzero, this will update
.Sy arc_shrink_shift Pq default Sy 7
2 changes: 1 addition & 1 deletion module/os/freebsd/zfs/arc_os.c
@@ -159,7 +159,7 @@ arc_prune_task(void *arg)
/*
* Notify registered consumers they must drop holds on a portion of the ARC
* buffered they reference. This provides a mechanism to ensure the ARC can
* honor the arc_meta_limit and reclaim otherwise pinned ARC buffers. This
* honor the metadata limit and reclaim otherwise pinned ARC buffers. This
* is analogous to dnlc_reduce_cache() but more generic.
*
* This operation is performed asynchronously so it may be safely called