ZFS Module Parameters (by Richard Elling on Aug 2, 2018)
DRAFT-DRAFT-DRAFT
The ZFS kernel module parameters are accessible in the SysFS
/sys/module/zfs/parameters
directory. The current value of a parameter can be observed with
cat /sys/module/zfs/parameters/PARAMETER
Many of these can be changed by writing new values. These are denoted by Change|Dynamic in the PARAMETER details below.
echo NEWVALUE >> /sys/module/zfs/parameters/PARAMETER
If the parameter is not dynamically adjustable, an error can occur and the value will not be set. It can be helpful to check the permissions for the PARAMETER file in SysFS.
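As a minimal sketch of the read and write steps above, the following shell functions wrap the cat and echo commands. The function names and the ZFS_PARAM_DIR variable are hypothetical and exist only for illustration. The directory defaults to the SysFS path given above, and writing normally requires root.

```shell
# Hypothetical helpers, not part of ZFS. ZFS_PARAM_DIR is an assumption
# that lets the functions be pointed at another directory for testing.
ZFS_PARAM_DIR="${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}"

# Print the current value of one parameter.
zfs_param_get() {
    cat "$ZFS_PARAM_DIR/$1"
}

# Try to set a parameter, then read it back to confirm the new value.
zfs_param_set() {
    if ! echo "$2" > "$ZFS_PARAM_DIR/$1"; then
        echo "could not set $1 (not dynamic, or insufficient permissions?)" >&2
        return 1
    fi
    zfs_param_get "$1"
}
```

Reading the value back after the write is a cheap way to confirm that the change was accepted.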
In some cases, the parameter must be set prior to loading the kernel modules
or it is desired to have the parameters set automatically at boot time. For
many distros, this can be accomplished by creating a file named
/etc/modprobe.d/zfs.conf
containing text lines of the format
options zfs PARAMETER=VALUE
See the man page for modprobe.d for
more information.
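To illustrate the file format, the sketch below prints one options line for the zfs module from a list of PARAMETER=VALUE pairs. The function name is hypothetical, and the values shown are placeholders rather than tuning recommendations.

```shell
# Print a modprobe.d "options" line for the zfs module.
# Each argument is one PARAMETER=VALUE pair.
zfs_modprobe_line() {
    printf 'options zfs'
    for setting in "$@"; do
        printf ' %s' "$setting"
    done
    printf '\n'
}

# Illustrative values only.
zfs_modprobe_line zfs_arc_max=4294967296 l2arc_noprefetch=0
```

The output can be redirected into /etc/modprobe.d/zfs.conf by a privileged user. It takes effect the next time the module is loaded.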
The parameters of the ZFS kernel module, zfs.ko, are detailed below.
To observe the list of parameters along with a short synopsis of each
parameter, use the modinfo
command:
modinfo zfs
The list of parameters is quite large and resists hierarchical representation. To assist in finding relevant information quickly, each module parameter has a "Tags" row with keywords for frequent searches.
- dmu_object_alloc_chunk_shift
- metaslab_aliquot
- metaslab_bias_enabled
- metaslab_debug_load
- metaslab_debug_unload
- metaslab_fragmentation_factor_enabled
- zfs_metaslab_fragmentation_threshold
- metaslab_lba_weighting_enabled
- metaslab_preload_enabled
- zfs_metaslab_segment_weight_enabled
- zfs_metaslab_switch_threshold
- metaslabs_per_vdev
- zfs_mg_fragmentation_threshold
- zfs_mg_noalloc_threshold
- spa_asize_inflation
- spa_load_verify_data
- spa_slop_shift
- zfs_arc_average_blocksize
- zfs_arc_dnode_limit
- zfs_arc_dnode_limit_percent
- zfs_arc_dnode_reduce_percent
- zfs_arc_evict_batch_limit
- zfs_arc_grow_retry
- zfs_arc_lotsfree_percent
- zfs_arc_max
- zfs_arc_meta_adjust_restarts
- zfs_arc_meta_limit
- zfs_arc_meta_limit_percent
- zfs_arc_meta_min
- zfs_arc_meta_prune
- zfs_arc_meta_strategy
- zfs_arc_min
- zfs_arc_min_prefetch_ms
- zfs_arc_min_prescient_prefetch_ms
- zfs_arc_overflow_shift
- zfs_arc_p_dampener_disable
- zfs_arc_p_min_shift
- zfs_arc_pc_percent
- zfs_arc_shrink_shift
- zfs_arc_sys_free
- zfs_disable_dup_eviction
- l2arc_feed_again
- l2arc_feed_min_ms
- l2arc_feed_secs
- l2arc_headroom
- l2arc_headroom_boost
- l2arc_nocompress
- l2arc_noprefetch
- l2arc_norw
- l2arc_write_boost
- l2arc_write_max
- zfs_multilist_num_sublists
- zfs_dbgmsg_enable
- zfs_dbgmsg_maxsize
- zfs_dbuf_state_index
- zfs_deadman_checktime_ms
- zfs_deadman_enabled
- zfs_deadman_failmode
- zfs_deadman_synctime_ms
- zfs_deadman_ziotime_ms
- zfs_flags
- zfs_free_leak_on_eio
- zfs_nopwrite_enabled
- zfs_object_mutex_size
- zfs_read_history
- zfs_read_history_hits
- zfs_txg_history
- zfs_zevent_cols
- zfs_zevent_console
- zfs_zevent_len_max
- zil_replay_disable
- zio_delay_max
- zfs_delete_blocks
- zfs_free_bpobj_enabled
- zfs_free_max_blocks
- zfs_free_min_time_ms
- zfs_per_txg_dirty_frees_percent
- zfs_admin_snapshot
- zfs_delete_blocks
- zfs_expire_snapshot
- zfs_free_max_blocks
- zfs_max_recordsize
- zfs_read_chunk_size
- zfs_autoimport_disable
- zfs_multihost_fail_intervals
- zfs_multihost_history
- zfs_multihost_import_intervals
- zfs_multihost_interval
- zfs_recover
- spa_config_path
- spa_load_verify_maxinflight
- spa_load_verify_metadata
- zvol_inhibit_dev
- l2arc_feed_again
- l2arc_feed_min_ms
- l2arc_feed_secs
- l2arc_headroom
- l2arc_headroom_boost
- l2arc_nocompress
- l2arc_noprefetch
- l2arc_norw
- l2arc_write_boost
- l2arc_write_max
- zfs_abd_scatter_enabled
- zfs_abd_scatter_max_order
- zfs_arc_average_blocksize
- zfs_arc_grow_retry
- zfs_arc_lotsfree_percent
- zfs_arc_max
- zfs_arc_pc_percent
- zfs_arc_shrink_shift
- zfs_arc_sys_free
- zfs_dedup_prefetch
- zfs_max_recordsize
- metaslab_debug_load
- metaslab_debug_unload
- zfs_scan_mem_lim_fact
- zfs_scan_strict_mem_lim
- metaslab_aliquot
- metaslab_bias_enabled
- metaslab_debug_load
- metaslab_debug_unload
- metaslab_fragmentation_factor_enabled
- metaslab_lba_weighting_enabled
- metaslab_preload_enabled
- zfs_metaslab_segment_weight_enabled
- zfs_metaslab_switch_threshold
- metaslabs_per_vdev
- zfs_vdev_mirror_non_rotating_inc
- zfs_vdev_mirror_non_rotating_seek_inc
- zfs_vdev_mirror_rotating_inc
- zfs_vdev_mirror_rotating_seek_inc
- zfs_vdev_mirror_rotating_seek_offset
- zfs_multihost_fail_intervals
- zfs_multihost_history
- zfs_multihost_import_intervals
- zfs_multihost_interval
- zfs_arc_min_prefetch_ms
- zfs_arc_min_prescient_prefetch_ms
- zfs_dedup_prefetch
- l2arc_noprefetch
- zfs_no_scrub_prefetch
- zfs_pd_bytes_max
- zfs_prefetch_disable
- zfetch_array_rd_sz
- zfetch_max_distance
- zfetch_max_streams
- zfetch_min_sec_reap
- zvol_prefetch_bytes
- zfs_resilver_min_time_ms
- zfs_scan_checkpoint_intval
- zfs_scan_fill_weight
- zfs_scan_issue_strategy
- zfs_scan_legacy
- zfs_scan_max_ext_gap
- zfs_scan_mem_lim_fact
- zfs_scan_mem_lim_soft_fact
- zfs_scan_strict_mem_lim
- zfs_scan_vdev_limit
- zfs_vdev_scrub_max_active
- zfs_vdev_scrub_min_active
- zfs_no_scrub_io
- zfs_no_scrub_prefetch
- zfs_scan_checkpoint_intval
- zfs_scan_fill_weight
- zfs_scan_issue_strategy
- zfs_scan_legacy
- zfs_scan_max_ext_gap
- zfs_scan_mem_lim_fact
- zfs_scan_mem_lim_soft_fact
- zfs_scan_strict_mem_lim
- zfs_scan_vdev_limit
- zfs_scrub_min_time_ms
- zfs_vdev_scrub_max_active
- zfs_vdev_scrub_min_active
- spa_asize_inflation
- spa_load_verify_data
- spa_slop_shift
- zfs_sync_pass_deferred_free
- zfs_sync_pass_dont_compress
- zfs_sync_pass_rewrite
- zfs_sync_taskq_batch_pct
- zfs_txg_timeout
- metaslab_aliquot
- metaslab_bias_enabled
- zfs_metaslab_fragmentation_threshold
- metaslabs_per_vdev
- zfs_mg_fragmentation_threshold
- zfs_mg_noalloc_threshold
- zfs_multihost_interval
- zfs_scan_vdev_limit
- zfs_vdev_aggregation_limit
- zfs_vdev_async_read_max_active
- zfs_vdev_async_read_min_active
- zfs_vdev_async_write_active_max_dirty_percent
- zfs_vdev_async_write_active_min_dirty_percent
- zfs_vdev_async_write_max_active
- zfs_vdev_async_write_min_active
- zfs_vdev_cache_bshift
- zfs_vdev_cache_max
- zfs_vdev_cache_size
- zfs_vdev_max_active
- zfs_vdev_mirror_non_rotating_inc
- zfs_vdev_mirror_non_rotating_seek_inc
- zfs_vdev_mirror_rotating_inc
- zfs_vdev_mirror_rotating_seek_inc
- zfs_vdev_mirror_rotating_seek_offset
- zfs_vdev_queue_depth_pct
- zfs_vdev_raidz_impl
- zfs_vdev_read_gap_limit
- zfs_vdev_scheduler
- zfs_vdev_scrub_max_active
- zfs_vdev_scrub_min_active
- zfs_vdev_sync_read_max_active
- zfs_vdev_sync_read_min_active
- zfs_vdev_sync_write_max_active
- zfs_vdev_sync_write_min_active
- zfs_vdev_write_gap_limit
- zio_dva_throttle_enabled
- zfs_max_recordsize
- zvol_inhibit_dev
- zvol_major
- zvol_max_discard_blocks
- zvol_prefetch_bytes
- zvol_request_sync
- zvol_threads
- zvol_volmode
- zfs_delay_min_dirty_percent
- zfs_delay_scale
- zfs_dirty_data_max
- zfs_dirty_data_max_max
- zfs_dirty_data_max_max_percent
- zfs_dirty_data_max_percent
- zfs_dirty_data_sync
- zfs_commit_timeout_pct
- zfs_immediate_write_sz
- zfs_zil_clean_taskq_maxalloc
- zfs_zil_clean_taskq_minalloc
- zfs_zil_clean_taskq_nthr_pct
- zil_replay_disable
- zil_slog_bulk
- zfs_txg_timeout
- zfs_vdev_aggregation_limit
- zfs_vdev_async_read_max_active
- zfs_vdev_async_read_min_active
- zfs_vdev_async_write_active_max_dirty_percent
- zfs_vdev_async_write_active_min_dirty_percent
- zfs_vdev_async_write_max_active
- zfs_vdev_async_write_min_active
- zfs_vdev_max_active
- zfs_vdev_queue_depth_pct
- zfs_vdev_read_gap_limit
- zfs_vdev_scheduler
- zfs_vdev_scrub_max_active
- zfs_vdev_scrub_min_active
- zfs_vdev_sync_read_max_active
- zfs_vdev_sync_read_min_active
- zfs_vdev_sync_write_max_active
- zfs_vdev_sync_write_min_active
- zfs_vdev_write_gap_limit
- zio_dva_throttle_enabled
- zio_requeue_io_start_cut_in_line
- zio_taskq_batch_pct
- zfs_abd_scatter_enabled
- zfs_abd_scatter_max_order
- zfs_admin_snapshot
- zfs_arc_average_blocksize
- zfs_arc_dnode_limit
- zfs_arc_dnode_limit_percent
- zfs_arc_dnode_reduce_percent
- zfs_arc_evict_batch_limit
- zfs_arc_grow_retry
- zfs_arc_lotsfree_percent
- zfs_arc_max
- zfs_arc_meta_adjust_restarts
- zfs_arc_meta_limit
- zfs_arc_meta_limit_percent
- zfs_arc_meta_min
- zfs_arc_meta_prune
- zfs_arc_meta_strategy
- zfs_arc_min
- zfs_arc_min_prefetch_ms
- zfs_arc_min_prescient_prefetch_ms
- zfs_arc_overflow_shift
- zfs_arc_p_dampener_disable
- zfs_arc_p_min_shift
- zfs_arc_pc_percent
- zfs_arc_shrink_shift
- zfs_arc_sys_free
- zfs_autoimport_disable
- zfs_checksums_per_second
- zfs_commit_timeout_pct
- zfs_compressed_arc_enabled
- zfs_dbgmsg_enable
- zfs_dbgmsg_maxsize
- dbuf_cache_hiwater_pct
- dbuf_cache_lowater_pct
- dbuf_cache_max_bytes
- dbuf_cache_max_shift
- zfs_dbuf_state_index
- zfs_deadman_checktime_ms
- zfs_deadman_enabled
- zfs_deadman_failmode
- zfs_deadman_synctime_ms
- zfs_deadman_ziotime_ms
- zfs_dedup_prefetch
- zfs_delay_min_dirty_percent
- zfs_delay_scale
- zfs_delays_per_second
- zfs_delete_blocks
- zfs_dirty_data_max
- zfs_dirty_data_max_max
- zfs_dirty_data_max_max_percent
- zfs_dirty_data_max_percent
- zfs_dirty_data_sync
- zfs_disable_dup_eviction
- dmu_object_alloc_chunk_shift
- zfs_dmu_offset_next_sync
- zfs_expire_snapshot
- zfs_flags
- zfs_fletcher_4_impl
- zfs_free_bpobj_enabled
- zfs_free_leak_on_eio
- zfs_free_max_blocks
- zfs_free_min_time_ms
- ignore_hole_birth
- zfs_immediate_write_sz
- zfs_key_max_salt_uses
- l2arc_feed_again
- l2arc_feed_min_ms
- l2arc_feed_secs
- l2arc_headroom
- l2arc_headroom_boost
- l2arc_nocompress
- l2arc_noprefetch
- l2arc_norw
- l2arc_write_boost
- l2arc_write_max
- zfs_max_recordsize
- zfs_mdcomp_disable
- metaslab_aliquot
- metaslab_bias_enabled
- metaslab_debug_load
- metaslab_debug_unload
- metaslab_fragmentation_factor_enabled
- zfs_metaslab_fragmentation_threshold
- metaslab_lba_weighting_enabled
- metaslab_preload_enabled
- zfs_metaslab_segment_weight_enabled
- zfs_metaslab_switch_threshold
- metaslabs_per_vdev
- zfs_mg_fragmentation_threshold
- zfs_mg_noalloc_threshold
- zfs_multihost_fail_intervals
- zfs_multihost_history
- zfs_multihost_import_intervals
- zfs_multihost_interval
- zfs_multilist_num_sublists
- zfs_no_scrub_io
- zfs_no_scrub_prefetch
- zfs_nocacheflush
- zfs_nopwrite_enabled
- zfs_object_mutex_size
- zfs_pd_bytes_max
- zfs_per_txg_dirty_frees_percent
- zfs_prefetch_disable
- zfs_qat_compress_disable
- zfs_qat_disable
- zfs_read_chunk_size
- zfs_read_history
- zfs_read_history_hits
- zfs_recover
- zfs_resilver_min_time_ms
- zfs_scan_checkpoint_intval
- zfs_scan_fill_weight
- zfs_scan_issue_strategy
- zfs_scan_legacy
- zfs_scan_max_ext_gap
- zfs_scan_mem_lim_fact
- zfs_scan_mem_lim_soft_fact
- zfs_scan_strict_mem_lim
- zfs_scan_vdev_limit
- zfs_scrub_min_time_ms
- zfs_send_corrupt_data
- send_holes_without_birth_time
- spa_asize_inflation
- spa_config_path
- spa_load_verify_data
- spa_load_verify_maxinflight
- spa_load_verify_metadata
- spa_slop_shift
- zfs_sync_pass_deferred_free
- zfs_sync_pass_dont_compress
- zfs_sync_pass_rewrite
- zfs_sync_taskq_batch_pct
- zfs_txg_history
- zfs_txg_timeout
- zfs_vdev_aggregation_limit
- zfs_vdev_async_read_max_active
- zfs_vdev_async_read_min_active
- zfs_vdev_async_write_active_max_dirty_percent
- zfs_vdev_async_write_active_min_dirty_percent
- zfs_vdev_async_write_max_active
- zfs_vdev_async_write_min_active
- zfs_vdev_cache_bshift
- zfs_vdev_cache_max
- zfs_vdev_cache_size
- zfs_vdev_max_active
- zfs_vdev_mirror_non_rotating_inc
- zfs_vdev_mirror_non_rotating_seek_inc
- zfs_vdev_mirror_rotating_inc
- zfs_vdev_mirror_rotating_seek_inc
- zfs_vdev_mirror_rotating_seek_offset
- zfs_vdev_queue_depth_pct
- zfs_vdev_raidz_impl
- zfs_vdev_read_gap_limit
- zfs_vdev_scheduler
- zfs_vdev_scrub_max_active
- zfs_vdev_scrub_min_active
- zfs_vdev_sync_read_max_active
- zfs_vdev_sync_read_min_active
- zfs_vdev_sync_write_max_active
- zfs_vdev_sync_write_min_active
- zfs_vdev_write_gap_limit
- zfs_zevent_cols
- zfs_zevent_console
- zfs_zevent_len_max
- zfetch_array_rd_sz
- zfetch_max_distance
- zfetch_max_streams
- zfetch_min_sec_reap
- zfs_zil_clean_taskq_maxalloc
- zfs_zil_clean_taskq_minalloc
- zfs_zil_clean_taskq_nthr_pct
- zil_replay_disable
- zil_slog_bulk
- zio_delay_max
- zio_dva_throttle_enabled
- zio_requeue_io_start_cut_in_line
- zio_taskq_batch_pct
- zvol_inhibit_dev
- zvol_major
- zvol_max_discard_blocks
- zvol_prefetch_bytes
- zvol_request_sync
- zvol_threads
- zvol_volmode
When set, the hole_birth optimization will not be used and all holes will
always be sent by zfs send.
In the source code, ignore_hole_birth is an alias for, and the SysFS
PARAMETER for, send_holes_without_birth_time.
ignore_hole_birth | Notes |
---|---|
Tags | send |
When to change | Enable if you suspect your datasets are affected by a bug in hole_birth during zfs send operations |
Data Type | boolean |
Range | 0=disabled, 1=enabled |
Default | 1 (hole birth optimization is ignored) |
Change | Dynamic |
Versions Affected | TBD |
Turbo L2ARC cache warm-up. When the L2ARC is cold the fill interval will be set to aggressively fill as fast as possible.
l2arc_feed_again | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | If cache devices exist and it is desired to fill them as fast as possible |
Data Type | boolean |
Range | 0=disabled, 1=enabled |
Default | 1 |
Change | Dynamic |
Versions Affected | TBD |
Minimum time period for aggressively feeding the L2ARC. The L2ARC feed thread
wakes up once per second (see l2arc_feed_secs) to look for data to feed into
the L2ARC. l2arc_feed_min_ms only affects the turbo L2ARC cache warm-up and
allows the aggressiveness to be adjusted.
l2arc_feed_min_ms | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | If cache devices exist, l2arc_feed_again is enabled, and the feed is too aggressive, then this tunable can be adjusted to reduce the impact of the fill |
Data Type | uint64 |
Units | milliseconds |
Range | 0 to (1000 * l2arc_feed_secs) |
Default | 200 |
Change | Dynamic |
Versions Affected | 0.6 and later |
Seconds between waking the L2ARC feed thread. One feed thread works for all cache devices in turn.
If the pool that owns a cache device is imported readonly, then the feed thread is delayed 5 * l2arc_feed_secs before moving on to the next cache device. If multiple pools are imported with cache devices and one of those pools is imported readonly, the L2ARC feed rate to all caches can be slowed.
l2arc_feed_secs | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | Do not change |
Data Type | uint64 |
Units | seconds |
Range | 1 to UINT64_MAX |
Default | 1 |
Change | Dynamic |
Versions Affected | 0.6 and later |
How far through the ARC lists to search for L2ARC cacheable content, expressed as a multiplier of l2arc_write_max
l2arc_headroom | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | If the rate of change in the ARC is faster than the overall L2ARC feed rate, then increasing l2arc_headroom can increase L2ARC efficiency. Setting the value too large can cause the L2ARC feed thread to consume more CPU time looking for data to feed. |
Data Type | uint64 |
Units | unit |
Range | 0 to UINT64_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | 0.6 and later |
Percentage scale for l2arc_headroom when L2ARC contents are being successfully compressed before writing.
l2arc_headroom_boost | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | If average compression efficiency is greater than 2:1, then increasing l2arc_headroom_boost can increase the L2ARC feed rate |
Data Type | uint64 |
Units | percent |
Range | 100 to UINT64_MAX, when set to 100, the L2ARC headroom boost feature is effectively disabled |
Default | 200 |
Change | Dynamic |
Versions Affected | all |
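The arithmetic behind l2arc_headroom and l2arc_headroom_boost can be sketched as follows. This assumes the multiplier and the percentage scale combine as a simple product, as the descriptions above suggest. The function name is hypothetical, and the real feed thread may apply further limits.

```shell
# Bytes of ARC list searched per feed interval.
# Args: l2arc_write_max (bytes), l2arc_headroom (multiplier),
#       l2arc_headroom_boost (percent; 100 means no boost)
l2arc_scan_bytes() {
    echo $(( $1 * $2 * $3 / 100 ))
}

l2arc_scan_bytes 8388608 2 100   # defaults, no boost: 16777216 (16 MiB)
l2arc_scan_bytes 8388608 2 200   # defaults, boosted:  33554432 (32 MiB)
```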
Disable writing compressed data to cache devices. Disabling allows the legacy behavior of writing decompressed data to cache devices.
l2arc_nocompress | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | When testing compressed L2ARC feature |
Data Type | boolean |
Range | 0=store compressed blocks in cache device, 1=store uncompressed blocks in cache device |
Default | 0 |
Change | Dynamic |
Versions Affected | deprecated in v0.7.0 by new compressed ARC design |
Disables writing prefetched, but unused, buffers to cache devices.
l2arc_noprefetch | Notes |
---|---|
Tags | ARC, L2ARC, prefetch |
When to change | Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. |
Data Type | boolean |
Range | 0=write prefetched but unused buffers to cache devices, 1=do not write prefetched but unused buffers to cache devices |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
Disables writing to cache devices while they are being read.
l2arc_norw | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | In the early days of SSDs, some devices did not perform well when reading and writing simultaneously. Modern SSDs do not have these issues. |
Data Type | boolean |
Range | 0=read and write simultaneously, 1=avoid writes when reading for antique SSDs |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
Until the ARC fills, increases the L2ARC fill rate l2arc_write_max by l2arc_write_boost.
l2arc_write_boost | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | To fill the cache devices more aggressively after pool import. |
Data Type | uint64 |
Units | bytes |
Range | 0 to UINT64_MAX |
Default | 8,388,608 |
Change | Dynamic |
Versions Affected | all |
Maximum number of bytes to be written to each cache device for each L2ARC feed thread interval (see l2arc_feed_secs). The actual limit can be adjusted by l2arc_write_boost. By default l2arc_feed_secs is 1 second, delivering a maximum write workload to cache devices of 8 MiB/sec.
l2arc_write_max | Notes |
---|---|
Tags | ARC, L2ARC |
When to change | If the cache devices can sustain the write workload, increasing the rate of cache device fill when workloads generate new data at a rate higher than l2arc_write_max can increase L2ARC hit rate |
Data Type | uint64 |
Units | bytes |
Range | 1 to UINT64_MAX |
Default | 8,388,608 |
Change | Dynamic |
Versions Affected | all |
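As a rough sketch of the feed-rate arithmetic, the function below divides the per-interval write limit by the feed interval. It optionally adds l2arc_write_boost for the period before the ARC has filled. The function name is hypothetical, and the result is an upper bound per cache device, not a measured rate.

```shell
# Maximum L2ARC write rate per cache device, in bytes per second.
# Args: l2arc_write_max (bytes), l2arc_feed_secs (seconds),
#       optional l2arc_write_boost (bytes, applies only until the ARC fills)
l2arc_max_feed_rate() {
    echo $(( ($1 + ${3:-0}) / $2 ))
}

l2arc_max_feed_rate 8388608 1           # defaults: 8388608 (8 MiB/s)
l2arc_max_feed_rate 8388608 1 8388608   # during warm-up: 16777216 (16 MiB/s)
```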
Sets the metaslab granularity. Nominally, ZFS will try to allocate this amount of data to a top-level vdev before moving on to the next top-level vdev. This is roughly similar to what would be referred to as the "stripe size" in traditional RAID arrays.
When tuning for HDDs, it can be more efficient to have a few larger, sequential
writes to a device rather than switching to the next device. Monitoring the
size of contiguous writes to the disks relative to the write throughput can be
used to determine if increasing metaslab_aliquot can help. For modern devices,
it is unlikely that decreasing metaslab_aliquot from the default will help.
If there is only one top-level vdev, this tunable is not used.
metaslab_aliquot | Notes |
---|---|
Tags | allocation, metaslab, vdev |
When to change | If write performance increases as devices more efficiently write larger, contiguous blocks |
Data Type | uint64 |
Units | bytes |
Range | 0 to UINT64_MAX |
Default | 524,288 |
Change | Dynamic |
Versions Affected | all |
Enables metaslab group biasing based on a top-level vdev's utilization relative to the pool. Nominally, all top-level vdevs are the same size and the allocation is spread evenly. When the top-level vdevs are not of the same size, for example if a new (empty) top-level vdev is added to the pool, this allows the new top-level vdev to get a larger portion of new allocations.
metaslab_bias_enabled | Notes |
---|---|
Tags | allocation, metaslab, vdev |
When to change | If a new top-level vdev is added and you do not want to bias new allocations to the new top-level vdev |
Data Type | boolean |
Range | 0=spread evenly across top-level vdevs, 1=bias spread to favor less full top-level vdevs |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Enables metaslab allocation based on largest free segment rather than total amount of free space. The goal is to avoid metaslabs that exhibit free space fragmentation: when there are many small free spaces, but few larger free spaces.
If zfs_metaslab_segment_weight_enabled is enabled, then
metaslab_fragmentation_factor_enabled is ignored.
zfs_metaslab_segment_weight_enabled | Notes |
---|---|
Tags | allocation, metaslab |
When to change | When testing allocation and fragmentation |
Data Type | boolean |
Range | 0=do not consider metaslab fragmentation, 1=avoid metaslabs where free space is highly fragmented |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
When using segment-based metaslab selection
(see zfs_metaslab_segment_weight_enabled), continue allocating
from the active metaslab until zfs_metaslab_switch_threshold worth of
free space buckets have been exhausted.
zfs_metaslab_switch_threshold | Notes |
---|---|
Tags | allocation, metaslab |
When to change | When testing allocation and fragmentation |
Data Type | uint64 |
Units | free spaces |
Range | 0 to UINT64_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
When enabled, all metaslabs are loaded into memory during pool import. Nominally, metaslab space map information is loaded and unloaded as needed (see metaslab_debug_unload).
It is difficult to predict how much RAM is required to store a space map. An empty or completely full metaslab has a small space map. However, a highly fragmented space map can consume significantly more memory.
Enabling metaslab_debug_load can increase pool import time.
metaslab_debug_load | Notes |
---|---|
Tags | allocation, memory, metaslab |
When to change | When RAM is plentiful and pool import time is not a consideration |
Data Type | boolean |
Range | 0=do not load all metaslab info at pool import (load dynamically as needed), 1=load all metaslab info at pool import |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
When enabled, prevents metaslab information from being dynamically unloaded from RAM. Nominally, metaslab space map information is loaded and unloaded as needed (see metaslab_debug_load).
It is difficult to predict how much RAM is required to store a space map. An empty or completely full metaslab has a small space map. However, a highly fragmented space map can consume significantly more memory.
Enabling metaslab_debug_unload consumes RAM that would otherwise be freed.
metaslab_debug_unload | Notes |
---|---|
Tags | allocation, memory, metaslab |
When to change | When RAM is plentiful and the penalty for dynamically reloading metaslab info from the pool is high |
Data Type | boolean |
Range | 0=dynamically unload metaslab info, 1=unload metaslab info only upon pool export |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Enable use of the fragmentation metric in computing metaslab weights.
In version v0.7.0, if zfs_metaslab_segment_weight_enabled is enabled, then
metaslab_fragmentation_factor_enabled is ignored.
metaslab_fragmentation_factor_enabled | Notes |
---|---|
Tags | allocation, metaslab |
When to change | To test metaslab fragmentation |
Data Type | boolean |
Range | 0=do not consider metaslab free space fragmentation, 1=try to avoid fragmented metaslabs |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
When a vdev is added, it will be divided into approximately, but no more than, this number of metaslabs.
metaslabs_per_vdev | Notes |
---|---|
Tags | allocation, metaslab, vdev |
When to change | When testing metaslab allocation |
Data Type | uint64 |
Units | metaslabs |
Range | 16 to UINT64_MAX |
Default | 200 |
Change | Prior to pool creation or adding new top-level vdevs |
Versions Affected | all |
Enable metaslab group preloading. Each top-level vdev has a metaslab group.
By default, up to 3 copies of metadata can exist and are distributed across multiple
top-level vdevs. metaslab_preload_enabled allows the corresponding metaslabs to be
preloaded, thus improving allocation efficiency.
metaslab_preload_enabled | Notes |
---|---|
Tags | allocation, metaslab |
When to change | When testing metaslab allocation |
Data Type | boolean |
Range | 0=do not preload metaslab info, 1=preload up to 3 metaslabs |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Modern HDDs have uniform bit density and constant angular velocity.
Therefore, the outer recording zones are faster (higher bandwidth)
than the inner zones by the ratio of outer to inner track diameter.
The difference in bandwidth can be 2:1, and is often available in the HDD
detailed specifications or drive manual. For HDDs when
metaslab_lba_weighting_enabled is true, write allocation preference is given
to the metaslabs representing the outer recording zones. Thus the allocation
to metaslabs prefers faster bandwidth over free space.
If the devices are not rotational, yet misrepresent themselves to the OS as
rotational, then disabling metaslab_lba_weighting_enabled can result in more
even, free-space-based allocation.
metaslab_lba_weighting_enabled | Notes |
---|---|
Tags | allocation, metaslab, HDD, SSD |
When to change | disable if using only SSDs and version v0.6.4 or earlier |
Data Type | boolean |
Range | 0=do not use LBA weighting, 1=use LBA weighting |
Default | 1 |
Change | Dynamic |
Verification | The rotational setting reported by a block device can be observed in sysfs at /sys/block/DISK_NAME/queue/rotational |
Versions Affected | prior to v0.6.5, the check for non-rotation media did not exist |
By default, the zpool import command searches for pool information in
the zpool.cache file. If the pool to be imported has an entry
in zpool.cache then the devices do not have to be scanned to determine if
they are pool members. The path to the cache file is spa_config_path.
For more information on zpool import and the -o cachefile and -d
options, see the man page for zpool(8).
See also zfs_autoimport_disable.
spa_config_path | Notes |
---|---|
Tags | import |
When to change | If creating a non-standard distribution and the cachefile property is inconvenient |
Data Type | string |
Default | /etc/zfs/zpool.cache |
Change | Dynamic, applies only to the next invocation of zpool import |
Versions Affected | all |
Multiplication factor used to estimate actual disk consumption from the size of data being written. The default value is a worst case estimate, but lower values may be valid for a given pool depending on its configuration. Pool administrators who understand the factors involved may wish to specify a more realistic inflation factor, particularly if they operate close to quota or capacity limits.
The worst case space requirement for allocation is single-sector
max-parity RAIDZ blocks, in which case the space requirement is exactly
4 times the size, accounting for a maximum of 3 parity blocks.
This is multiplied by the maximum number of ZFS copies
(copies max=3), and doubled again because additional space is required
if the block could impact deduplication tables.
Altogether, the worst case is 4 * 3 * 2 = 24.
If the estimation is not correct, then quotas or out-of-space conditions can lead to optimistic expectations of the ability to allocate. Applications are typically not prepared to deal with such failures and can misbehave.
spa_asize_inflation | Notes |
---|---|
Tags | allocation, SPA |
When to change | If the allocation requirements for the workload are well known and quotas are used |
Data Type | uint64 |
Units | unit |
Range | 1 to 24 |
Default | 24 |
Change | Dynamic |
Versions Affected | v0.6.3 and later |
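The default of 24 factors as 4 (max-parity RAIDZ) times 3 (copies) times 2 (deduplication). The sketch below reproduces that product and applies an inflation factor to a write size. The function names and the 128 KiB example are illustrative assumptions, not ZFS interfaces.

```shell
# Worst-case inflation factor: (max RAIDZ parity + 1) * max copies * 2 for dedup.
worst_case_inflation() {
    max_parity=3
    max_copies=3
    dedup_factor=2
    echo $(( (max_parity + 1) * max_copies * dedup_factor ))
}

# Estimated on-disk bytes for a write.
# Args: logical size (bytes), spa_asize_inflation
estimate_asize() {
    echo $(( $1 * $2 ))
}

worst_case_inflation       # 24
estimate_asize 131072 24   # 128 KiB write: 3145728 bytes (3 MiB)
```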
An extreme rewind import (see zpool import -X) normally performs a
full traversal of all blocks in the pool for verification. If this parameter
is set to 0, the traversal skips non-metadata blocks. It can be toggled
once the import has started to stop or start the traversal of non-metadata
blocks. See also spa_load_verify_metadata.
spa_load_verify_data | Notes |
---|---|
Tags | allocation, SPA |
When to change | At the risk of data integrity, to speed extreme import of large pool |
Data Type | boolean |
Range | 0=do not verify data upon pool import, 1=verify pool data upon import |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
An extreme rewind import (see zpool import -X) normally performs a
full traversal of all blocks in the pool for verification. If this parameter
is set to 0, the traversal is not performed. It can be toggled once the
import has started to stop or start the traversal. See spa_load_verify_data.
spa_load_verify_metadata | Notes |
---|---|
Tags | import |
When to change | At the risk of data integrity, to speed extreme import of large pool |
Data Type | boolean |
Range | 0=do not verify metadata upon pool import, 1=verify pool metadata upon import |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum number of concurrent I/Os during the data verification performed
during an extreme rewind import (see zpool import -X).
spa_load_verify_maxinflight | Notes |
---|---|
Tags | import |
When to change | During an extreme rewind import, to match the concurrent I/O capabilities of the pool devices |
Data Type | int |
Units | I/Os |
Range | 1 to MAX_INT |
Default | 10,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Normally, the last 3.1% (1/(2^spa_slop_shift), or 1/32 with the default shift of 5) of pool space
is reserved to ensure the pool doesn't run completely out of space,
due to unaccounted changes (e.g. to the MOS).
This also limits the worst-case time to allocate space.
When less than this amount of free space exists, most ZPL operations
(e.g. write, create) return the error no space (ENOSPC).
Changing spa_slop_shift affects the currently loaded ZFS module and all imported pools. spa_slop_shift is not stored on disk. Beware that importing full pools on systems with a larger spa_slop_shift can lead to over-full conditions.
The minimum SPA slop space is limited to 128 MiB.
spa_slop_shift | Notes |
---|---|
Tags | allocation, SPA |
When to change | For large pools, when 3.1% may be too conservative and more usable space is desired, consider increasing spa_slop_shift |
Data Type | int |
Units | shift |
Range | 1 to MAX_INT |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
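A simplified sketch of the slop reservation follows, assuming only the two rules stated above: the pool size shifted right by spa_slop_shift, with a 128 MiB floor. The function name is hypothetical, and the actual implementation may apply additional limits.

```shell
# Reserved slop space in bytes.
# Args: pool size (bytes), spa_slop_shift
slop_bytes() {
    slop=$(( $1 >> $2 ))
    min_slop=$(( 128 * 1024 * 1024 ))
    if [ "$slop" -lt "$min_slop" ]; then
        slop=$min_slop
    fi
    echo "$slop"
}

slop_bytes 10995116277760 5   # 10 TiB pool, shift 5: 343597383680 (320 GiB)
slop_bytes 10995116277760 6   # same pool, shift 6:   171798691840 (160 GiB)
slop_bytes 1073741824 5       # 1 GiB pool hits the floor: 134217728 (128 MiB)
```

Each increment of spa_slop_shift halves the reservation until the floor is reached.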
If prefetching is enabled, do not prefetch blocks larger than zfetch_array_rd_sz.
zfetch_array_rd_sz | Notes |
---|---|
Tags | prefetch |
When to change | To allow prefetching when using large block sizes |
Data Type | unsigned long |
Units | bytes |
Range | 0 to MAX_ULONG |
Default | 1,048,576 (1 MiB) |
Change | Dynamic |
Versions Affected | all |
Limits the maximum number of bytes to prefetch per stream.
zfetch_max_distance | Notes |
---|---|
Tags | prefetch |
When to change | Consider increasing for read workloads that use large blocks and exhibit high prefetch hit ratios |
Data Type | uint |
Units | bytes |
Range | 0 to UINT_MAX |
Default | 8,388,608 |
Change | Dynamic |
Versions Affected | v0.7.0 |
Maximum number of prefetch streams per file.
For version v0.7.0 and later, when prefetching small files the number of prefetch streams is automatically reduced to prevent the streams from overlapping.
zfetch_max_streams | Notes |
---|---|
Tags | prefetch |
When to change | If the workload benefits from prefetching and has more than zfetch_max_streams concurrent reader threads |
Data Type | uint |
Units | streams |
Range | 1 to MAX_UINT |
Default | 8 |
Change | Dynamic |
Versions Affected | all |
Prefetch streams that have not been accessed in zfetch_min_sec_reap seconds are
automatically stopped.
zfetch_min_sec_reap | Notes |
---|---|
Tags | prefetch |
When to change | To test prefetch efficiency |
Data Type | uint |
Units | seconds |
Range | 0 to MAX_UINT |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
Percentage of ARC metadata space that can be used for dnodes.
The value calculated for zfs_arc_dnode_limit_percent can be overridden by zfs_arc_dnode_limit.
zfs_arc_dnode_limit_percent | Notes |
---|---|
Tags | ARC |
When to change | Testing dnode cache efficiency |
Data Type | int |
Units | percent of arc_meta_limit |
Range | 0 to 100 |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
When the number of bytes consumed by dnodes in the ARC exceeds zfs_arc_dnode_limit bytes, demand for new metadata can take from the space consumed by dnodes.
The default value of 0 indicates that the limit is calculated as zfs_arc_dnode_limit_percent percent of the ARC meta buffers that may be used for dnodes.
zfs_arc_dnode_limit is similar to zfs_arc_meta_prune, which serves a similar purpose for metadata.
zfs_arc_dnode_limit | Notes |
---|---|
Tags | ARC |
When to change | Testing dnode cache efficiency |
Data Type | uint64 |
Units | bytes |
Range | 0 to MAX_UINT64 |
Default | 0 (uses zfs_arc_dnode_limit_percent) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
Percentage of ARC dnodes to try to evict in response to demand for non-metadata when the number of bytes consumed by dnodes exceeds zfs_arc_dnode_limit.
zfs_arc_dnode_reduce_percent | Notes |
---|---|
Tags | ARC |
When to change | Testing dnode cache efficiency |
Data Type | uint64 |
Units | percent of size of dnode space used above zfs_arc_dnode_limit |
Range | 0 to 100 |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The ARC's buffer hash table is sized based on the assumption of an average block size of zfs_arc_average_blocksize. The default of 8 KiB uses approximately 1 MiB of hash table per 1 GiB of physical memory with 8-byte pointers.
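The sizing estimate above can be reproduced with a small sketch, assuming one 8-byte pointer per expected average-sized block. The function name is hypothetical, and the module may round the bucket count (for example to a power of two), which is not modeled here.

```python
GIB = 1024 ** 3

def hash_table_bytes(physmem_bytes, avg_blocksize=8192, pointer_bytes=8):
    """Estimate hash table size: one pointer per average-sized block of RAM."""
    buckets = physmem_bytes // avg_blocksize
    return buckets * pointer_bytes

# 8 KiB default: about 1 MiB of hash table per GiB of physical memory.
print(hash_table_bytes(GIB))
# A larger assumed average block size shrinks the table proportionally.
print(hash_table_bytes(GIB, 65536))
```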
zfs_arc_average_blocksize | Notes |
---|---|
Tags | ARC, memory |
When to change | For workloads where the known average blocksize is larger, increasing zfs_arc_average_blocksize can reduce memory usage |
Data Type | int |
Units | bytes |
Range | 512 to 16,777,216 |
Default | 8,192 |
Change | Prior to zfs module load |
Versions Affected | all |
Number of ARC headers to evict per sublist before proceeding to another sublist. This batch-style operation prevents entire sublists from being evicted at once but comes at a cost of additional unlocking and locking.
zfs_arc_evict_batch_limit | Notes |
---|---|
Tags | ARC |
When to change | Testing ARC multilist features |
Data Type | int |
Units | count of ARC headers |
Range | 1 to INT_MAX |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
When the ARC is shrunk due to memory demand, do not retry growing the ARC for zfs_arc_grow_retry seconds. This operates as a damper to prevent oscillating grow/shrink cycles when there is memory pressure.
If zfs_arc_grow_retry = 0, the internal default of 5 seconds is used.
zfs_arc_grow_retry | Notes |
---|---|
Tags | ARC, memory |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 1 to MAX_INT |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Throttle ARC memory consumption, effectively throttling I/O, when free system memory drops below this percentage of total system memory. Setting zfs_arc_lotsfree_percent to 0 disables the throttle.
The arcstat_memory_throttle_count counter in /proc/spl/kstat/zfs/arcstats can indicate throttle activity.
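Counters such as this one can be read programmatically. The sketch below assumes the kstat file has two header lines followed by "name type data" rows, and that the counter appears without the arcstat_ prefix (memory_throttle_count). The helper names and the sample text are illustrative, not real output.

```python
def parse_kstat(text):
    """Parse 'name type data' rows of a kstat file into a dict of ints."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        fields = line.split()
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    """Read and parse the live arcstats file (requires the zfs module)."""
    with open(path) as f:
        return parse_kstat(f.read())

# Illustrative sample only; values are made up.
SAMPLE = """\
13 1 0x01 96 26112 8263416830 1024519263521
name                            type data
memory_throttle_count           4    0
c_max                           4    8589934592
"""
print(parse_kstat(SAMPLE)["memory_throttle_count"])
```

The same parser can be used for the other arcstats entries named in the Verification rows, such as c_max, c_min, and arc_meta_limit.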
zfs_arc_lotsfree_percent | Notes |
---|---|
Tags | ARC, memory |
When to change | TBD |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Maximum size of ARC in bytes. If set to 0 then the maximum ARC size is set to 1/2 of system RAM.
zfs_arc_max can be changed dynamically with some caveats. It cannot be set back to 0 while running, and reducing it below the current ARC size will not cause the ARC to shrink without memory pressure to induce shrinking.
zfs_arc_max | Notes |
---|---|
Tags | ARC, memory |
When to change | Reduce if ARC competes too much with other applications, increase if ZFS is the primary application and can use more RAM |
Data Type | uint64 |
Units | bytes |
Range | 67,108,864 to RAM size in bytes |
Default | 0 (uses default of RAM size in bytes / 2) |
Change | Dynamic (see description above) |
Verification | c column in arcstats.py or /proc/spl/kstat/zfs/arcstats entry c_max |
Versions Affected | all |
The number of restart passes to make while scanning the ARC attempting to free buffers in order to stay below the zfs_arc_meta_limit.
zfs_arc_meta_adjust_restarts | Notes |
---|---|
Tags | ARC |
When to change | Testing ARC metadata adjustment feature |
Data Type | int |
Units | restarts |
Range | 0 to INT_MAX |
Default | 4,096 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Sets the maximum allowed size of metadata buffers in the ARC.
When zfs_arc_meta_limit is reached, metadata buffers are reclaimed, even if the overall c_max has not been reached.
In version v0.7.0, with a default value = 0, zfs_arc_meta_limit_percent is used to set arc_meta_limit.
zfs_arc_meta_limit | Notes |
---|---|
Tags | ARC |
When to change | For workloads where the metadata to data ratio in the ARC can be changed to improve ARC hit rates |
Data Type | uint64 |
Units | bytes |
Range | 0 to c_max |
Default | 0 |
Change | Dynamic, except that it cannot be set back to 0 for a specific percent of the ARC; it must be set to an explicit value |
Verification | /proc/spl/kstat/zfs/arcstats entry arc_meta_limit |
Versions Affected | all |
Sets the limit to ARC metadata, arc_meta_limit, as a percentage of the maximum size target of the ARC, c_max.
Prior to version v0.7.0, zfs_arc_meta_limit was used to set the limit as a fixed size. zfs_arc_meta_limit_percent provides a more convenient interface for setting the limit.
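To make the percentage relationships concrete, the following sketch derives arc_meta_limit from c_max and then a dnode limit from arc_meta_limit, using the documented defaults of 75 and 10 percent. The 8 GiB c_max and the helper name are hypothetical, and integer truncation is an assumption about rounding.

```python
GIB = 1024 ** 3

def percent_of(total_bytes, percent):
    """Integer percentage of a byte count, truncating any remainder."""
    return total_bytes * percent // 100

c_max = 8 * GIB                                   # hypothetical ARC maximum
arc_meta_limit = percent_of(c_max, 75)            # zfs_arc_meta_limit_percent
arc_dnode_limit = percent_of(arc_meta_limit, 10)  # zfs_arc_dnode_limit_percent

print(arc_meta_limit // GIB, "GiB metadata limit")
print(arc_dnode_limit, "bytes dnode limit")
```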
zfs_arc_meta_limit_percent | Notes |
---|---|
Tags | ARC |
When to change | For workloads where the metadata to data ratio in the ARC can be changed to improve ARC hit rates |
Data Type | uint64 |
Units | percent of c_max |
Range | 0 to 100 |
Default | 75 |
Change | Dynamic |
Verification | /proc/spl/kstat/zfs/arcstats entry arc_meta_limit |
Versions Affected | v0.7.0 and later |
The minimum allowed size in bytes that metadata buffers may consume in the ARC. This value defaults to 0, which disables a floor on the amount of the ARC devoted to metadata.
When evicting data from the ARC, if metadata_size is less than arc_meta_min, then data is evicted instead of metadata.
zfs_arc_meta_min | Notes |
---|---|
Tags | ARC |
When to change | |
Data Type | uint64 |
Units | bytes |
Range | 16,777,216 to c_max |
Default | 0 (use internal default 16 MiB) |
Change | Dynamic |
Verification | /proc/spl/kstat/zfs/arcstats entry arc_meta_min |
Versions Affected | all |
zfs_arc_meta_prune sets the number of dentries and znodes to be scanned looking for entries which can be dropped.
This provides a mechanism to ensure the ARC can honor the arc_meta_limit and reclaim otherwise pinned ARC buffers.
Pruning may be required when the ARC size drops to arc_meta_limit because dentries and znodes can pin buffers in the ARC.
Increasing this value will cause the dentry and znode caches to be pruned more aggressively, and the arc_prune thread becomes more active.
Setting zfs_arc_meta_prune to 0 will disable pruning.
zfs_arc_meta_prune | Notes |
---|---|
Tags | ARC |
When to change | TBD |
Data Type | uint64 |
Units | entries |
Range | 0 to INT_MAX |
Default | 10,000 |
Change | Dynamic |
Verification | Prune activity is counted by the /proc/spl/kstat/zfs/arcstats entry arc_prune |
Versions Affected | v0.6.5 and later |
Defines the strategy for ARC metadata eviction (meta reclaim strategy). A value of 0 (META_ONLY) will evict only the ARC metadata. A value of 1 (BALANCED) indicates that additional data may be evicted if required in order to evict the requested amount of metadata.
zfs_arc_meta_strategy | Notes |
---|---|
Tags | ARC |
When to change | Testing ARC metadata eviction |
Data Type | int |
Units | enum |
Range | 0=evict metadata only, 1=also evict data buffers if they can free metadata buffers for eviction |
Default | 1 (BALANCED) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at c_min as tuned by zfs_arc_min.
zfs_arc_min | Notes |
---|---|
Tags | ARC |
When to change | If the primary focus of the system is ZFS, then increasing can ensure the ARC gets a minimum amount of RAM |
Data Type | uint64 |
Units | bytes |
Range | 33,554,432 to c_max |
Default | greater of 33,554,432 (32 MiB) and c_max / 2 |
Change | Dynamic |
Verification | /proc/spl/kstat/zfs/arcstats entry c_min |
Versions Affected | all |
Minimum time prefetched blocks are locked in the ARC.
A value of 0 represents the default of 1 second. However, once changed, dynamically setting to 0 will not return to the default.
zfs_arc_min_prefetch_ms | Notes |
---|---|
Tags | ARC, prefetch |
When to change | TBD |
Data Type | int |
Units | milliseconds |
Range | 1 to INT_MAX |
Default | 0 (use internal default of 1000 ms) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
Minimum time "prescient prefetched" blocks are locked in the ARC. These blocks are meant to be prefetched fairly aggressively ahead of the code that may use them.
A value of 0 represents the default of 6 seconds. However, once changed, dynamically setting to 0 will not return to the default.
zfs_arc_min_prescient_prefetch_ms | Notes |
---|---|
Tags | ARC, prefetch |
When to change | TBD |
Data Type | int |
Units | milliseconds |
Range | 1 to INT_MAX |
Default | 0 (use internal default of 6000 ms) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
To allow more fine-grained locking, each ARC state contains a series of lists (sublists) for both data and metadata objects. Locking is performed at the sublist level. This parameter controls the number of sublists per ARC state, and also applies to other uses of the multilist data structure.
zfs_multilist_num_sublists | Notes |
---|---|
Tags | ARC |
When to change | TBD |
Data Type | int |
Units | lists |
Range | 1 to INT_MAX |
Default | 0 (internal value is greater of number of online CPUs or 4) |
Change | Prior to zfs module load |
Versions Affected | v0.7.0 and later |
The ARC size is considered to be overflowing if it exceeds the current ARC target size (/proc/spl/kstat/zfs/arcstats entry c) by a threshold determined by zfs_arc_overflow_shift.
The threshold is calculated as a fraction of c using the formula:
(ARC target size) c >> zfs_arc_overflow_shift
The default value of 8 causes the ARC to be considered to be overflowing if it exceeds the target size by 1/256th (approximately 0.4%) of the target size.
When the ARC is overflowing, new buffer allocations are stalled until the reclaim thread catches up and the overflow condition no longer exists.
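The threshold formula can be sketched as follows. The function names and the 4 GiB target are hypothetical, and treating a size exactly at the threshold as overflowing is an assumption.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def arc_overflow_threshold(c, zfs_arc_overflow_shift=8):
    """Bytes above the target size c at which the ARC counts as overflowing."""
    return c >> zfs_arc_overflow_shift

def is_overflowing(arc_size, c, zfs_arc_overflow_shift=8):
    """True when arc_size is at least the threshold above the target c."""
    return arc_size >= c + arc_overflow_threshold(c, zfs_arc_overflow_shift)

c = 4 * GIB                                # hypothetical ARC target size
print(arc_overflow_threshold(c) // MIB)    # threshold in MiB
print(is_overflowing(c + 20 * MIB, c))
```

With a 4 GiB target and the default shift of 8, the threshold works out to 16 MiB.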
zfs_arc_overflow_shift | Notes |
---|---|
Tags | ARC |
When to change | TBD |
Data Type | int |
Units | shift |
Range | 1 to INT_MAX |
Default | 8 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
arc_p_min_shift is used to shift the ARC target size (/proc/spl/kstat/zfs/arcstats entry c) when calculating both the minimum and maximum most recently used (MRU) target size (/proc/spl/kstat/zfs/arcstats entry p).
A value of 0 represents the default setting of arc_p_min_shift = 4. However, once changed, dynamically setting zfs_arc_p_min_shift to 0 will not return to the default.
zfs_arc_p_min_shift | Notes |
---|---|
Tags | ARC |
When to change | TBD |
Data Type | int |
Units | shift |
Range | 1 to INT_MAX |
Default | 0 (internal default = 4) |
Change | Dynamic |
Verification | Observe changes to /proc/spl/kstat/zfs/arcstats entry p |
Versions Affected | all |
When data is being added to the ghost lists, the MRU target size is adjusted. The amount of adjustment is based on the ratio of the MRU/MFU sizes. When the dampener is enabled (zfs_arc_p_dampener_disable = 0), the ratio is capped to 10, avoiding large adjustments.
zfs_arc_p_dampener_disable | Notes |
---|---|
Tags | ARC |
When to change | Testing ARC ghost list behaviour |
Data Type | boolean |
Range | 0=avoid large adjustments, 1=permit large adjustments |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
arc_shrink_shift is used to adjust the ARC target sizes when a large reduction is required. The current ARC target size, c, and MRU size, p, can be reduced by the current size >> arc_shrink_shift. For the default value of 7, this reduces the target by approximately 0.8%.
A value of 0 represents the default setting of arc_shrink_shift = 7. However, once changed, dynamically setting arc_shrink_shift to 0 will not return to the default.
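A short sketch of the shift arithmetic shows how a smaller shift produces a larger reduction per step. The function name and the 4 GiB target size are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def shrink_amount(size_bytes, arc_shrink_shift=7):
    """Bytes removed from a target size in one shrink step."""
    return size_bytes >> arc_shrink_shift

c = 4 * GIB                                  # hypothetical ARC target size
print(shrink_amount(c) // MIB, "MiB")        # default shift of 7 (1/128)
print(shrink_amount(c, 5) // MIB, "MiB")     # shift of 5 (1/32)
```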
zfs_arc_shrink_shift | Notes |
---|---|
Tags | ARC, memory |
When to change | During memory shortfall, reducing zfs_arc_shrink_shift increases the rate of ARC shrinkage |
Data Type | int |
Units | shift |
Range | 1 to INT_MAX |
Default | 0 (arc_shrink_shift = 7) |
Change | Dynamic |
Versions Affected | all |
zfs_arc_pc_percent allows the ZFS ARC to play more nicely with the kernel's LRU pagecache. It can guarantee that the ARC size won't collapse under scanning pressure on the pagecache, yet still allows the ARC to be reclaimed down to zfs_arc_min if necessary. This value is specified as a percent of pagecache size (as measured by NR_FILE_PAGES), where that percent may exceed 100. This only operates during memory pressure/reclaim.
zfs_arc_pc_percent | Notes |
---|---|
Tags | ARC, memory |
When to change | When using file systems under memory shortfall, if the page scanner causes the ARC to shrink too fast, then adjusting zfs_arc_pc_percent can reduce the shrink rate |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 0 (disabled) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_arc_sys_free is the target number of bytes the ARC should leave as free memory on the system.
A value of 0 represents the default setting of the larger of 1/64 of physical memory or 512 KiB. Setting this option to a non-zero value overrides the default. However, once changed, dynamically setting zfs_arc_sys_free to 0 will not return to the default.
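The default described above can be sketched as a simple maximum. The function name and the example memory sizes are hypothetical.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

def default_arc_sys_free(physmem_bytes):
    """Larger of 1/64 of physical memory or 512 KiB."""
    return max(physmem_bytes // 64, 512 * KIB)

print(default_arc_sys_free(16 * GIB) // MIB, "MiB")   # 1/64 of 16 GiB
print(default_arc_sys_free(16 * MIB) // KIB, "KiB")   # floor applies
```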
zfs_arc_sys_free | Notes |
---|---|
Tags | ARC, memory |
When to change | Change if more free memory is desired as a margin against memory demand by applications |
Data Type | ulong |
Units | bytes |
Range | 0 to ULONG_MAX |
Default | 0 (default to larger of 1/64 of physical memory or 512 KiB) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Disable reading zpool.cache file (see spa_config_path) when loading the zfs module.
zfs_autoimport_disable | Notes |
---|---|
Tags | import |
When to change | Leave as default so that zfs behaves as other Linux kernel modules |
Data Type | boolean |
Range | 0=read zpool.cache at module load, 1=do not read zpool.cache at module load |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_commit_timeout_pct controls the amount of time that a log (ZIL) write block (lwb) remains "open" when it isn't "full" and it has a thread waiting to commit to stable storage.
The timeout is scaled based on a percentage of the last lwb latency to avoid significantly impacting the latency of each individual intent log transaction (itx).
zfs_commit_timeout_pct | Notes |
---|---|
Tags | ZIL |
When to change | TBD |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.8.0 |
Internally ZFS keeps a small log to facilitate debugging.
The contents of the log are in the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to the /proc/spl/kstat/zfs/dbgmsg file clears the log.
See also zfs_dbgmsg_maxsize.
zfs_dbgmsg_enable | Notes |
---|---|
Tags | debug |
When to change | To view ZFS internal debug log |
Data Type | boolean |
Range | 0=do not log debug messages, 1=log debug messages |
Default | 0 (1 for debug builds) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
The /proc/spl/kstat/zfs/dbgmsg file size limit is set by zfs_dbgmsg_maxsize.
See also zfs_dbgmsg_enable.
zfs_dbgmsg_maxsize | Notes |
---|---|
Tags | debug |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 4 MiB |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
The zfs_dbuf_state_index feature is currently unused. It is normally used for controlling values in the /proc/spl/kstat/zfs/dbufs file.
zfs_dbuf_state_index | Notes |
---|---|
Tags | debug |
When to change | Do not change |
Data Type | int |
Units | TBD |
Range | TBD |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
When a pool sync operation takes longer than zfs_deadman_synctime_ms milliseconds, a "slow spa_sync" message is logged to the debug log (see zfs_dbgmsg_enable). If zfs_deadman_enabled is set to 1, then all pending I/O operations are also checked, and if any have not completed within zfs_deadman_synctime_ms milliseconds, a "SLOW IO" message is logged to the debug log and a "deadman" system event (see the zpool events command) with the details of the hung I/O is posted.
zfs_deadman_enabled | Notes |
---|---|
Tags | debug |
When to change | To disable logging of slow I/O |
Data Type | boolean |
Range | 0=do not log slow I/O, 1=log slow I/O |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.8.0 |
Once a pool sync operation has taken longer than zfs_deadman_synctime_ms milliseconds, continue to check for slow operations every zfs_deadman_checktime_ms milliseconds.
zfs_deadman_checktime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 60,000 (1 minute) |
Change | Dynamic |
Versions Affected | v0.8.0 |
When an individual I/O takes longer than zfs_deadman_ziotime_ms milliseconds, then the operation is considered to be "hung". If zfs_deadman_enabled is set, then the deadman behaviour is invoked as described by the zfs_deadman_failmode option.
zfs_deadman_ziotime_ms | Notes |
---|---|
Tags | debug |
When to change | Testing ABD features |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 300,000 (5 minutes) |
Change | Dynamic |
Versions Affected | v0.8.0 |
The I/O deadman timer expiration time has two meanings:
- determines when the spa_deadman() logic should fire, indicating the txg sync has not completed in a timely manner
- determines if an I/O is considered "hung"
In version v0.8.0, any I/O that has not completed in zfs_deadman_synctime_ms is considered "hung", resulting in one of three behaviors controlled by the zfs_deadman_failmode parameter.
zfs_deadman_synctime_ms takes effect if zfs_deadman_enabled = 1.
zfs_deadman_synctime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 600,000 (10 minutes) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
zfs_deadman_failmode controls the behavior of the I/O deadman timer when it detects a "hung" I/O. Valid values are:
- wait - Wait for the "hung" I/O (default)
- continue - Attempt to recover from a "hung" I/O
- panic - Panic the system
zfs_deadman_failmode | Notes |
---|---|
Tags | debug |
When to change | In some cluster cases, panic can be appropriate |
Data Type | string |
Range | wait, continue, or panic |
Default | wait |
Change | Dynamic |
Versions Affected | v0.8.0 |
ZFS can prefetch deduplication table (DDT) entries. zfs_dedup_prefetch allows DDT prefetches to be enabled.
zfs_dedup_prefetch | Notes |
---|---|
Tags | prefetch, memory |
When to change | For systems with limited RAM using the dedup feature, disabling deduplication table prefetch can reduce memory pressure |
Data Type | boolean |
Range | 0=do not prefetch, 1=prefetch dedup table entries |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
zfs_delete_blocks defines a large file for the purposes of delete.
Files containing more than zfs_delete_blocks blocks will be deleted asynchronously, while smaller files are deleted synchronously.
Decreasing this value reduces the time spent in an unlink(2) system call at the expense of a longer delay before the freed space is available.
The zfs_delete_blocks value is specified in blocks, not bytes. The size of blocks can vary and is ultimately limited by the filesystem's recordsize property.
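Because the threshold is expressed in blocks, the corresponding file size depends on the block size. The sketch below gives an upper-bound estimate, assuming every block is a full recordsize. The function name and the 128 KiB recordsize are illustrative assumptions.

```python
KIB = 1024
MIB = 1024 ** 2

def async_delete_threshold_bytes(recordsize, zfs_delete_blocks=20480):
    """Approximate file size above which deletes become asynchronous."""
    return zfs_delete_blocks * recordsize

# With a hypothetical 128 KiB recordsize and the default of 20,480 blocks.
print(async_delete_threshold_bytes(128 * KIB) // MIB, "MiB")
```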
zfs_delete_blocks | Notes |
---|---|
Tags | filesystem, delete |
When to change | If applications delete large files and blocking on unlink(2) is not desired |
Data Type | ulong |
Units | blocks |
Range | 1 to ULONG_MAX |
Default | 20,480 |
Change | Dynamic |
Versions Affected | all |
The ZFS write throttle begins to delay each transaction when the amount of dirty data reaches the threshold zfs_delay_min_dirty_percent of zfs_dirty_data_max.
This value should be >= zfs_vdev_async_write_active_max_dirty_percent.
zfs_delay_min_dirty_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 60 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_delay_scale controls how quickly the ZFS write throttle transaction delay approaches infinity.
Larger values cause longer delays for a given amount of dirty data.
For the smoothest delay, this value should be about 1 billion divided by the maximum number of write operations per second the pool can sustain.
The throttle will smoothly handle between 10x and 1/10th zfs_delay_scale.
Note: zfs_delay_scale * zfs_dirty_data_max must be < 2^64.
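The sizing guidance and the overflow constraint above can be checked with a small sketch. The function names and the example IOPS figure are hypothetical; the pool's sustainable write rate has to be measured separately.

```python
def suggested_delay_scale(max_write_iops):
    """About 1 billion divided by the pool's sustainable write IOPS."""
    return 1_000_000_000 // max_write_iops

def scale_is_safe(zfs_delay_scale, zfs_dirty_data_max):
    """Check that zfs_delay_scale * zfs_dirty_data_max stays below 2^64."""
    return zfs_delay_scale * zfs_dirty_data_max < 2 ** 64

# A pool sustaining 2,000 write IOPS matches the default of 500,000.
print(suggested_delay_scale(2000))
print(scale_is_safe(500_000, 4 * 1024 ** 3))
```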
zfs_delay_scale | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | scalar (nanoseconds) |
Range | 0 to ULONG_MAX |
Default | 500,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max is the ZFS write throttle dirty space limit.
Once this limit is exceeded, new writes are delayed until space is freed by writes being committed to the pool.
zfs_dirty_data_max takes precedence over zfs_dirty_data_max_percent.
zfs_dirty_data_max | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | bytes |
Range | 1 to zfs_dirty_data_max_max |
Default | 10% of physical RAM |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_percent is an alternative method of specifying zfs_dirty_data_max, the ZFS write throttle dirty space limit.
Once this limit is exceeded, new writes are delayed until space is freed by writes being committed to the pool.
zfs_dirty_data_max takes precedence over zfs_dirty_data_max_percent.
zfs_dirty_data_max_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 10% of physical RAM |
Change | Prior to zfs module load or a memory hot plug event |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_max is the maximum allowable value of zfs_dirty_data_max.
zfs_dirty_data_max_max takes precedence over zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | bytes |
Range | 1 to physical RAM size |
Default | 25% of physical RAM |
Change | Prior to zfs module load |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_max_percent is an alternative to zfs_dirty_data_max_max for setting the maximum allowable value of zfs_dirty_data_max.
zfs_dirty_data_max_max takes precedence over zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 25% of physical RAM |
Change | Prior to zfs module load |
Versions Affected | v0.6.4 and later |
When there is at least zfs_dirty_data_sync bytes of dirty data, a transaction group sync is started. This allows a transaction group sync to occur more frequently than the transaction group timeout interval (see zfs_txg_timeout) when there is dirty data to be written.
zfs_dirty_data_sync | Notes |
---|---|
Tags | write_throttle |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 1 to ULONG_MAX |
Default | 67,108,864 (64 MiB) |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Fletcher-4 is the default checksum algorithm for metadata and data.
When the zfs kernel module is loaded, a set of microbenchmarks is run to determine the fastest algorithm for the current hardware. The zfs_fletcher_4_impl parameter allows a specific implementation to be specified other than the default (fastest).
Selectors other than fastest and scalar require instruction set extensions to be available and will only appear if ZFS detects their presence. The scalar implementation works on all processors.
The results of the microbenchmark are visible in the /proc/spl/kstat/zfs/fletcher_4_bench file.
Larger numbers indicate better performance.
Since ZFS is processor endian-independent, the microbenchmark is run against both big and little-endian transformation.
zfs_fletcher_4_impl | Notes |
---|---|
Tags | CPU, checksum |
When to change | Testing Fletcher-4 algorithms |
Data Type | string |
Range | fastest, scalar, superscalar, superscalar4, sse2, ssse3, avx2, avx512f, or aarch64_neon depending on hardware support |
Default | fastest |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The processing of the free_bpobj object can be enabled by zfs_free_bpobj_enabled.
zfs_free_bpobj_enabled | Notes |
---|---|
Tags | delete |
When to change | If there's a problem with processing free_bpobj (e.g. i/o error or bug) |
Data Type | boolean |
Range | 0=do not process free_bpobj objects, 1=process free_bpobj objects |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_free_max_blocks sets the maximum number of blocks to be freed in a single transaction group (txg). For workloads that delete (free) large numbers of blocks in a short period of time, the processing of the frees can negatively impact other operations, including txg commits. zfs_free_max_blocks acts as a limit to reduce the impact.
zfs_free_max_blocks | Notes |
---|---|
Tags | filesystem, delete |
When to change | For workloads that delete large files, zfs_free_max_blocks can be adjusted to meet performance requirements while reducing the impacts of deletion |
Data Type | ulong |
Units | blocks |
Range | 1 to ULONG_MAX |
Default | 100,000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
Maximum asynchronous read I/Os active to each device.
zfs_vdev_async_read_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 3 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Minimum asynchronous read I/Os active to each device.
zfs_vdev_async_read_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to (zfs_vdev_async_read_max_active - 1) |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
When the amount of dirty data exceeds the threshold zfs_vdev_async_write_active_max_dirty_percent of zfs_dirty_data_max dirty data, then zfs_vdev_async_write_max_active is used to limit active async writes.
If the dirty data is between zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent, the active I/O limit is linearly interpolated between zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active.
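The interpolation can be sketched as below, using the documented defaults (30 and 60 percent, and 2 and 10 active I/Os for v0.7.0 and later). The function name is hypothetical. Using the minimum limit when dirty data is below the lower threshold, and integer arithmetic for rounding, are assumptions about behaviour this section does not spell out.

```python
def async_write_limit(dirty_bytes, dirty_data_max,
                      min_active=2, max_active=10,
                      min_pct=30, max_pct=60):
    """Active async write limit for a given amount of dirty data."""
    lo = dirty_data_max * min_pct // 100     # lower dirty-data threshold
    hi = dirty_data_max * max_pct // 100     # upper dirty-data threshold
    if dirty_bytes <= lo:
        return min_active                    # assumed behaviour below range
    if dirty_bytes >= hi:
        return max_active
    # Linear interpolation between the two limits.
    return min_active + (dirty_bytes - lo) * (max_active - min_active) // (hi - lo)

# Dirty data at 45% of the maximum is halfway between the thresholds.
print(async_write_limit(450, 1000))
```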
zfs_vdev_async_write_active_max_dirty_percent | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | percent of zfs_dirty_data_max |
Range | 0 to 100 |
Default | 60 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
If the amount of dirty data is between zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent of zfs_dirty_data_max, the active I/O limit is linearly interpolated between zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active.
zfs_vdev_async_write_active_min_dirty_percent | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | percent of zfs_dirty_data_max |
Range | 0 to (zfs_vdev_async_write_active_max_dirty_percent - 1) |
Default | 30 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_async_write_max_active sets the maximum asynchronous write I/Os active to each device.
zfs_vdev_async_write_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_async_write_min_active sets the minimum asynchronous write I/Os active to each device.
Lower values are associated with better latency on rotational media but poorer resilver performance. The default value of 2 was chosen as a compromise. A value of 3 has been shown to improve resilver performance further at a cost of further increasing latency.
zfs_vdev_async_write_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_async_write_max_active |
Default | 1 for v0.6.x, 2 for v0.7.0 and later |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
The maximum number of I/Os active to each device. Ideally, zfs_vdev_max_active >= the sum of each queue's max_active.
Once queued to the device, the ZFS I/O scheduler is no longer able to prioritize I/O operations. The underlying device drivers have their own scheduler and queue depth limits. Values larger than the device's maximum queue depth can have the effect of increased latency as the I/Os are queued in the intervening device driver layers.
zfs_vdev_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | sum of each queue's min_active to UINT32_MAX |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_scrub_max_active sets the maximum scrub or scan read I/Os active to each device.
zfs_vdev_scrub_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler, scrub, resilver |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 2 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_scrub_min_active sets the minimum scrub or scan read I/Os active to each device.
zfs_vdev_scrub_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler, scrub, resilver |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_scrub_max_active |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum synchronous read I/Os active to each device.
zfs_vdev_sync_read_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_read_min_active sets the minimum synchronous read I/Os active to each device.
zfs_vdev_sync_read_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_sync_read_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_write_max_active
sets the maximum synchronous write I/Os active
to each device.
zfs_vdev_sync_write_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_write_min_active
sets the minimum synchronous write I/Os
active to each device.
zfs_vdev_sync_write_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_sync_write_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum number of queued allocations per top-level vdev expressed as
a percentage of zfs_vdev_async_write_max_active.
This allows the system to detect devices that are more capable of handling allocations
and to allocate more blocks to those devices. It also allows for dynamic
allocation distribution when devices are imbalanced as fuller devices
will tend to be slower than empty devices. Once the queue depth
reaches (zfs_vdev_queue_depth_pct
* zfs_vdev_async_write_max_active / 100)
the allocator stops allocating blocks on that top-level device and
switch to the next.
See also zio_dva_throttle_enabled
zfs_vdev_queue_depth_pct | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to UINT32_MAX |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
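The queue depth threshold described above can be sketched in Python. This is a minimal illustration, not the ZFS implementation: the function name `max_queued_allocations` is hypothetical, integer (floor) division is an assumption, and the value 10 used for zfs_vdev_async_write_max_active is only an example input.

```python
def max_queued_allocations(queue_depth_pct, async_write_max_active):
    """Queue depth at which the allocator moves to the next top-level vdev.

    Mirrors the formula in the text:
    zfs_vdev_queue_depth_pct * zfs_vdev_async_write_max_active / 100
    Floor division is an assumption made for this sketch.
    """
    return queue_depth_pct * async_write_max_active // 100


# 1,000 is the default zfs_vdev_queue_depth_pct from the table above;
# 10 is an example value for zfs_vdev_async_write_max_active.
print(max_queued_allocations(1000, 10))
```

With these inputs the threshold is 100 queued allocations per top-level vdev.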
Disable duplicate buffer eviction from ARC.
zfs_disable_dup_eviction | Notes |
---|---|
Tags | ARC, dedup |
When to change | TBD |
Data Type | boolean |
Range | 0=duplicate buffers can be evicted, 1=do not evict duplicate buffers |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5, deprecated in v0.7.0 |
Snapshots of filesystems are normally automounted under the filesystem's
.zfs/snapshot
subdirectory. When not in use, snapshots are unmounted
after zfs_expire_snapshot seconds.
zfs_expire_snapshot | Notes |
---|---|
Tags | filesystem, snapshot |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 0 disables automatic unmounting, maximum time is INT_MAX |
Default | 300 |
Change | Dynamic |
Versions Affected | v0.6.1 and later |
Allow the creation, removal, or renaming of entries in the .zfs/snapshot
subdirectory to cause the creation, destruction, or renaming of snapshots.
When enabled this functionality works both locally and over NFS exports
which have the "no_root_squash" option set.
zfs_admin_snapshot | Notes |
---|---|
Tags | filesystem, snapshot |
When to change | TBD |
Data Type | boolean |
Range | 0=do not allow snapshot manipulation via the filesystem, 1=allow snapshot manipulation via the filesystem |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Set additional debugging flags (see zfs_dbgmsg_enable)
flag value | symbolic name | description |
---|---|---|
0x1 | ZFS_DEBUG_DPRINTF | Enable dprintf entries in the debug log |
0x2 | ZFS_DEBUG_DBUF_VERIFY | Enable extra dbuf verifications |
0x4 | ZFS_DEBUG_DNODE_VERIFY | Enable extra dnode verifications |
0x8 | ZFS_DEBUG_SNAPNAMES | Enable snapshot name verification |
0x10 | ZFS_DEBUG_MODIFY | Check for illegally modified ARC buffers |
0x20 | ZFS_DEBUG_SPA | Enable spa_dbgmsg entries in the debug log |
0x40 | ZFS_DEBUG_ZIO_FREE | Enable verification of block frees |
0x80 | ZFS_DEBUG_HISTOGRAM_VERIFY | Enable extra spacemap histogram verifications |
0x100 | ZFS_DEBUG_METASLAB_VERIFY | Verify space accounting on disk matches in-core range_trees |
0x200 | ZFS_DEBUG_SET_ERROR | Enable SET_ERROR and dprintf entries in the debug log |
zfs_flags | Notes |
---|---|
Tags | debug |
When to change | When debugging ZFS |
Data Type | int |
Default | 0 no debug flags set, for debug builds: all except ZFS_DEBUG_DPRINTF and ZFS_DEBUG_SPA |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
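Because zfs_flags is a bitmask, a value read from SysFS is easier to interpret once it is decoded into symbolic names. The following is a minimal sketch using only the flag values from the table above; the helper names `decode_zfs_flags` and `encode_zfs_flags` are hypothetical and are not part of ZFS.

```python
# Flag values and symbolic names copied from the zfs_flags table.
ZFS_DEBUG_FLAGS = {
    0x1: "ZFS_DEBUG_DPRINTF",
    0x2: "ZFS_DEBUG_DBUF_VERIFY",
    0x4: "ZFS_DEBUG_DNODE_VERIFY",
    0x8: "ZFS_DEBUG_SNAPNAMES",
    0x10: "ZFS_DEBUG_MODIFY",
    0x20: "ZFS_DEBUG_SPA",
    0x40: "ZFS_DEBUG_ZIO_FREE",
    0x80: "ZFS_DEBUG_HISTOGRAM_VERIFY",
    0x100: "ZFS_DEBUG_METASLAB_VERIFY",
    0x200: "ZFS_DEBUG_SET_ERROR",
}


def decode_zfs_flags(value):
    """Return the symbolic names of all bits set in a zfs_flags value."""
    return [name for bit, name in ZFS_DEBUG_FLAGS.items() if value & bit]


def encode_zfs_flags(names):
    """Combine symbolic names into a single zfs_flags value."""
    by_name = {name: bit for bit, name in ZFS_DEBUG_FLAGS.items()}
    value = 0
    for name in names:
        value |= by_name[name]
    return value


print(decode_zfs_flags(0x210))
print(hex(encode_zfs_flags(["ZFS_DEBUG_DPRINTF", "ZFS_DEBUG_SPA"])))
```

The encoded value is what would be written to the zfs_flags parameter file.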
If destroy encounters an I/O error (EIO) while reading metadata (eg indirect
blocks), space referenced by the missing metadata cannot be freed.
Normally, this causes the background destroy to become "stalled", as the
destroy is unable to make forward progress. While in this stalled state,
all remaining space to free from the error-encountering filesystem is
temporarily leaked. Set zfs_free_leak_on_eio = 1
to ignore the EIO,
permanently leak the space from indirect blocks that can not be read,
and continue to free everything else that it can.
The default, stalling behavior is useful if the storage partially fails (eg some but not all I/Os fail), and then later recovers. In this case, we will be able to continue pool operations while it is partially failed, and when it recovers, we can continue to free the space, with no leaks. However, note that this case is rare.
Typically pools either:
1. fail completely (but perhaps temporarily, eg a top-level vdev going offline)
2. have localized, permanent errors (eg disk returns the wrong data due to bit flip or firmware bug)
In case (1), the zfs_free_leak_on_eio
setting does not matter because the
pool will be suspended and the sync thread will not be able to make
forward progress. In case (2), because the error is
permanent, the best we can do is leak the minimum amount of space.
Therefore, it is reasonable for zfs_free_leak_on_eio
to be set, but by default
the more conservative approach is taken, so that there is no
possibility of leaking space in the "partial temporary" failure case.
zfs_free_leak_on_eio | Notes |
---|---|
Tags | debug |
When to change | When debugging I/O errors during destroy |
Data Type | boolean |
Range | 0=normal behavior, 1=ignore error and permanently leak space |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
During a zfs destroy
operation using feature@async_destroy
a
minimum of zfs_free_min_time_ms
time will be spent working on freeing blocks
per txg commit.
zfs_free_min_time_ms | Notes |
---|---|
Tags | delete |
When to change | TBD |
Data Type | int |
Units | milliseconds |
Range | 1 to (zfs_txg_timeout * 1000) |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
If a pool does not have a log device, data blocks equal to or larger than
zfs_immediate_write_sz
are treated as if the dataset being written to had
the property setting logbias=throughput
Terminology note: logbias=throughput
writes the blocks in "indirect mode"
to the ZIL where the data is written to the pool and a pointer to the data
is written to the ZIL.
zfs_immediate_write_sz | Notes |
---|---|
Tags | ZIL |
When to change | TBD |
Data Type | long |
Units | bytes |
Range | 512 to 16,777,216 (valid block sizes) |
Default | 32,768 (32 KiB) |
Change | Dynamic |
Verification | Data blocks that exceed zfs_immediate_write_sz or are written as logbias=throughput increment the zil_itx_indirect_count entry in /proc/spl/kstat/zfs/zil |
Versions Affected | all |
ZFS supports logical record (block) sizes from 512 bytes to 16 MiB.
The benefits of larger blocks, and thus larger average I/O sizes, can be
weighed against the cost of copy-on-write of a large block to modify one byte.
Additionally, very large blocks can have a negative impact on both I/O latency
at the device level and the memory allocator. The zfs_max_recordsize
parameter limits the upper bound of the dataset volblocksize and recordsize
properties.
Larger blocks can be created by enabling the zpool large_blocks feature and
changing zfs_max_recordsize. Pools with larger blocks can always be
imported and used, regardless of the value of zfs_max_recordsize.
For 32-bit systems, zfs_max_recordsize
also limits the size of kernel virtual
memory caches used in the ZFS I/O pipeline (zio_buf_*
and zio_data_buf_*
).
See also the zpool
large_blocks
feature.
zfs_max_recordsize | Notes |
---|---|
Tags | filesystem, memory, volume |
When to change | To create datasets with larger volblocksize or recordsize |
Data Type | int |
Units | bytes |
Range | 512 to 16,777,216 (valid block sizes) |
Default | 1,048,576 |
Change | Dynamic, set prior to creating volumes or changing filesystem recordsize |
Versions Affected | v0.6.5 and later |
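The interaction between the valid block size range and zfs_max_recordsize can be sketched as a simple check. This is an illustration only: the function and constant names are hypothetical, and the power-of-two requirement is an assumption about recordsize and volblocksize values that is not stated in the table above.

```python
MIN_BLOCK_SIZE = 512                # lower bound from the Range row
MAX_BLOCK_SIZE = 16 * 1024 * 1024   # 16 MiB upper bound from the Range row


def is_allowed_block_size(size, zfs_max_recordsize=1048576):
    """Check a candidate recordsize/volblocksize against the limits.

    Assumes block sizes must be powers of two (an assumption of this
    sketch) and that zfs_max_recordsize caps the usable upper bound.
    """
    power_of_two = size > 0 and (size & (size - 1)) == 0
    upper = min(zfs_max_recordsize, MAX_BLOCK_SIZE)
    return power_of_two and MIN_BLOCK_SIZE <= size <= upper


print(is_allowed_block_size(131072))             # 128 KiB, default limit
print(is_allowed_block_size(2097152))            # 2 MiB, default limit
print(is_allowed_block_size(2097152, 16777216))  # 2 MiB, limit raised
```

Under the default limit of 1,048,576 bytes, a 2 MiB block size is rejected until zfs_max_recordsize is raised.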
zfs_mdcomp_disable
allows metadata compression to be disabled.
zfs_mdcomp_disable | Notes |
---|---|
Tags | CPU, metadata |
When to change | When CPU cycles cost less than I/O |
Data Type | boolean |
Range | 0=compress metadata, 1=do not compress metadata |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
Allow metaslabs to keep their active state as long as their fragmentation
percentage is less than or equal to this value. When writing, an active
metaslab whose fragmentation percentage exceeds
zfs_metaslab_fragmentation_threshold
is avoided, allowing metaslabs with less
fragmentation to be preferred.
Metaslab fragmentation is used to calculate the overall pool fragmentation
property value. However, individual metaslab fragmentation levels are
observable using zdb
with the -mm
option.
zfs_metaslab_fragmentation_threshold
works at the metaslab level and each
top-level vdev has approximately metaslabs_per_vdev metaslabs.
See also zfs_mg_fragmentation_threshold
zfs_metaslab_fragmentation_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | Testing metaslab allocation |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 70 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Metaslab groups (top-level vdevs) are considered eligible for allocations
if their fragmentation percentage metric is less than or equal to
zfs_mg_fragmentation_threshold
. If a metaslab group exceeds this threshold
then it will be skipped unless all metaslab groups within the metaslab class
have also crossed the zfs_mg_fragmentation_threshold
threshold.
zfs_mg_fragmentation_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | Testing metaslab allocation |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 85 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Metaslab groups (top-level vdevs) with free space percentage greater than
zfs_mg_noalloc_threshold
are eligible for new allocations.
If a metaslab group's free space is less than or equal to the
threshold, the allocator avoids allocating to that group
unless all groups in the pool have reached the threshold. Once all
metaslab groups have reached the threshold, all metaslab groups are allowed
to accept allocations. The default value of 0 disables the feature and causes
all metaslab groups to be eligible for allocations.
This parameter allows one to deal with pools having heavily imbalanced
vdevs such as would be the case when a new vdev has been added.
Setting the threshold to a non-zero percentage stops allocations
from being made to vdevs whose free space is at or below the specified percentage
and allows less-filled vdevs to acquire more allocations than they
otherwise would under the older zfs_mg_alloc_failures
facility.
zfs_mg_noalloc_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | To force rebalancing as top-level vdevs are added or expanded |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 0 (disabled) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
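The eligibility rule described above can be sketched as follows. This is a simplified model, not the allocator's code: the function name `eligible_metaslab_groups` and the dictionary of per-group free space percentages are hypothetical.

```python
def eligible_metaslab_groups(free_pct_by_group, noalloc_threshold=0):
    """Return the metaslab groups that may accept new allocations.

    free_pct_by_group maps a group name to its free space percentage.
    A threshold of 0 disables the feature, so every group is eligible.
    """
    groups = list(free_pct_by_group)
    if noalloc_threshold == 0:
        return groups
    above = [g for g in groups if free_pct_by_group[g] > noalloc_threshold]
    # Once every group has reached the threshold, all groups are allowed.
    return above if above else groups


# Hypothetical pool: an older, nearly full vdev and a newly added one.
print(eligible_metaslab_groups({"vdev0": 5, "vdev1": 60}, noalloc_threshold=10))
print(eligible_metaslab_groups({"vdev0": 5, "vdev1": 8}, noalloc_threshold=10))
```

In the first call only the newer vdev is eligible; in the second, both groups have reached the threshold, so both are eligible again.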
The pool multihost
multimodifier protection (MMP) subsystem can record
historical updates in the /proc/spl/kstat/zfs/POOL_NAME/multihost
file
for debugging purposes.
The number of lines of history is determined by zfs_multihost_history.
zfs_multihost_history | Notes |
---|---|
Tags | MMP, import |
When to change | When testing multihost feature |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_interval
controls the frequency of multihost writes performed
by the pool multihost multimodifier protection (MMP) subsystem.
The multihost write period is (zfs_multihost_interval
/ number of leaf-vdevs)
milliseconds.
Thus on average a multihost write will be issued for each leaf vdev every
zfs_multihost_interval
milliseconds. In practice, the observed period can
vary with the I/O load and this observed value is the delay which is stored in
the uberblock.
On import the multihost activity check waits a minimum amount of time
determined by (zfs_multihost_interval
* zfs_multihost_import_intervals)
with a lower bound of 1 second.
The activity check time may be further extended if the value of mmp delay
found in the best uberblock indicates actual multihost updates happened at
longer intervals than zfs_multihost_interval
Note: the multihost protection feature applies to storage devices that can be shared between multiple systems.
zfs_multihost_interval | Notes |
---|---|
Tags | MMP, import, vdev |
When to change | To optimize pool import time against possibility of simultaneous import by another system |
Data Type | ulong |
Units | milliseconds |
Range | 100 to ULONG_MAX |
Default | 1000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
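The two timing relationships above can be worked through with a short sketch. The function names are hypothetical, the random component of the activity check is deliberately left out, and the extension based on the mmp delay stored in the uberblock is not modeled.

```python
def multihost_write_period_ms(interval_ms, leaf_vdevs):
    """Time between successive multihost writes across the whole pool."""
    return interval_ms / leaf_vdevs


def min_activity_check_ms(interval_ms, import_intervals):
    """Minimum import activity check time, with the 1 second lower bound."""
    return max(1000, interval_ms * import_intervals)


# Defaults from the tables on this page: interval 1000 ms, 10 import intervals.
# The pool with 4 leaf vdevs is a hypothetical example.
print(multihost_write_period_ms(1000, 4))
print(min_activity_check_ms(1000, 10))
```

With the defaults, a pool with four leaf vdevs issues a multihost write roughly every 250 ms, and an import waits at least 10 seconds.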
zfs_multihost_import_intervals
controls the duration of the activity test on
pool import for the multihost multimodifier protection (MMP) subsystem.
The activity test can be expected to take a minimum time of
(zfs_multihost_import_intervals * zfs_multihost_interval * random(25%))
milliseconds. The random period of up to 25% improves simultaneous import
detection. For example, if two hosts are rebooted at the same time and
automatically attempt to import the pool, then it is highly probable that
one host will win.
Smaller values of zfs_multihost_import_intervals
reduce the
import time but increase the risk of failing to detect an active pool.
The total activity check time is never allowed to drop below one second.
Note: the multihost protection feature applies to storage devices that can be shared between multiple systems.
zfs_multihost_import_intervals | Notes |
---|---|
Tags | MMP, import |
When to change | TBD |
Data Type | uint |
Units | intervals |
Range | 1 to UINT_MAX |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_fail_intervals
controls the behavior of the pool when
write failures are detected in the multihost multimodifier protection (MMP)
subsystem.
If zfs_multihost_fail_intervals = 0
then multihost write failures are ignored.
The write failures are reported to the ZFS event daemon (zed
) which
can take action such as suspending the pool or offlining a device.
If zfs_multihost_fail_intervals > 0
then sequential multihost write failures
will cause the pool to be suspended. This occurs when
(zfs_multihost_fail_intervals
* zfs_multihost_interval)
milliseconds have passed since the last successful multihost write.
This guarantees the activity test will see multihost writes if
another system attempts to import the pool.
zfs_multihost_fail_intervals | Notes |
---|---|
Tags | MMP, import |
When to change | TBD |
Data Type | uint |
Units | intervals |
Range | 0 to UINT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
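The suspension deadline described above is a simple product, sketched here for illustration. The function name `suspend_deadline_ms` is hypothetical, and returning `None` to represent "failures are ignored" is a convention of this sketch only.

```python
def suspend_deadline_ms(fail_intervals, interval_ms):
    """Milliseconds without a successful multihost write before suspension.

    Returns None when zfs_multihost_fail_intervals is 0, meaning multihost
    write failures do not suspend the pool.
    """
    if fail_intervals == 0:
        return None
    return fail_intervals * interval_ms


# Defaults from this page: fail_intervals = 5, interval = 1000 ms.
print(suspend_deadline_ms(5, 1000))
print(suspend_deadline_ms(0, 1000))
```

With the defaults, the pool is suspended after 5 seconds without a successful multihost write.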
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
overwhelmed by high rates of error reports which can be generated by failing,
high-performance devices. zfs_delays_per_second
limits the rate of
delay events reported to zed.
zfs_delays_per_second | Notes |
---|---|
Tags | zed, delay |
When to change | If processing delay events at a higher rate is desired |
Data Type | uint |
Units | events per second |
Range | 0 to UINT_MAX |
Default | 20 |
Change | Dynamic |
Versions Affected | v0.7.7 and later |
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
overwhelmed by high rates of error reports which can be generated by failing,
high-performance devices. zfs_checksums_per_second
limits the rate of
checksum events reported to zed.
Note: do not set this value lower than the SERD limit for checksum
in zed.
By default, checksum_N
= 10 and checksum_T
= 10 minutes, resulting in a
practical lower limit of 1.
zfs_checksums_per_second | Notes |
---|---|
Tags | zed, checksum |
When to change | If processing checksum error events at a higher rate is desired |
Data Type | uint |
Units | events per second |
Range | 0 to UINT_MAX |
Default | 20 |
Change | Dynamic |
Versions Affected | v0.7.7 and later |
When zfs_no_scrub_io = 1
scrubs do not actually scrub data and
simply do a metadata crawl of the pool instead.
zfs_no_scrub_io | Notes |
---|---|
Tags | scrub |
When to change | Testing scrub feature |
Data Type | boolean |
Range | 0=perform scrub I/O, 1=do not perform scrub I/O |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
When zfs_no_scrub_prefetch = 1
, prefetch is disabled for scrub I/Os.
zfs_no_scrub_prefetch | Notes |
---|---|
Tags | prefetch, scrub |
When to change | Testing scrub feature |
Data Type | boolean |
Range | 0=prefetch scrub I/Os, 1=do not prefetch scrub I/Os |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
ZFS uses barriers (volatile cache flush commands) to ensure data is committed to permanent media by devices. This ensures consistent on-media state for devices where caches are volatile (eg HDDs).
For devices with nonvolatile caches, the cache flush operation can be a no-op. However, in some RAID arrays, cache flushes can cause the entire cache to be flushed to the backing devices.
To ensure on-media consistency, keep cache flush enabled.
zfs_nocacheflush | Notes |
---|---|
Tags | disks |
When to change | If the storage device has nonvolatile cache, then disabling cache flush can save the cost of occasional cache flush commands. |
Data Type | boolean |
Range | 0=send cache flush commands, 1=do not send cache flush commands |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
The NOP-write feature is enabled by default when a cryptographically-secure
checksum algorithm is in use by the dataset. zfs_nopwrite_enabled
allows the
NOP-write feature to be completely disabled.
zfs_nopwrite_enabled | Notes |
---|---|
Tags | checksum, debug |
When to change | TBD |
Data Type | boolean |
Range | 0=disable NOP-write feature, 1=enable NOP-write feature |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
zfs_dmu_offset_next_sync
enables forcing txg sync to find holes.
This causes ZFS to act like older versions when SEEK_HOLE
or SEEK_DATA
flags
are used: a dirty dnode causes txgs to be synced so the previous data
can be found.
zfs_dmu_offset_next_sync | Notes |
---|---|
Tags | DMU |
When to change | TBD |
Data Type | boolean |
Range | 0=do not force txg sync to find holes, 1=force txg sync to find holes |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_pd_bytes_max
limits the number of bytes prefetched during a pool traversal
(eg zfs send
or other data crawling operations). These prefetches are
referred to as "prescient prefetches" and always have a 100% hit rate.
The traversal operations do not use the default data or metadata prefetcher.
zfs_pd_bytes_max | Notes |
---|---|
Tags | prefetch, send |
When to change | TBD |
Data Type | int32 |
Units | bytes |
Range | 0 to INT32_MAX |
Default | 52,428,800 (50 MiB) |
Change | Dynamic |
Versions Affected | TBD |
zfs_per_txg_dirty_frees_percent
as a percentage of zfs_dirty_data_max
controls the percentage of dirtied blocks from frees in one txg.
After the threshold is crossed, additional dirty blocks from frees
wait until the next txg.
Thus, when deleting large files, filling consecutive txgs with deletes/frees
does not throttle other, perhaps more important, writes.
A side effect of this throttle can impact zfs receive
workloads that contain a
large number of frees while the ignore_hole_birth optimization is
disabled. The symptom is that the receive workload causes an increase
in the frequency of txg commits. Since txg commits also flush data from volatile
caches in HDDs to media, HDD performance can be negatively impacted. Also, since
the frees do not consume much bandwidth over the pipe, the pipe can appear to stall.
Thus the overall progress of receives is slower than expected.
A value of zero will disable this throttle.
zfs_per_txg_dirty_frees_percent | Notes |
---|---|
Tags | delete |
When to change | For zfs receive workloads, consider increasing or disabling. See section "ZFS I/O SCHEDULER" |
Data Type | ulong |
Units | percent |
Range | 0 to 100 |
Default | 30 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
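The throttle point can be estimated from the two parameters involved. This is a rough sketch: the function name is hypothetical, and expressing the limit in bytes of zfs_dirty_data_max is an interpretation of the description above rather than a statement about the implementation.

```python
def dirty_frees_limit_bytes(dirty_data_max, per_txg_dirty_frees_percent=30):
    """Approximate amount of dirty data from frees allowed in one txg.

    Returns None when the percentage is 0, which disables the throttle.
    """
    if per_txg_dirty_frees_percent == 0:
        return None
    return dirty_data_max * per_txg_dirty_frees_percent // 100


# Hypothetical zfs_dirty_data_max of 4 GiB with the default of 30 percent.
limit = dirty_frees_limit_bytes(4 * 1024**3)
print(limit, "bytes")
```

Raising the percentage, or setting it to 0, moves or removes this limit for receive workloads dominated by frees.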
zfs_prefetch_disable
controls the predictive prefetcher.
Note that it leaves "prescient" prefetch (eg prefetch for zfs send
) intact
(see zfs_pd_bytes_max)
zfs_prefetch_disable | Notes |
---|---|
Tags | prefetch |
When to change | In some cases where the workload is completely random reads, overall performance can be better if prefetch is disabled |
Data Type | boolean |
Range | 0=prefetch enabled, 1=prefetch disabled |
Default | 0 |
Change | Dynamic |
Verification | prefetch efficacy is observed by arcstat, arc_summary, and the relevant entries in /proc/spl/kstat/zfs/arcstats |
Versions Affected | all |
zfs_read_chunk_size
is the limit for ZFS filesystem reads. If an application
issues a read()
larger than zfs_read_chunk_size
, then the read()
is divided
into multiple operations no larger than zfs_read_chunk_size
zfs_read_chunk_size | Notes |
---|---|
Tags | filesystem |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 1,048,576 |
Change | Dynamic |
Versions Affected | all |
Historical statistics for the last zfs_read_history
reads are available in
/proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history | Notes |
---|---|
Tags | debug |
When to change | To observe read operation details |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
When zfs_read_history > 0
, zfs_read_history_hits controls whether ARC hits are
displayed in the read history file, /proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history_hits | Notes |
---|---|
Tags | debug |
When to change | To observe read operation details with ARC hits |
Data Type | boolean |
Range | 0=do not include data for ARC hits, 1=include ARC hit data |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
zfs_recover
can be set to true (1) to attempt to recover from
otherwise-fatal errors, typically caused by on-disk corruption.
When set, calls to zfs_panic_recover()
will turn into warning messages
rather than calling panic()
zfs_recover | Notes |
---|---|
Tags | import |
When to change | zfs_recover should only be used as a last resort, as it typically results in leaked space, or worse |
Data Type | boolean |
Range | 0=normal operation, 1=attempt recovery zpool import |
Default | 0 |
Change | Dynamic |
Verification | check output of dmesg and other logs for details |
Versions Affected | v0.6.4 or later |
Resilvers are processed by the sync thread in syncing context. While
resilvering, ZFS spends at least zfs_resilver_min_time_ms
time working on a
resilver between txg commits.
See also zfs_txg_timeout.
zfs_resilver_min_time_ms | Notes |
---|---|
Tags | resilver |
When to change | In some resilvering cases, increasing zfs_resilver_min_time_ms can result in faster completion |
Data Type | int |
Units | milliseconds |
Range | 1 to zfs_txg_timeout converted to milliseconds |
Default | 3,000 |
Change | Dynamic |
Versions Affected | all |
Scrubs are processed by the sync thread in syncing context. While
scrubbing, ZFS spends at least zfs_scrub_min_time_ms
time working on a
scrub between txg commits.
See also zfs_txg_timeout.
zfs_scrub_min_time_ms | Notes |
---|---|
Tags | scrub |
When to change | In some scrub cases, increasing zfs_scrub_min_time_ms can result in faster completion |
Data Type | int |
Units | milliseconds |
Range | 1 to zfs_txg_timeout converted to milliseconds |
Default | 1,000 |
Change | Dynamic |
Versions Affected | all |
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verification I/Os to disk
every zfs_scan_checkpoint_intval
seconds.
zfs_scan_checkpoint_intval | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 1 to INT_MAX |
Default | 7,200 (2 hours) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
This tunable affects how scrub and resilver I/O segments are ordered. A higher number indicates that we care more about how filled in a segment is, while a lower number indicates we care more about the size of the extent without considering the gaps within a segment.
zfs_scan_fill_weight | Notes |
---|---|
Tags | resilver, scrub |
When to change | Testing sequential scrub and resilver |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 3 |
Change | Prior to zfs module load |
Versions Affected | v0.8.0 and later |
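Parameters marked "Prior to zfs module load", such as zfs_scan_fill_weight, are set through a modprobe.d file using the `options zfs PARAMETER=VALUE` format described at the top of this page. The sketch below only builds the text of such a line; the function name `modprobe_options_line` is hypothetical and nothing is written to disk.

```python
def modprobe_options_line(params):
    """Build one 'options zfs ...' line from a dict of parameter values."""
    assignments = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options zfs {assignments}"


# Example values only; these happen to be the documented defaults.
line = modprobe_options_line({
    "zfs_scan_fill_weight": 3,
    "zfs_sync_taskq_batch_pct": 75,
})
print(line)
```

The resulting line would be placed in a file such as /etc/modprobe.d/zfs.conf so that it takes effect the next time the module is loaded.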
zfs_scan_issue_strategy
controls the order of data verification while scrubbing or
resilvering.
value | description |
---|---|
0 | the file system will use strategy 1 during normal verification and strategy 2 while taking a checkpoint |
1 | data is verified as sequentially as possible, given the amount of memory reserved for scrubbing (see zfs_scan_mem_lim_fact). This can improve scrub performance if the pool's data is heavily fragmented. |
2 | the largest mostly-contiguous chunk of found data is verified first. By deferring scrubbing of small segments, we may later find adjacent data to coalesce and increase the segment size. |
zfs_scan_issue_strategy | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | enum |
Range | 0 to 2 |
Default | 0 |
Change | Dynamic |
Versions Affected | TBD |
Setting zfs_scan_legacy = 1
enables the legacy scan and scrub behavior
instead of the newer sequential behavior.
zfs_scan_legacy | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | TBD |
Units | TBD |
Range | 0=use new method: scrubs and resilvers gather metadata in memory before issuing sequential I/O, 1=use legacy algorithm: I/O is initiated as soon as it is discovered |
Default | 0 |
Change | Dynamic, however changing to 0 does not affect in-progress scrubs or resilvers |
Versions Affected | v0.8.0 and later |
zfs_scan_max_ext_gap
limits the largest gap in bytes between scrub and
resilver I/Os that will still be considered sequential for sorting purposes.
zfs_scan_max_ext_gap | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 2,097,152 (2 MiB) |
Change | Dynamic, however changing to 0 does not affect in-progress scrubs or resilvers |
Versions Affected | v0.8.0 and later |
zfs_scan_mem_lim_fact
limits the maximum fraction of RAM used for I/O sorting
by sequential scan algorithm.
When the limit is reached scanning metadata is stopped and
data verification I/O is started.
Data verification I/O continues until the memory used by the sorting
algorithm drops below zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact | Notes |
---|---|
Tags | memory, resilver, scrub |
When to change | TBD |
Data Type | int |
Units | divisor of physical RAM |
Range | TBD |
Default | 20 (physical RAM / 20 or 5%) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_scan_mem_lim_soft_fact
sets the fraction of the hard limit,
zfs_scan_mem_lim_fact, used to determine the RAM soft limit
for I/O sorting by the sequential scan algorithm.
After zfs_scan_mem_lim_fact has been reached, metadata scanning is stopped
until the RAM usage drops below zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_soft_fact | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | int |
Units | divisor of (physical RAM / zfs_scan_mem_lim_fact) |
Range | 1 to INT_MAX |
Default | 20 (for default zfs_scan_mem_lim_fact, 0.25% of physical RAM) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
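Because both scan memory limits are expressed as divisors, it can help to work out the resulting sizes. The following sketch follows the Units rows of the two tables above; the function name `scan_memory_limits` is hypothetical and floor division is an assumption.

```python
def scan_memory_limits(physical_ram_bytes, mem_lim_fact=20, mem_lim_soft_fact=20):
    """Return (hard_limit, soft_limit) in bytes for scan I/O sorting.

    hard limit = physical RAM / zfs_scan_mem_lim_fact
    soft limit = hard limit / zfs_scan_mem_lim_soft_fact
    """
    hard = physical_ram_bytes // mem_lim_fact
    soft = hard // mem_lim_soft_fact
    return hard, soft


# Hypothetical system with 64 GiB of RAM and default divisors.
ram = 64 * 1024**3
hard, soft = scan_memory_limits(ram)
print(f"hard limit: {hard / ram:.2%} of RAM")
print(f"soft limit: {soft / ram:.2%} of RAM")
```

With the default divisors this gives the 5% hard limit and 0.25% soft limit noted in the Default rows.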
zfs_scan_vdev_limit
is the maximum amount of data that can be concurrently
issued for scrubs and resilvers per leaf vdev.
zfs_scan_vdev_limit
attempts to strike a balance between keeping the leaf
vdev queues full of I/Os while not overflowing the queues causing high latency
resulting in long txg sync times.
While zfs_scan_vdev_limit
represents a bandwidth limit, the existing I/O
limit of zfs_vdev_scrub_max_active remains in effect, too.
zfs_scan_vdev_limit | Notes |
---|---|
Tags | resilver, scrub, vdev |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 4,194,304 (4 MiB) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_send_corrupt_data
enables zfs send
to send corrupt data by
ignoring read and checksum errors. The corrupted or unreadable blocks are
replaced with the value 0x2f5baddb10c
(ZFS bad block)
zfs_send_corrupt_data | Notes |
---|---|
Tags | send |
When to change | When data corruption exists and an attempt to recover at least some data via zfs send is needed |
Data Type | boolean |
Range | 0=do not send corrupt data, 1=replace corrupt data with cookie |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_deferred_free
, frees are no longer processed and must wait
for the next SPA sync.
The zfs_sync_pass_deferred_free
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
The zfs_sync_pass_deferred_free
pass must be greater than 1 to ensure that
regular blocks are not deferred.
zfs_sync_pass_deferred_free | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_dont_compress
, data block compression is no longer
processed and must wait for the next SPA sync.
The zfs_sync_pass_dont_compress
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
zfs_sync_pass_dont_compress | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | all |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_rewrite
, blocks can be split into gang blocks.
The zfs_sync_pass_rewrite
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
zfs_sync_pass_rewrite | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
zfs_sync_taskq_batch_pct
controls the number of threads used by the
DSL pool sync taskq, dp_sync_taskq
zfs_sync_taskq_batch_pct | Notes |
---|---|
Tags | SPA |
When to change | To adjust the number of dp_sync_taskq threads |
Data Type | int |
Units | percent of number of online CPUs |
Range | 1 to 100 |
Default | 75 |
Change | Prior to zfs module load |
Versions Affected | v0.7.0 and later |
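Since the unit is a percentage of online CPUs, the resulting thread count depends on the machine. The sketch below assumes the percentage is applied with floor division and that at least one thread is always created; both are assumptions of this illustration, and the function name `sync_taskq_threads` is hypothetical.

```python
def sync_taskq_threads(online_cpus, batch_pct=75):
    """Estimate the number of dp_sync_taskq threads.

    Assumes floor(online_cpus * pct / 100) with a minimum of one thread.
    """
    return max(1, online_cpus * batch_pct // 100)


for cpus in (1, 4, 16):
    print(cpus, "CPUs ->", sync_taskq_threads(cpus), "threads")
```

Because the parameter is read prior to module load, a changed value only affects the thread count after the module is reloaded.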
Historical statistics for the last zfs_txg_history
txg commits are available
in /proc/spl/kstat/zfs/POOL_NAME/txgs
The work required to measure the txg commit (SPA statistics) is low. However, for debugging purposes, it can be useful to observe the SPA statistics.
zfs_txg_history | Notes |
---|---|
Tags | debug |
When to change | To observe details of SPA sync behavior. |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 for version v0.6.0 to v0.7.6, 100 for version v0.8.0 |
Change | Dynamic |
Versions Affected | all |
The open txg is committed to the pool periodically (SPA sync) and
zfs_txg_timeout
represents the default target upper limit.
txg commits can occur more frequently, and a rapid rate of txg commits often indicates a busy write workload, quota limits being reached, or critically low free space.
txg commits can also take longer than zfs_txg_timeout
if the ZFS write throttle
is not properly tuned or the time to sync is otherwise delayed (e.g. slow device).
See also zfs_dirty_data_sync and zfs_txg_history.
zfs_txg_timeout | Notes |
---|---|
Tags | SPA, ZIO_scheduler |
When to change | To optimize the work done by txg commit relative to the pool requirements. See also section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | seconds |
Range | 1 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | all |
To reduce IOPs, small, adjacent I/Os can be aggregated (coalesced) into a
large I/O.
For reads, aggregations occur across small adjacency gaps.
For writes, aggregation can occur at the ZFS or disk level.
zfs_vdev_aggregation_limit
is the upper bound on the size of the larger,
aggregated I/O.
zfs_vdev_aggregation_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | If the workload does not benefit from aggregation, the zfs_vdev_aggregation_limit can be reduced to avoid aggregation attempts |
Data Type | int |
Units | bytes |
Range | 0 to 131,072 (default) or 16,777,216 (if zpool large_blocks feature is enabled) |
Default | 131,072 (128 KiB) |
Change | Dynamic |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some
cases actually harmful. Thus it is disabled by setting
zfs_vdev_cache_size = 0.
zfs_vdev_cache_size
is the size of the vdev cache.
zfs_vdev_cache_size | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 0 (vdev cache is disabled) |
Change | Dynamic |
Verification | vdev cache statistics are available in the /proc/spl/kstat/zfs/vdev_cache_stats file |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some cases actually harmful. Thus it is disabled by setting the zfs_vdev_cache_size to zero. This related tunable is, by default, inoperative.
All read I/Os smaller than zfs_vdev_cache_max are turned into
(1 << zfs_vdev_cache_bshift
) byte reads by the vdev cache. At most
zfs_vdev_cache_size bytes will be kept in each vdev's cache.
zfs_vdev_cache_bshift | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | shift |
Range | 1 to INT_MAX |
Default | 16 (65,536 bytes) |
Change | Dynamic |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some cases actually harmful. Thus it is disabled by setting the zfs_vdev_cache_size to zero. This related tunable is, by default, inoperative.
All read I/Os smaller than zfs_vdev_cache_max will be turned into
(1 << zfs_vdev_cache_bshift) byte reads by the vdev cache.
At most zfs_vdev_cache_size
bytes will be kept in each vdev's cache.
zfs_vdev_cache_max | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | bytes |
Range | 512 to INT_MAX |
Default | 16,384 (16 KiB) |
Change | Dynamic |
Versions Affected | all |
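The read inflation described for the three vdev cache tunables can be sketched as follows. This is an illustration only: the function name is hypothetical, aligning the inflated read down to a block boundary is an assumption, and the sketch ignores the fact that the cache is disabled by default (zfs_vdev_cache_size = 0).

```python
def vdev_cache_read(offset, size, cache_max=16384, cache_bshift=16):
    """Return the (offset, size) issued to the device when the vdev cache is active."""
    if size >= cache_max:
        # Reads of zfs_vdev_cache_max bytes or more are not inflated.
        return offset, size
    block = 1 << cache_bshift  # 65,536 bytes with the default shift of 16
    # Assumption: the inflated read starts at the enclosing block boundary.
    return offset - (offset % block), block
```

With the defaults, a 4 KiB read at offset 70,000 would become a 64 KiB read starting at offset 65,536.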
The mirror read algorithm uses current load and an incremental weighting value
to determine the vdev to service a read operation. Lower values determine
the preferred vdev.
The weighting value is zfs_vdev_mirror_rotating_inc
for rotating media and
zfs_vdev_mirror_non_rotating_inc for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_inc | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | Increasing for mirrors with both rotating and nonrotating media more strongly favors the nonrotating media |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The mirror read algorithm uses current load and an incremental weighting value
to determine the vdev to service a read operation. Lower values determine
the preferred vdev.
The weighting value is zfs_vdev_mirror_rotating_inc for rotating media and
zfs_vdev_mirror_non_rotating_inc
for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_inc | Notes |
---|---|
Tags | vdev, mirror, SSD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For rotating media in a mirror, if the next I/O offset is within
zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is incremented by (zfs_vdev_mirror_rotating_seek_inc / 2
).
Otherwise the weighting factor is increased by zfs_vdev_mirror_rotating_seek_inc
.
This algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_seek_inc | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For rotating media in a mirror, if the next I/O offset is within
zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is
incremented by (zfs_vdev_mirror_rotating_seek_inc / 2
).
Otherwise the weighting factor is increased by zfs_vdev_mirror_rotating_seek_inc
.
This algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_seek_offset | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 1,048,576 (1 MiB) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
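The seek weighting for rotating media can be sketched as below. This is a simplified illustration, not the kernel implementation: the function names and the tuple layout are hypothetical, integer division is assumed for the halved increment, and the contribution of zfs_vdev_mirror_rotating_inc is left out.

```python
def seek_weight(load, last_offset, next_offset,
                seek_inc=5, seek_offset=1 << 20):
    """Weight of one rotating mirror child for a read at next_offset."""
    distance = abs(next_offset - last_offset)
    if distance < seek_offset:
        # Nearby I/O: only half of the seek increment is added.
        return load + seek_inc // 2
    return load + seek_inc


def pick_mirror_child(children, next_offset):
    """children is a list of (name, pending_load, last_offset) tuples."""
    # Lower weights are preferred.
    return min(children, key=lambda c: seek_weight(c[1], c[2], next_offset))[0]
```

For two equally loaded children, the one whose last I/O was within 1 MiB of the next offset gets the lower weight and services the read.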
For nonrotating media in a mirror, a seek penalty is applied because sequential I/Os can be aggregated into fewer operations, which avoids unnecessary per-command overhead and often boosts performance.
Verify the rotational setting described by a block device in SysFS by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_seek_inc | Notes |
---|---|
Tags | vdev, mirror, SSD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into a
large I/O.
For reads, aggregations occur across small adjacency gaps where
the gap is less than zfs_vdev_read_gap_limit.
zfs_vdev_read_gap_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 32,768 (32 KiB) |
Change | Dynamic |
Versions Affected | all |
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into a
large I/O.
For writes, aggregations occur across small adjacency gaps where
the gap is less than zfs_vdev_write_gap_limit.
zfs_vdev_write_gap_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 4,096 (4 KiB) |
Change | Dynamic |
Versions Affected | all |
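The interaction between the gap limits and zfs_vdev_aggregation_limit can be sketched as a simple merge over sorted I/Os. This is a toy model under stated assumptions: the function name is hypothetical, I/Os are plain (offset, size) pairs, and a gap exactly equal to the limit is treated as mergeable, which the text above does not specify.

```python
def aggregate(ios, gap_limit, aggregation_limit=131072):
    """Merge (offset, size) I/Os whose gaps fit within gap_limit."""
    merged = []
    for offset, size in sorted(ios):
        if merged:
            start, length = merged[-1]
            gap = offset - (start + length)
            span = max(start + length, offset + size) - start
            # Merge only if the gap is small enough and the combined
            # I/O stays within the aggregation limit.
            if gap <= gap_limit and span <= aggregation_limit:
                merged[-1] = (start, span)
                continue
        merged.append((offset, size))
    return merged
```

With the default read gap limit of 32 KiB, two 4 KiB reads separated by 4 KiB would be issued as one 12 KiB read, while a read 1 MiB away stays separate.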
When the pool is imported, for whole disk vdevs, the block device I/O
scheduler is set to zfs_vdev_scheduler
.
The most common schedulers are: noop, cfq, bfq, anis
option to a non-zero value will override the default.
A value of 0 represents the default setting of the larger of 1/64 of physical memory or 512 KiB. However, once changed, dynamically setting zfs_arc_sys_free to 0 will not return to the default.
zfs_arc_sys_free | Notes |
---|---|
Tags | ARC, memory |
When to change | Change if more free memory is desired as a margin against memory demand by applications |
Data Type | ulong |
Units | bytes |
Range | 0 to ULONG_MAX |
Default | 0 (default to larger of 1/64 of physical memory or 512 KiB) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
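The default described for zfs_arc_sys_free can be worked through with a short sketch. The function name is hypothetical and the code only models the load-time default; it does not model the caveat that writing 0 at runtime fails to restore the default.

```python
def arc_sys_free_target(physical_memory_bytes, zfs_arc_sys_free=0):
    """Bytes of free memory targeted, given the tunable's value."""
    if zfs_arc_sys_free != 0:
        # A non-zero setting overrides the computed default.
        return zfs_arc_sys_free
    # Default: the larger of 1/64 of physical memory or 512 KiB.
    return max(physical_memory_bytes // 64, 512 * 1024)
```

On a 16 GiB system the default works out to 256 MiB.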
Disable reading zpool.cache file (see spa_config_path) when loading the zfs module.
zfs_autoimport_disable | Notes |
---|---|
Tags | import |
When to change | Leave as default so that zfs behaves as other Linux kernel modules |
Data Type | boolean |
Range | 0=read zpool.cache at module load, 1=do not read zpool.cache at module load |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_commit_timeout_pct
controls the amount of time that a log (ZIL) write
block (lwb) remains "open" when it isn't "full" and it has a thread waiting
to commit to stable storage.
The timeout is scaled based on a percentage of the last lwb
latency to avoid significantly impacting the latency of each individual
intent log transaction (itx).
zfs_commit_timeout_pct | Notes |
---|---|
Tags | ZIL |
When to change | TBD |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.8.0 |
Internally ZFS keeps a small log to facilitate debugging.
The contents of the log are in the /proc/spl/kstat/zfs/dbgmsg
file.
Writing 0 to /proc/spl/kstat/zfs/dbgmsg
file clears the log.
See also zfs_dbgmsg_maxsize
zfs_dbgmsg_enable | Notes |
---|---|
Tags | debug |
When to change | To view ZFS internal debug log |
Data Type | boolean |
Range | 0=do not log debug messages, 1=log debug messages |
Default | 0 (1 for debug builds) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
The /proc/spl/kstat/zfs/dbgmsg
file size limit is set by
zfs_dbgmsg_maxsize.
See also zfs_dbgmsg_enable
zfs_dbgmsg_maxsize | Notes |
---|---|
Tags | debug |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 4 MiB |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
The zfs_dbuf_state_index
feature is currently unused. It is normally used
for controlling values in the /proc/spl/kstat/zfs/dbufs
file.
zfs_dbuf_state_index | Notes |
---|---|
Tags | debug |
When to change | Do not change |
Data Type | int |
Units | TBD |
Range | TBD |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
When a pool sync operation takes longer than zfs_deadman_synctime_ms
milliseconds, a "slow spa_sync" message is logged to the debug log
(see zfs_dbgmsg_enable). If zfs_deadman_enabled
is
set to 1, then all pending IO operations are also checked and if any haven't
completed within zfs_deadman_synctime_ms milliseconds, a "SLOW IO" message
is logged to the debug log and a "deadman" system event (see zpool events
command) with the details of the hung IO is posted.
zfs_deadman_enabled | Notes |
---|---|
Tags | debug |
When to change | To disable logging of slow I/O |
Data Type | boolean |
Range | 0=do not log slow I/O, 1=log slow I/O |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.8.0 |
Once a pool sync operation has taken longer than zfs_deadman_synctime_ms milliseconds, continue to check for slow operations every zfs_deadman_checktime_ms milliseconds.
zfs_deadman_checktime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 60,000 (1 minute) |
Change | Dynamic |
Versions Affected | v0.8.0 |
When an individual I/O takes longer than zfs_deadman_ziotime_ms
milliseconds,
then the operation is considered to be "hung". If zfs_deadman_enabled
is set then the deadman behaviour is invoked as described by the
zfs_deadman_failmode option.
zfs_deadman_ziotime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 300,000 (5 minutes) |
Change | Dynamic |
Versions Affected | v0.8.0 |
The I/O deadman timer expiration time has two meanings:
- determines when the spa_deadman() logic should fire, indicating the txg sync has not completed in a timely manner
- determines if an I/O is considered "hung"
In version v0.8.0, any I/O that has not completed in zfs_deadman_synctime_ms
is considered "hung" resulting in one of three behaviors controlled by the
zfs_deadman_failmode parameter.
zfs_deadman_synctime_ms
takes effect if zfs_deadman_enabled = 1.
zfs_deadman_synctime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to |
Default | 600,000 (10 minutes) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
zfs_deadman_failmode controls the behavior of the I/O deadman timer when it detects a "hung" I/O. Valid values are:
- wait - Wait for the "hung" I/O (default)
- continue - Attempt to recover from a "hung" I/O
- panic - Panic the system
zfs_deadman_failmode | Notes |
---|---|
Tags | debug |
When to change | In some cluster cases, panic can be appropriate |
Data Type | string |
Range | wait, continue, or panic |
Default | wait |
Change | Dynamic |
Versions Affected | v0.8.0 |
ZFS can prefetch deduplication table (DDT) entries. zfs_dedup_prefetch
allows
DDT prefetches to be enabled.
zfs_dedup_prefetch | Notes |
---|---|
Tags | prefetch, memory |
When to change | For systems with limited RAM using the dedup feature, disabling deduplication table prefetch can reduce memory pressure |
Data Type | boolean |
Range | 0=do not prefetch, 1=prefetch dedup table entries |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
zfs_delete_blocks
defines a large file for the purposes of delete.
Files containing more than zfs_delete_blocks
blocks will be deleted asynchronously
while smaller files are deleted synchronously.
Decreasing this value reduces the time spent in an unlink(2)
system call at
the expense of a longer delay before the freed space is available.
The zfs_delete_blocks
value is specified in blocks, not bytes. The size of
blocks can vary and is ultimately limited by the filesystem's recordsize
property.
zfs_delete_blocks | Notes |
---|---|
Tags | filesystem, delete |
When to change | If applications delete large files and blocking on unlink(2) is not desired |
Data Type | ulong |
Units | blocks |
Range | 1 to ULONG_MAX |
Default | 20,480 |
Change | Dynamic |
Versions Affected | all |
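Because the threshold is expressed in blocks, the equivalent file size depends on the block size. The following sketch, with hypothetical function names, shows the decision and an approximate byte threshold under the assumption that every block is a full recordsize block.

```python
def unlink_mode(file_blocks, zfs_delete_blocks=20480):
    """Classify a delete as synchronous or asynchronous by block count."""
    if file_blocks > zfs_delete_blocks:
        return "asynchronous"
    return "synchronous"


def threshold_bytes(recordsize, zfs_delete_blocks=20480):
    """Approximate file size at the threshold, assuming full-size blocks."""
    return recordsize * zfs_delete_blocks
```

With the default of 20,480 blocks and a 128 KiB recordsize, the threshold is about 2.5 GiB.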
The ZFS write throttle begins to delay each transaction when the amount of
dirty data reaches the threshold zfs_delay_min_dirty_percent
of
zfs_dirty_data_max.
This value should be >= zfs_vdev_async_write_active_max_dirty_percent.
zfs_delay_min_dirty_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 60 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_delay_scale
controls how quickly the ZFS write throttle transaction
delay approaches infinity.
Larger values cause longer delays for a given amount of dirty data.
For the smoothest delay, this value should be about 1 billion divided
by the maximum number of write operations per second the pool can sustain.
The throttle will smoothly handle between 10x and 1/10th zfs_delay_scale
.
Note: zfs_delay_scale
* zfs_dirty_data_max must be < 2^64.
zfs_delay_scale | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | scalar (nanoseconds) |
Range | 0 to ULONG_MAX |
Default | 500,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
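The guidance above reduces to two small calculations, sketched here with hypothetical helper names: a suggested value of 1 billion divided by the pool's sustainable write operations per second, and a check of the overflow constraint.

```python
def suggested_delay_scale(max_write_iops):
    """1 billion divided by the sustainable write operations per second."""
    return 1_000_000_000 // max_write_iops


def delay_scale_is_safe(delay_scale, dirty_data_max):
    """Check that zfs_delay_scale * zfs_dirty_data_max stays below 2^64."""
    return delay_scale * dirty_data_max < 2**64
```

A pool that sustains roughly 2,000 write operations per second maps to the default of 500,000.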
zfs_dirty_data_max
is the ZFS write throttle dirty space limit.
Once this limit is exceeded, new writes are delayed until space is freed by
writes being committed to the pool.
zfs_dirty_data_max takes precedence over zfs_dirty_data_max_percent.
zfs_dirty_data_max | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | bytes |
Range | 1 to zfs_dirty_data_max_max |
Default | 10% of physical RAM |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_percent
is an alternative method of specifying
zfs_dirty_data_max, the ZFS write throttle dirty space limit.
Once this limit is exceeded, new writes are delayed until space is freed by
writes being committed to the pool.
zfs_dirty_data_max takes precedence over zfs_dirty_data_max_percent
.
zfs_dirty_data_max_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 10% of physical RAM |
Change | Prior to zfs module load or a memory hot plug event |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_max
is the maximum allowable value of
zfs_dirty_data_max.
zfs_dirty_data_max_max
takes precedence over zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | bytes |
Range | 1 to physical RAM size |
Default | 25% of physical RAM |
Change | Prior to zfs module load |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_max_percent
is an alternative to zfs_dirty_data_max_max
for setting the maximum allowable value of zfs_dirty_data_max.
zfs_dirty_data_max_max takes precedence over zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 25% of physical RAM |
Change | Prior to zfs module load |
Versions Affected | v0.6.4 and later |
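The precedence rules among the four dirty data limits can be sketched as follows. This is a simplified model: the function name is hypothetical, and using 0 to mean "not set" for the byte-valued parameters is an assumption of the sketch.

```python
def effective_dirty_data_max(physmem,
                             dirty_data_max=0, dirty_data_max_percent=10,
                             dirty_data_max_max=0, dirty_data_max_max_percent=25):
    """Resolve the dirty data limit in bytes from the four tunables."""
    # Byte values take precedence over their percentage alternatives.
    ceiling = dirty_data_max_max or physmem * dirty_data_max_max_percent // 100
    limit = dirty_data_max or physmem * dirty_data_max_percent // 100
    # zfs_dirty_data_max cannot exceed zfs_dirty_data_max_max.
    return min(limit, ceiling)
```

On a 16 GiB system with defaults this gives about 1.6 GiB, and a request for 8 GiB would be capped at the 4 GiB ceiling.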
When there is at least zfs_dirty_data_sync
dirty data, a transaction group
sync is started. This allows a transaction group sync to occur more frequently
than the transaction group timeout interval (see zfs_txg_timeout)
when there is dirty data to be written.
zfs_dirty_data_sync | Notes |
---|---|
Tags | write_throttle |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 1 to ULONG_MAX |
Default | 67,108,864 (64 MiB) |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
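The relationship between zfs_dirty_data_sync and zfs_txg_timeout can be summarized in a small predicate. The function name is hypothetical and the sketch covers only these two triggers; other conditions that can start a sync are not modeled.

```python
def should_sync_txg(dirty_bytes, seconds_open,
                    dirty_data_sync=67108864, txg_timeout=5):
    """True when either the dirty data or the timeout trigger is reached."""
    return dirty_bytes >= dirty_data_sync or seconds_open >= txg_timeout
```

With defaults, 64 MiB of dirty data starts a sync after one second, while 1 KiB of dirty data waits for the 5 second timeout.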
Fletcher-4 is the default checksum algorithm for metadata and data.
When the zfs kernel module is loaded, a set of microbenchmarks are run to
determine the fastest algorithm for the current hardware. The
zfs_fletcher_4_impl
parameter allows a specific implementation to be
specified other than the default (fastest).
Selectors other than fastest and scalar require instruction
set extensions to be available and will only appear if ZFS detects their
presence. The scalar implementation works on all processors.
The results of the microbenchmark are visible in the
/proc/spl/kstat/zfs/fletcher_4_bench
file.
Larger numbers indicate better performance.
Since ZFS is processor endian-independent, the microbenchmark is run
against both big and little-endian transformation.
zfs_fletcher_4_impl | Notes |
---|---|
Tags | CPU, checksum |
When to change | Testing Fletcher-4 algorithms |
Data Type | string |
Range | fastest, scalar, superscalar, superscalar4, sse2, ssse3, avx2, avx512f, or aarch64_neon depending on hardware support |
Default | fastest |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The processing of the free_bpobj object can be enabled by
zfs_free_bpobj_enabled.
zfs_free_bpobj_enabled | Notes |
---|---|
Tags | delete |
When to change | If there's a problem with processing free_bpobj (e.g. I/O error or bug) |
Data Type | boolean |
Range | 0=do not process free_bpobj objects, 1=process free_bpobj objects |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_free_max_blocks
sets the maximum number of blocks to be freed in a single
transaction group (txg). For workloads that delete (free) large numbers of
blocks in a short period of time, the processing of the frees can negatively
impact other operations, including txg commits. zfs_free_max_blocks
acts as a
limit to reduce the impact.
zfs_free_max_blocks | Notes |
---|---|
Tags | filesystem, delete |
When to change | For workloads that delete large files, zfs_free_max_blocks can be adjusted to meet performance requirements while reducing the impacts of deletion |
Data Type | ulong |
Units | blocks |
Range | 1 to ULONG_MAX |
Default | 100,000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
Maximum asynchronous read I/Os active to each device.
zfs_vdev_async_read_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 3 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Minimum asynchronous read I/Os active to each device.
zfs_vdev_async_read_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to (zfs_vdev_async_read_max_active - 1) |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
When the amount of dirty data exceeds
zfs_vdev_async_write_active_max_dirty_percent
percent of zfs_dirty_data_max,
zfs_vdev_async_write_max_active is used to
limit active async writes.
If the dirty data is between
zfs_vdev_async_write_active_min_dirty_percent
and zfs_vdev_async_write_active_max_dirty_percent,
the active I/O limit is
linearly interpolated between zfs_vdev_async_write_min_active
and zfs_vdev_async_write_max_active.
zfs_vdev_async_write_active_max_dirty_percent | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | percent of zfs_dirty_data_max |
Range | 0 to 100 |
Default | 60 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
If the amount of dirty data is between
zfs_vdev_async_write_active_min_dirty_percent
and zfs_vdev_async_write_active_max_dirty_percent
of zfs_dirty_data_max,
the active I/O limit is linearly interpolated between
zfs_vdev_async_write_min_active and
zfs_vdev_async_write_max_active
zfs_vdev_async_write_active_min_dirty_percent | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | percent of zfs_dirty_data_max |
Range | 0 to (zfs_vdev_async_write_active_max_dirty_percent - 1) |
Default | 30 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
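The linear interpolation between the two dirty data thresholds can be sketched as below. The function name is hypothetical, integer arithmetic is assumed, and using zfs_vdev_async_write_min_active below the lower threshold is an assumption rather than something stated above.

```python
def async_write_active_limit(dirty, dirty_data_max,
                             min_active=2, max_active=10,
                             min_pct=30, max_pct=60):
    """Active async write limit for a given amount of dirty data."""
    min_bytes = dirty_data_max * min_pct // 100
    max_bytes = dirty_data_max * max_pct // 100
    if dirty <= min_bytes:
        return min_active  # assumption: floor below the lower threshold
    if dirty >= max_bytes:
        return max_active
    # Linear interpolation between the two thresholds.
    return min_active + (dirty - min_bytes) * (max_active - min_active) // (max_bytes - min_bytes)
```

Halfway between the 30% and 60% thresholds, the limit lands halfway between 2 and 10, which is 6.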
zfs_vdev_async_write_max_active
sets the maximum asynchronous
write I/Os active to each device.
zfs_vdev_async_write_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_async_write_min_active
sets the minimum asynchronous write I/Os active to each device.
Lower values are associated with better latency on rotational media but poorer resilver performance. The default value of 2 was chosen as a compromise. A value of 3 has been shown to improve resilver performance further at a cost of further increasing latency.
zfs_vdev_async_write_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_async_write_max_active |
Default | 1 for v0.6.x, 2 for v0.7.0 and later |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
The maximum number of I/Os active to each device. Ideally,
zfs_vdev_max_active
>= the sum of each queue's max_active.
Once queued to the device, the ZFS I/O scheduler is no longer able to prioritize I/O operations. The underlying device drivers have their own scheduler and queue depth limits. Values larger than the device's maximum queue depth can have the effect of increased latency as the I/Os are queued in the intervening device driver layers.
zfs_vdev_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | sum of each queue's min_active to UINT32_MAX |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
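The "sum of each queue's max_active" guideline can be checked with a few lines. The dictionary below lists only the queues documented on this page with their default values; the helper name is hypothetical and other I/O classes may exist.

```python
# Default max_active values for the queues documented on this page.
QUEUE_MAX_ACTIVE = {
    "zfs_vdev_sync_read_max_active": 10,
    "zfs_vdev_sync_write_max_active": 10,
    "zfs_vdev_async_read_max_active": 3,
    "zfs_vdev_async_write_max_active": 10,
    "zfs_vdev_scrub_max_active": 2,
}


def max_active_headroom(vdev_max_active=1000, queues=QUEUE_MAX_ACTIVE):
    """Headroom left after summing each queue's max_active (negative if too small)."""
    return vdev_max_active - sum(queues.values())
```

A negative result means zfs_vdev_max_active is smaller than the combined per-queue maximums.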
zfs_vdev_scrub_max_active
sets the maximum scrub or scan
read I/Os active to each device.
zfs_vdev_scrub_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler, scrub, resilver |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 2 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_scrub_min_active
sets the minimum scrub or scan read I/Os active
to each device.
zfs_vdev_scrub_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler, scrub, resilver |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_scrub_max_active |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum synchronous read I/Os active to each device.
zfs_vdev_sync_read_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_read_min_active
sets the minimum synchronous read I/Os
active to each device.
zfs_vdev_sync_read_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_sync_read_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_write_max_active
sets the maximum synchronous write I/Os active
to each device.
zfs_vdev_sync_write_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_write_min_active
sets the minimum synchronous write I/Os
active to each device.
zfs_vdev_sync_write_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_sync_write_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum number of queued allocations per top-level vdev expressed as
a percentage of zfs_vdev_async_write_max_active.
This allows the system to detect devices that are more capable of handling allocations
and to allocate more blocks to those devices. It also allows for dynamic
allocation distribution when devices are imbalanced as fuller devices
will tend to be slower than empty devices. Once the queue depth
reaches (zfs_vdev_queue_depth_pct
* zfs_vdev_async_write_max_active / 100)
then the allocator will stop allocating blocks on that top-level device and
switch to the next.
See also zio_dva_throttle_enabled
zfs_vdev_queue_depth_pct | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to UINT32_MAX |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
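The queue depth threshold in the formula above is easy to evaluate directly. A minimal sketch, with a hypothetical function name and integer division assumed:

```python
def allocation_queue_depth_limit(queue_depth_pct=1000, async_write_max_active=10):
    """Queued allocations allowed per top-level vdev before switching to the next."""
    return queue_depth_pct * async_write_max_active // 100
```

With the defaults (1,000 percent of 10 active async writes), the allocator would switch devices once 100 allocations are queued.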
Disable duplicate buffer eviction from ARC.
zfs_disable_dup_eviction | Notes |
---|---|
Tags | ARC, dedup |
When to change | TBD |
Data Type | boolean |
Range | 0=duplicate buffers can be evicted, 1=do not evict duplicate buffers |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5, deprecated in v0.7.0 |
Snapshots of filesystems are normally automounted under the filesystem's
.zfs/snapshot
subdirectory. When not in use, snapshots are unmounted
after zfs_expire_snapshot seconds.
zfs_expire_snapshot | Notes |
---|---|
Tags | filesystem, snapshot |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 0 disables automatic unmounting, maximum time is INT_MAX |
Default | 300 |
Change | Dynamic |
Versions Affected | v0.6.1 and later |
Allow the creation, removal, or renaming of entries in the .zfs/snapshot
subdirectory to cause the creation, destruction, or renaming of snapshots.
When enabled this functionality works both locally and over NFS exports
which have the "no_root_squash" option set.
zfs_admin_snapshot | Notes |
---|---|
Tags | filesystem, snapshot |
When to change | TBD |
Data Type | boolean |
Range | 0=do not allow snapshot manipulation via the filesystem, 1=allow snapshot manipulation via the filesystem |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Set additional debugging flags (see zfs_dbgmsg_enable)
flag value | symbolic name | description |
---|---|---|
0x1 | ZFS_DEBUG_DPRINTF | Enable dprintf entries in the debug log |
0x2 | ZFS_DEBUG_DBUF_VERIFY | Enable extra dbuf verifications |
0x4 | ZFS_DEBUG_DNODE_VERIFY | Enable extra dnode verifications |
0x8 | ZFS_DEBUG_SNAPNAMES | Enable snapshot name verification |
0x10 | ZFS_DEBUG_MODIFY | Check for illegally modified ARC buffers |
0x20 | ZFS_DEBUG_SPA | Enable spa_dbgmsg entries in the debug log |
0x40 | ZFS_DEBUG_ZIO_FREE | Enable verification of block frees |
0x80 | ZFS_DEBUG_HISTOGRAM_VERIFY | Enable extra spacemap histogram verifications |
0x100 | ZFS_DEBUG_METASLAB_VERIFY | Verify space accounting on disk matches in-core range_trees |
0x200 | ZFS_DEBUG_SET_ERROR | Enable SET_ERROR and dprintf entries in the debug log |
zfs_flags | Notes |
---|---|
Tags | debug |
When to change | When debugging ZFS |
Data Type | int |
Default | 0 no debug flags set, for debug builds: all except ZFS_DEBUG_DPRINTF and ZFS_DEBUG_SPA |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
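Since zfs_flags is a bitmask, several flags are combined by OR-ing their values. The sketch below uses the flag values from the table above; the helper names are hypothetical.

```python
ZFS_DEBUG_FLAGS = {
    "ZFS_DEBUG_DPRINTF": 0x1,
    "ZFS_DEBUG_DBUF_VERIFY": 0x2,
    "ZFS_DEBUG_DNODE_VERIFY": 0x4,
    "ZFS_DEBUG_SNAPNAMES": 0x8,
    "ZFS_DEBUG_MODIFY": 0x10,
    "ZFS_DEBUG_SPA": 0x20,
    "ZFS_DEBUG_ZIO_FREE": 0x40,
    "ZFS_DEBUG_HISTOGRAM_VERIFY": 0x80,
    "ZFS_DEBUG_METASLAB_VERIFY": 0x100,
    "ZFS_DEBUG_SET_ERROR": 0x200,
}


def encode_flags(names):
    """OR together the values of the named flags."""
    value = 0
    for name in names:
        value |= ZFS_DEBUG_FLAGS[name]
    return value


def decode_flags(value):
    """List the names of the flags set in value, in table order."""
    return [name for name, bit in ZFS_DEBUG_FLAGS.items() if value & bit]
```

For example, enabling both ZFS_DEBUG_DPRINTF and ZFS_DEBUG_SPA corresponds to the value 0x21 (33).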
If destroy encounters an I/O error (EIO) while reading metadata (eg indirect
blocks), space referenced by the missing metadata cannot be freed.
Normally, this causes the background destroy to become "stalled", as the
destroy is unable to make forward progress. While in this stalled state,
all remaining space to free from the error-encountering filesystem is
temporarily leaked. Set zfs_free_leak_on_eio = 1
to ignore the EIO,
permanently leak the space from indirect blocks that can not be read,
and continue to free everything else that it can.
The default, stalling behavior is useful if the storage partially fails (e.g. some but not all I/Os fail), and then later recovers. In this case, we will be able to continue pool operations while it is partially failed, and when it recovers, we can continue to free the space, with no leaks. However, note that this case is rare.
Typically pools either:
1. fail completely (but perhaps temporarily, e.g. a top-level vdev going offline)
2. have localized, permanent errors (e.g. disk returns the wrong data due to bit flip or firmware bug)
In case (1), the zfs_free_leak_on_eio
setting does not matter because the
pool will be suspended and the sync thread will not be able to make
forward progress. In case (2), because the error is
permanent, the best we can do is leak the minimum amount of space.
Therefore, it is reasonable for zfs_free_leak_on_eio
to be set, but by default
the more conservative approach is taken, so that there is no
possibility of leaking space in the "partial temporary" failure case.
zfs_free_leak_on_eio | Notes |
---|---|
Tags | debug |
When to change | When debugging I/O errors during destroy |
Data Type | boolean |
Range | 0=normal behavior, 1=ignore error and permanently leak space |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
During a zfs destroy
operation using feature@async_destroy
a
minimum of zfs_free_min_time_ms
time will be spent working on freeing blocks
per txg commit.
zfs_free_min_time_ms | Notes |
---|---|
Tags | delete |
When to change | TBD |
Data Type | int |
Units | milliseconds |
Range | 1 to (zfs_txg_timeout * 1000) |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
If a pool does not have a log device, data blocks equal to or larger than
zfs_immediate_write_sz
are treated as if the dataset being written to had
the property setting logbias=throughput.
Terminology note: logbias=throughput
writes the blocks in "indirect mode"
to the ZIL where the data is written to the pool and a pointer to the data
is written to the ZIL.
zfs_immediate_write_sz | Notes |
---|---|
Tags | ZIL |
When to change | TBD |
Data Type | long |
Units | bytes |
Range | 512 to 16,777,216 (valid block sizes) |
Default | 32,768 (32 KiB) |
Change | Dynamic |
Verification | Data blocks that exceed zfs_immediate_write_sz or are written as logbias=throughput increment the zil_itx_indirect_count entry in /proc/spl/kstat/zfs/zil |
Versions Affected | all |
ZFS supports logical record (block) sizes from 512 bytes to 16 MiB.
The benefits of larger blocks, and thus larger average I/O sizes, can be
weighed against the cost of copy-on-write of a large block to modify one byte.
Additionally, very large blocks can have a negative impact on both I/O latency
at the device level and the memory allocator. The zfs_max_recordsize
parameter limits the upper bound of the dataset volblocksize and recordsize
properties.
Larger blocks can be created by enabling the zpool
large_blocks
feature and
changing zfs_max_recordsize
. Pools with larger blocks can always be
imported and used, regardless of the value of zfs_max_recordsize
.
For 32-bit systems, zfs_max_recordsize
also limits the size of kernel virtual
memory caches used in the ZFS I/O pipeline (zio_buf_*
and zio_data_buf_*
).
See also the zpool
large_blocks
feature.
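The size limits above can be sketched as a simple check. This is a minimal illustration, not the kernel's validation code: the function name and constants are hypothetical, it assumes the usual rule that recordsize and volblocksize must be powers of two, and it does not model whether the large_blocks feature is enabled on the pool.

```python
MIN_BLOCK_SIZE = 512                # smallest logical block size, in bytes
MAX_BLOCK_SIZE = 16 * 1024 * 1024   # 16 MiB, largest logical block size

def is_valid_recordsize(size, zfs_max_recordsize=1_048_576):
    """Return True if size is a power of two within the permitted range.

    Hypothetical helper: the upper bound is the smaller of
    zfs_max_recordsize and the 16 MiB maximum block size.
    """
    upper = min(zfs_max_recordsize, MAX_BLOCK_SIZE)
    if size < MIN_BLOCK_SIZE or size > upper:
        return False
    return size & (size - 1) == 0   # power-of-two test
```

With the default limit, a 2 MiB recordsize is rejected; after raising zfs_max_recordsize to 16,777,216 the same request passes this check.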
zfs_max_recordsize | Notes |
---|---|
Tags | filesystem, memory, volume |
When to change | To create datasets with larger volblocksize or recordsize |
Data Type | int |
Units | bytes |
Range | 512 to 16,777,216 (valid block sizes) |
Default | 1,048,576 |
Change | Dynamic, set prior to creating volumes or changing filesystem recordsize |
Versions Affected | v0.6.5 and later |
zfs_mdcomp_disable
allows metadata compression to be disabled.
zfs_mdcomp_disable | Notes |
---|---|
Tags | CPU, metadata |
When to change | When CPU cycles cost more than I/O |
Data Type | boolean |
Range | 0=compress metadata, 1=do not compress metadata |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
Allow metaslabs to keep their active state as long as their fragmentation
percentage is less than or equal to this value. When writing, an active
metaslab whose fragmentation percentage exceeds
zfs_metaslab_fragmentation_threshold
is avoided, allowing metaslabs with less
fragmentation to be preferred.
Metaslab fragmentation is used to calculate the overall pool fragmentation
property value. However, individual metaslab fragmentation levels are
observable using zdb
with the -mm
option.
zfs_metaslab_fragmentation_threshold
works at the metaslab level and each
top-level vdev has approximately metaslabs_per_vdev metaslabs.
See also zfs_mg_fragmentation_threshold
zfs_metaslab_fragmentation_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | Testing metaslab allocation |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 70 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Metaslab groups (top-level vdevs) are considered eligible for allocations
if their fragmentation percentage metric is less than or equal to
zfs_mg_fragmentation_threshold
. If a metaslab group exceeds this threshold
then it will be skipped unless all metaslab groups within the metaslab class
have also crossed the threshold.
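The eligibility rule described above can be sketched as follows. This is an illustration of the prose only; the function name and the dictionary input are hypothetical and do not reflect the allocator's internal data structures.

```python
def eligible_groups(fragmentation, zfs_mg_fragmentation_threshold=85):
    """Return the metaslab groups eligible for allocation.

    fragmentation: dict mapping a group name to its fragmentation percent
    (a hypothetical input format chosen for this sketch).
    """
    ok = [name for name, frag in fragmentation.items()
          if frag <= zfs_mg_fragmentation_threshold]
    # If every group has crossed the threshold, none of them is skipped.
    return ok if ok else list(fragmentation)
```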
zfs_mg_fragmentation_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | Testing metaslab allocation |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 85 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Metaslab groups (top-level vdevs) with free space percentage greater than
zfs_mg_noalloc_threshold
are eligible for new allocations.
If a metaslab group's free space is less than or equal to the
threshold, the allocator avoids allocating to that group
unless all groups in the pool have reached the threshold. Once all
metaslab groups have reached the threshold, all metaslab groups are allowed
to accept allocations. The default value of 0 disables the feature and causes
all metaslab groups to be eligible for allocations.
This parameter allows one to deal with pools having heavily imbalanced
vdevs such as would be the case when a new vdev has been added.
Setting the threshold to a non-zero percentage will stop allocations
from being made to vdevs whose free space is at or below the specified percentage
and allow less-filled vdevs to acquire more allocations than they
otherwise would under the older zfs_mg_alloc_failures
facility.
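A minimal sketch of the free-space rule, assuming a hypothetical dictionary of per-group free space percentages. It follows the description above and is not taken from the allocator source.

```python
def allocatable_groups(free_pct, zfs_mg_noalloc_threshold=0):
    """Return the metaslab groups that accept new allocations.

    free_pct: dict mapping a group name to its free space percent
    (hypothetical input format for illustration).
    """
    if zfs_mg_noalloc_threshold == 0:
        return list(free_pct)          # feature disabled: all groups eligible
    ok = [name for name, free in free_pct.items()
          if free > zfs_mg_noalloc_threshold]
    # Once all groups have reached the threshold, all may allocate again.
    return ok if ok else list(free_pct)
```

For example, with a threshold of 10, a nearly full original vdev with 5% free is avoided while a newly added vdev with 95% free receives the allocations.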
zfs_mg_noalloc_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | To force rebalancing as top-level vdevs are added or expanded |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 0 (disabled) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The pool multihost
multimodifier protection (MMP) subsystem can record
historical updates in the /proc/spl/kstat/zfs/POOL_NAME/multihost
file
for debugging purposes.
The number of lines of history is determined by zfs_multihost_history.
zfs_multihost_history | Notes |
---|---|
Tags | MMP, import |
When to change | When testing multihost feature |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_interval
controls the frequency of multihost writes performed
by the pool multihost multimodifier protection (MMP) subsystem.
The multihost write period is (zfs_multihost_interval
/ number of leaf-vdevs)
milliseconds.
Thus on average a multihost write will be issued for each leaf vdev every
zfs_multihost_interval
milliseconds. In practice, the observed period can
vary with the I/O load and this observed value is the delay which is stored in
the uberblock.
On import the multihost activity check waits a minimum amount of time
determined by (zfs_multihost_interval
* zfs_multihost_import_intervals)
with a lower bound of 1 second.
The activity check time may be further extended if the value of mmp delay
found in the best uberblock indicates actual multihost updates happened at
longer intervals than zfs_multihost_interval.
Note: the multihost protection feature applies to storage devices that can be shared between multiple systems.
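The timing arithmetic above can be worked through with a short sketch. The function names are hypothetical, and the extension based on the mmp delay stored in the uberblock is not modeled.

```python
def mmp_write_period_ms(zfs_multihost_interval, leaf_vdevs):
    """Time between successive multihost writes across the pool, in ms."""
    return zfs_multihost_interval / leaf_vdevs

def min_activity_check_ms(zfs_multihost_interval,
                          zfs_multihost_import_intervals):
    """Minimum activity check duration on import, with a 1 second floor."""
    return max(zfs_multihost_interval * zfs_multihost_import_intervals, 1000)
```

With the defaults (interval 1000, import intervals 10) and four leaf vdevs, a multihost write is issued every 250 ms somewhere in the pool and the activity check waits at least 10 seconds.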
zfs_multihost_interval | Notes |
---|---|
Tags | MMP, import, vdev |
When to change | To optimize pool import time against possibility of simultaneous import by another system |
Data Type | ulong |
Units | milliseconds |
Range | 100 to ULONG_MAX |
Default | 1000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_import_intervals
controls the duration of the activity test on
pool import for the multihost multimodifier protection (MMP) subsystem.
The activity test can be expected to take a minimum time of
(zfs_multihost_import_intervals
* zfs_multihost_interval * random(25%))
milliseconds. The random period of up to 25% improves simultaneous import
detection. For example, if two hosts are rebooted at the same time and
automatically attempt to import the pool, then it is highly probable that
one host will win.
Smaller values of zfs_multihost_import_intervals
reduce the
import time but increase the risk of failing to detect an active pool.
The total activity check time is never allowed to drop below one second.
Note: the multihost protection feature applies to storage devices that can be shared between multiple systems.
zfs_multihost_import_intervals | Notes |
---|---|
Tags | MMP, import |
When to change | TBD |
Data Type | uint |
Units | intervals |
Range | 1 to UINT_MAX |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_fail_intervals
controls the behavior of the pool when
write failures are detected in the multihost multimodifier protection (MMP)
subsystem.
If zfs_multihost_fail_intervals = 0
then multihost write failures are ignored.
The write failures are reported to the ZFS event daemon (zed
) which
can take action such as suspending the pool or offlining a device.
If zfs_multihost_fail_intervals > 0
then sequential multihost write failures
will cause the pool to be suspended. This occurs when
(zfs_multihost_fail_intervals
* zfs_multihost_interval)
milliseconds have passed since the last successful multihost write.
This guarantees the activity test will see multihost writes if
another system attempts to import the pool.
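The suspend condition can be sketched as below. This is a hedged illustration of the rule as described, with a hypothetical function name and an elapsed-time input that the real MMP thread tracks internally.

```python
def should_suspend(ms_since_last_success,
                   zfs_multihost_fail_intervals=5,
                   zfs_multihost_interval=1000):
    """Return True if the pool would be suspended for missed MMP writes."""
    if zfs_multihost_fail_intervals == 0:
        return False   # write failures are ignored (only reported to zed)
    limit_ms = zfs_multihost_fail_intervals * zfs_multihost_interval
    return ms_since_last_success >= limit_ms
```

With the defaults, the pool is suspended once 5 seconds pass without a successful multihost write.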
zfs_multihost_fail_intervals | Notes |
---|---|
Tags | MMP, import |
When to change | TBD |
Data Type | uint |
Units | intervals |
Range | 0 to UINT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
overwhelmed by high rates of error reports which can be generated by failing,
high-performance devices. zfs_delays_per_second
limits the rate of
delay events reported to zed.
zfs_delays_per_second | Notes |
---|---|
Tags | zed, delay |
When to change | If processing delay events at a higher rate is desired |
Data Type | uint |
Units | events per second |
Range | 0 to UINT_MAX |
Default | 20 |
Change | Dynamic |
Versions Affected | v0.7.7 and later |
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
overwhelmed by high rates of error reports which can be generated by failing,
high-performance devices. zfs_checksums_per_second
limits the rate of
checksum events reported to zed.
Note: do not set this value lower than the SERD limit for checksum
in zed.
By default, checksum_N
= 10 and checksum_T
= 10 minutes, resulting in a
practical lower limit of 1.
zfs_checksums_per_second | Notes |
---|---|
Tags | zed, checksum |
When to change | If processing checksum error events at a higher rate is desired |
Data Type | uint |
Units | events per second |
Range | 0 to UINT_MAX |
Default | 20 |
Change | Dynamic |
Versions Affected | v0.7.7 and later |
When zfs_no_scrub_io = 1
scrubs do not actually scrub data and
simply do a metadata crawl of the pool instead.
zfs_no_scrub_io | Notes |
---|---|
Tags | scrub |
When to change | Testing scrub feature |
Data Type | boolean |
Range | 0=perform scrub I/O, 1=do not perform scrub I/O |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
When zfs_no_scrub_prefetch = 1
, prefetch is disabled for scrub I/Os.
zfs_no_scrub_prefetch | Notes |
---|---|
Tags | prefetch, scrub |
When to change | Testing scrub feature |
Data Type | boolean |
Range | 0=prefetch scrub I/Os, 1=do not prefetch scrub I/Os |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
ZFS uses barriers (volatile cache flush commands) to ensure data is committed to permanent media by devices. This ensures consistent on-media state for devices where caches are volatile (eg HDDs).
For devices with nonvolatile caches, the cache flush operation can be a no-op. However, in some RAID arrays, cache flushes can cause the entire cache to be flushed to the backing devices.
To ensure on-media consistency, keep cache flush enabled.
zfs_nocacheflush | Notes |
---|---|
Tags | disks |
When to change | If the storage device has nonvolatile cache, then disabling cache flush can save the cost of occasional cache flush commands. |
Data Type | boolean |
Range | 0=send cache flush commands, 1=do not send cache flush commands |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
The NOP-write feature is enabled by default when a cryptographically secure
checksum algorithm is in use by the dataset. zfs_nopwrite_enabled
allows the
NOP-write feature to be completely disabled.
zfs_nopwrite_enabled | Notes |
---|---|
Tags | checksum, debug |
When to change | TBD |
Data Type | boolean |
Range | 0=disable NOP-write feature, 1=enable NOP-write feature |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
zfs_dmu_offset_next_sync
enables forcing txg sync to find holes.
This causes ZFS to act like older versions when SEEK_HOLE
or SEEK_DATA
flags
are used: a dirty dnode causes txgs to be synced so that the previous data
can be found.
zfs_dmu_offset_next_sync | Notes |
---|---|
Tags | DMU |
When to change | TBD |
Data Type | boolean |
Range | 0=do not force txg sync to find holes, 1=force txg sync to find holes |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_pd_bytes_max
limits the number of bytes prefetched during a pool traversal
(eg zfs send
or other data crawling operations). These prefetches are
referred to as "prescient prefetches" and always have a 100% hit rate.
The traversal operations do not use the default data or metadata prefetcher.
zfs_pd_bytes_max | Notes |
---|---|
Tags | prefetch, send |
When to change | TBD |
Data Type | int32 |
Units | bytes |
Range | 0 to INT32_MAX |
Default | 52,428,800 (50 MiB) |
Change | Dynamic |
Versions Affected | TBD |
zfs_per_txg_dirty_frees_percent
as a percentage of zfs_dirty_data_max
controls the percentage of dirtied blocks from frees in one txg.
After the threshold is crossed, additional dirty blocks from frees
wait until the next txg.
Thus, when deleting large files, filling consecutive txgs with deletes/frees
does not throttle other, perhaps more important, writes.
A side effect of this throttle can impact zfs receive
workloads that contain a
large number of frees when the ignore_hole_birth optimization is
disabled. The symptom is that the receive workload causes an increase
in the frequency of txg commits. Since txg commits also flush data from volatile
caches in HDDs to media, HDD performance can be negatively impacted. Also, since
the frees do not consume much bandwidth over the pipe, the pipe can appear to stall.
Thus the overall progress of receives is slower than expected.
A value of zero will disable this throttle.
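As a worked example of the percentage, the per-txg limit on dirty data from frees can be computed as below. The function name is hypothetical and the value used for zfs_dirty_data_max in the usage note is an assumption chosen only for illustration.

```python
def max_dirty_frees_bytes(zfs_dirty_data_max,
                          zfs_per_txg_dirty_frees_percent=30):
    """Bytes of dirty data from frees allowed in one txg.

    Returns None when the throttle is disabled (percent set to zero).
    """
    if zfs_per_txg_dirty_frees_percent == 0:
        return None
    return zfs_dirty_data_max * zfs_per_txg_dirty_frees_percent // 100
```

Assuming zfs_dirty_data_max is 4 GiB, the default of 30 allows roughly 1.2 GiB of dirty data from frees per txg before further frees wait for the next txg.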
zfs_per_txg_dirty_frees_percent | Notes |
---|---|
Tags | delete |
When to change | For zfs receive workloads, consider increasing or disabling. See section "ZFS I/O SCHEDULER" |
Data Type | ulong |
Units | percent |
Range | 0 to 100 |
Default | 30 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_prefetch_disable
controls the predictive prefetcher.
Note that it leaves "prescient" prefetch (eg prefetch for zfs send
) intact
(see zfs_pd_bytes_max)
zfs_prefetch_disable | Notes |
---|---|
Tags | prefetch |
When to change | In some cases where the workload is completely random reads, overall performance can be better if prefetch is disabled |
Data Type | boolean |
Range | 0=prefetch enabled, 1=prefetch disabled |
Default | 0 |
Change | Dynamic |
Verification | prefetch efficacy is observed by arcstat, arc_summary, and the relevant entries in /proc/spl/kstat/zfs/arcstats |
Versions Affected | all |
zfs_read_chunk_size
is the limit for ZFS filesystem reads. If an application
issues a read()
larger than zfs_read_chunk_size
, then the read()
is divided
into multiple operations no larger than zfs_read_chunk_size
zfs_read_chunk_size | Notes |
---|---|
Tags | filesystem |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 1,048,576 |
Change | Dynamic |
Versions Affected | all |
Historical statistics for the last zfs_read_history
reads are available in
/proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history | Notes |
---|---|
Tags | debug |
When to change | To observe read operation details |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
When zfs_read_history > 0
, zfs_read_history_hits controls whether ARC hits are
displayed in the read history file, /proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history_hits | Notes |
---|---|
Tags | debug |
When to change | To observe read operation details with ARC hits |
Data Type | boolean |
Range | 0=do not include data for ARC hits, 1=include ARC hit data |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
zfs_recover
can be set to true (1) to attempt to recover from
otherwise-fatal errors, typically caused by on-disk corruption.
When set, calls to zfs_panic_recover()
will turn into warning messages
rather than calling panic()
zfs_recover | Notes |
---|---|
Tags | import |
When to change | zfs_recover should only be used as a last resort, as it typically results in leaked space, or worse |
Data Type | boolean |
Range | 0=normal operation, 1=attempt recovery zpool import |
Default | 0 |
Change | Dynamic |
Verification | check output of dmesg and other logs for details |
Versions Affected | v0.6.4 and later |
Resilvers are processed by the sync thread in syncing context. While
resilvering, ZFS spends at least zfs_resilver_min_time_ms
time working on a
resilver between txg commits.
See also zfs_txg_timeout.
zfs_resilver_min_time_ms | Notes |
---|---|
Tags | resilver |
When to change | In some resilvering cases, increasing zfs_resilver_min_time_ms can result in faster completion |
Data Type | int |
Units | milliseconds |
Range | 1 to zfs_txg_timeout converted to milliseconds |
Default | 3,000 |
Change | Dynamic |
Versions Affected | all |
Scrubs are processed by the sync thread in syncing context. While
scrubbing, ZFS spends at least zfs_scrub_min_time_ms
time working on a
scrub between txg commits.
See also zfs_txg_timeout.
zfs_scrub_min_time_ms | Notes |
---|---|
Tags | scrub |
When to change | In some scrub cases, increasing zfs_scrub_min_time_ms can result in faster completion |
Data Type | int |
Units | milliseconds |
Range | 1 to zfs_txg_timeout converted to milliseconds |
Default | 1,000 |
Change | Dynamic |
Versions Affected | all |
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verification I/Os to disk
every zfs_scan_checkpoint_intval
seconds.
zfs_scan_checkpoint_intval | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 1 to INT_MAX |
Default | 7,200 (2 hours) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
This tunable affects how scrub and resilver I/O segments are ordered. A higher number indicates that we care more about how filled in a segment is, while a lower number indicates we care more about the size of the extent without considering the gaps within a segment.
zfs_scan_fill_weight | Notes |
---|---|
Tags | resilver, scrub |
When to change | Testing sequential scrub and resilver |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 3 |
Change | Prior to zfs module load |
Versions Affected | v0.8.0 and later |
zfs_scan_issue_strategy
controls the order of data verification while scrubbing or
resilvering.
value | description |
---|---|
0 | ZFS will use strategy 1 during normal verification and strategy 2 while taking a checkpoint |
1 | data is verified as sequentially as possible, given the amount of memory reserved for scrubbing (see zfs_scan_mem_lim_fact). This can improve scrub performance if the pool's data is heavily fragmented. |
2 | the largest mostly-contiguous chunk of found data is verified first. By deferring scrubbing of small segments, we may later find adjacent data to coalesce and increase the segment size. |
zfs_scan_issue_strategy | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | enum |
Range | 0 to 2 |
Default | 0 |
Change | Dynamic |
Versions Affected | TBD |
Setting zfs_scan_legacy = 1
enables the legacy scan and scrub behavior
instead of the newer sequential behavior.
zfs_scan_legacy | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | TBD |
Units | TBD |
Range | 0=use new method: scrubs and resilvers will gather metadata in memory before issuing sequential I/O, 1=use legacy algorithm, where I/O is initiated as soon as it is discovered |
Default | 0 |
Change | Dynamic, however changing to 0 does not affect in-progress scrubs or resilvers |
Versions Affected | v0.8.0 and later |
zfs_scan_max_ext_gap
limits the largest gap in bytes between scrub and
resilver I/Os that will still be considered sequential for sorting purposes.
zfs_scan_max_ext_gap | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 2,097,152 (2 MiB) |
Change | Dynamic, however changing to 0 does not affect in-progress scrubs or resilvers |
Versions Affected | v0.8.0 and later |
zfs_scan_mem_lim_fact
limits the maximum fraction of RAM used for I/O sorting
by the sequential scan algorithm.
When the limit is reached, scanning metadata is stopped and
data verification I/O is started.
Data verification I/O continues until the memory used by the sorting
algorithm drops below zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact | Notes |
---|---|
Tags | memory, resilver, scrub |
When to change | TBD |
Data Type | int |
Units | divisor of physical RAM |
Range | TBD |
Default | 20 (physical RAM / 20 or 5%) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_scan_mem_lim_soft_fact
sets the fraction of the hard limit,
zfs_scan_mem_lim_fact, used to determine the RAM soft limit
for I/O sorting by the sequential scan algorithm.
After zfs_scan_mem_lim_fact has been reached, metadata scanning is stopped
until the RAM usage drops below zfs_scan_mem_lim_soft_fact
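The two divisors can be worked through as below. This sketch follows the figures given in the tables on this page (hard limit = RAM / zfs_scan_mem_lim_fact, soft limit = hard limit / zfs_scan_mem_lim_soft_fact). The function name is hypothetical, and the kernel implementation may apply additional minimums or compute the soft limit differently.

```python
def scan_memory_limits(physical_ram_bytes,
                       zfs_scan_mem_lim_fact=20,
                       zfs_scan_mem_lim_soft_fact=20):
    """Return (hard_limit, soft_limit) in bytes for scan I/O sorting."""
    hard = physical_ram_bytes // zfs_scan_mem_lim_fact
    soft = hard // zfs_scan_mem_lim_soft_fact
    return hard, soft
```

For a system with 16,000,000,000 bytes of RAM and default settings, this gives a hard limit of 800,000,000 bytes (5%) and a soft limit of 40,000,000 bytes (0.25%).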
zfs_scan_mem_lim_soft_fact | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | int |
Units | divisor of (physical RAM / zfs_scan_mem_lim_fact) |
Range | 1 to INT_MAX |
Default | 20 (for default zfs_scan_mem_lim_fact, 0.25% of physical RAM) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_scan_vdev_limit
is the maximum amount of data that can be
issued concurrently for scrubs and resilvers per leaf vdev.
zfs_scan_vdev_limit
attempts to strike a balance between keeping the leaf
vdev queues full of I/Os while not overflowing the queues causing high latency
resulting in long txg sync times.
While zfs_scan_vdev_limit
represents a bandwidth limit, the existing I/O
limit of zfs_vdev_scrub_max_active remains in effect, too.
zfs_scan_vdev_limit | Notes |
---|---|
Tags | resilver, scrub, vdev |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 4,194,304 (4 MiB) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_send_corrupt_data
enables zfs send
to send corrupt data by
ignoring read and checksum errors. The corrupted or unreadable blocks are
replaced with the value 0x2f5baddb10c
(ZFS bad block)
zfs_send_corrupt_data | Notes |
---|---|
Tags | send |
When to change | When data corruption exists and an attempt to recover at least some data via zfs send is needed |
Data Type | boolean |
Range | 0=do not send corrupt data, 1=replace corrupt data with cookie |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_deferred_free
, frees are no longer processed and must wait
for the next SPA sync.
The zfs_sync_pass_deferred_free
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
The zfs_sync_pass_deferred_free
pass must be greater than 1 to ensure that
regular blocks are not deferred.
zfs_sync_pass_deferred_free | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_dont_compress
, data block compression is no longer
processed and must wait for the next SPA sync.
The zfs_sync_pass_dont_compress
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
zfs_sync_pass_dont_compress | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | all |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_rewrite
, blocks can be split into gang blocks.
The zfs_sync_pass_rewrite
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
zfs_sync_pass_rewrite | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
zfs_sync_taskq_batch_pct
controls the number of threads used by the
DSL pool sync taskq, dp_sync_taskq
zfs_sync_taskq_batch_pct | Notes |
---|---|
Tags | SPA |
When to change | To adjust the number of dp_sync_taskq threads |
Data Type | int |
Units | percent of number of online CPUs |
Range | 1 to 100 |
Default | 75 |
Change | Prior to zfs module load |
Versions Affected | v0.7.0 and later |
Historical statistics for the last zfs_txg_history
txg commits are available
in /proc/spl/kstat/zfs/POOL_NAME/txgs
The work required to measure the txg commit (SPA statistics) is low. However, for debugging purposes, it can be useful to observe the SPA statistics.
zfs_txg_history | Notes |
---|---|
Tags | debug |
When to change | To observe details of SPA sync behavior. |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 for version v0.6.0 to v0.7.6, 100 for version v0.8.0 |
Change | Dynamic |
Versions Affected | all |
The open txg is committed to the pool periodically (SPA sync) and
zfs_txg_timeout
represents the default target upper limit.
txg commits can occur more frequently and a rapid rate of txg commits often indicates a busy write workload, quota limits being reached, or critically low free space.
txg commits can also take longer than zfs_txg_timeout
if the ZFS write throttle
is not properly tuned or the time to sync is otherwise delayed (eg slow device)
See also zfs_dirty_data_sync and zfs_txg_history
zfs_txg_timeout | Notes |
---|---|
Tags | SPA, ZIO_scheduler |
When to change | To optimize the work done by txg commit relative to the pool requirements. See also section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | seconds |
Range | 1 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | all |
To reduce IOPs, small, adjacent I/Os can be aggregated (coalesced) into a
large I/O.
For reads, aggregations occur across small adjacency gaps.
For writes, aggregation can occur at the ZFS or disk level.
zfs_vdev_aggregation_limit
is the upper bound on the size of the larger,
aggregated I/O.
zfs_vdev_aggregation_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | If the workload does not benefit from aggregation, the zfs_vdev_aggregation_limit can be reduced to avoid aggregation attempts |
Data Type | int |
Units | bytes |
Range | 0 to 131,072 (default) or 16,777,216 (if zpool large_blocks feature is enabled) |
Default | 131,072 (128 KiB) |
Change | Dynamic |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some
cases actually harmful. Thus it is disabled by setting
zfs_vdev_cache_size = 0
zfs_vdev_cache_size
is the size of the vdev cache.
zfs_vdev_cache_size | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 0 (vdev cache is disabled) |
Change | Dynamic |
Verification | vdev cache statistics are available in the /proc/spl/kstat/zfs/vdev_cache_stats file |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some cases actually harmful. Thus it is disabled by setting the zfs_vdev_cache_size to zero. This related tunable is, by default, inoperative.
All read I/Os smaller than zfs_vdev_cache_max are turned into
(1 << zfs_vdev_cache_bshift
) byte reads by the vdev cache. At most
zfs_vdev_cache_size bytes will be kept in each vdev's cache.
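The read inflation performed by the vdev cache can be sketched as below. The function name is hypothetical, and the sketch assumes the vdev cache is enabled (zfs_vdev_cache_size greater than zero), which is not the default.

```python
def vdev_cache_read_size(io_size,
                         zfs_vdev_cache_max=16384,
                         zfs_vdev_cache_bshift=16):
    """Size in bytes of the read issued to the device for a given I/O.

    Assumes the vdev cache is enabled; with the default
    zfs_vdev_cache_size of 0 no inflation takes place.
    """
    if io_size < zfs_vdev_cache_max:
        return 1 << zfs_vdev_cache_bshift   # 1 << 16 is 65,536 bytes
    return io_size
```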
zfs_vdev_cache_bshift | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | shift |
Range | 1 to INT_MAX |
Default | 16 (65,536 bytes) |
Change | Dynamic |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some cases actually harmful. Thus it is disabled by setting the zfs_vdev_cache_size to zero. This related tunable is, by default, inoperative.
All read I/Os smaller than zfs_vdev_cache_max will be turned into
(1 <<
zfs_vdev_cache_bshift) byte reads by the vdev cache.
At most zfs_vdev_cache_size
bytes will be kept in each vdev's cache.
zfs_vdev_cache_max | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | bytes |
Range | 512 to INT_MAX |
Default | 16,384 (16 KiB) |
Change | Dynamic |
Versions Affected | all |
The mirror read algorithm uses current load and an incremental weighting value
to determine the vdev to service a read operation. Lower values determine
the preferred vdev.
The weighting value is zfs_vdev_mirror_rotating_inc
for rotating media and
zfs_vdev_mirror_non_rotating_inc for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_inc | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | Increasing for mirrors with both rotating and nonrotating media more strongly favors the nonrotating media |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The mirror read algorithm uses current load and an incremental weighting value
to determine the vdev to service a read operation. Lower values determine
the preferred vdev.
The weighting value is zfs_vdev_mirror_rotating_inc for rotating media and
zfs_vdev_mirror_non_rotating_inc
for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_inc | Notes |
---|---|
Tags | vdev, mirror, SSD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For rotating media in a mirror, if the next I/O offset is within
zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is incremented by (zfs_vdev_mirror_rotating_seek_inc / 2
).
Otherwise the weighting factor is increased by zfs_vdev_mirror_rotating_seek_inc
.
This algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
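The seek weighting for rotating media can be sketched as below. The function name is hypothetical, and measuring the distance from the offset of the previous I/O on that device is an assumption, since the text above does not say what the next offset is compared against. Integer division is used to mirror integer parameters.

```python
def rotating_seek_increment(next_offset, last_offset,
                            zfs_vdev_mirror_rotating_seek_inc=5,
                            zfs_vdev_mirror_rotating_seek_offset=1_048_576):
    """Weighting increment for a rotating mirror member.

    last_offset is assumed to be where the previous I/O on this
    device ended (an assumption made for this sketch).
    """
    distance = abs(next_offset - last_offset)
    if distance < zfs_vdev_mirror_rotating_seek_offset:
        return zfs_vdev_mirror_rotating_seek_inc // 2   # short seek
    return zfs_vdev_mirror_rotating_seek_inc            # long seek
```

A mirror member whose head is already near the requested offset receives the smaller increment and is therefore more likely to be chosen.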
zfs_vdev_mirror_rotating_seek_inc | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For rotating media in a mirror, if the next I/O offset is within
zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is
incremented by (zfs_vdev_mirror_rotating_seek_inc / 2
).
Otherwise the weighting factor is increased by zfs_vdev_mirror_rotating_seek_inc
.
This algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_seek_offset | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 1,048,576 (1 MiB) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
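
The seek weighting for rotating media can be illustrated with a short sketch. It is not the kernel implementation: `seek_increment` is a hypothetical helper, "within" is taken to mean an absolute distance strictly below zfs_vdev_mirror_rotating_seek_offset, and the halving uses integer division because the parameter is an int.

```shell
# seek_increment LAST_OFFSET NEXT_OFFSET [SEEK_INC] [SEEK_OFFSET]
# Print the weighting increment for a rotating mirror child.
# Defaults are the documented values: seek_inc=5, seek_offset=1048576 bytes.
seek_increment() (
    last=$1
    next=$2
    inc=${3:-5}
    limit=${4:-1048576}
    # Absolute distance between the previous and the next I/O (assumption).
    dist=$(( next >= last ? next - last : last - next ))
    if [ "$dist" -lt "$limit" ]; then
        echo $((inc / 2))    # short seek: half the increment
    else
        echo "$inc"          # long seek: full increment
    fi
)
```

With the defaults, an I/O 4 KiB away from the previous one adds 2 to the weighting, while an I/O 2 MiB away adds 5, so the child with the shorter seek is preferred.
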
For nonrotating media in a mirror, a seek penalty is applied as sequential I/O's can be aggregated into fewer operations, avoiding unnecessary per-command overhead, often boosting performance.
Verify the rotational setting described by a block device in SysFS by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_seek_inc | Notes |
---|---|
Tags | vdev, mirror, SSD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into a
large I/O.
For reads, aggregations occur across small adjacency gaps where
the gap is less than zfs_vdev_read_gap_limit
zfs_vdev_read_gap_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 32,768 (32 KiB) |
Change | Dynamic |
Versions Affected | all |
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into a
large I/O.
For writes, aggregations occur across small adjacency gaps where
the gap is less than zfs_vdev_write_gap_limit
zfs_vdev_write_gap_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 4,096 (4 KiB) |
Change | Dynamic |
Versions Affected | all |
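
The gap test for aggregation can be sketched as follows. `can_aggregate` is a hypothetical helper that takes the "less than" wording above literally; the real scheduler applies further conditions that are not modeled here.

```shell
# can_aggregate PREV_END NEXT_START GAP_LIMIT
# Succeed (exit status 0) when the gap in bytes between the end of one I/O
# and the start of the next is non-negative and below GAP_LIMIT.
can_aggregate() (
    gap=$(($2 - $1))
    [ "$gap" -ge 0 ] && [ "$gap" -lt "$3" ]
)

# Example: a 16 KiB gap is bridged for reads (limit 32768)
# but not for writes (limit 4096).
# can_aggregate 0 16384 32768 && echo "read: aggregate"
# can_aggregate 0 16384 4096  || echo "write: do not aggregate"
```
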
When the pool is imported, for whole disk vdevs, the block device I/O
scheduler is set to zfs_vdev_scheduler
.
The most common schedulers are: noop, cfq, bfq, anis
option to a non-zero value will override the default.
A value of 0 represents the default setting: the larger of 1/64 of physical memory or 512 KiB. However, once changed, dynamically setting zfs_arc_sys_free to 0 will not return to the default.
zfs_arc_sys_free | Notes |
---|---|
Tags | ARC, memory |
When to change | Change if more free memory is desired as a margin against memory demand by applications |
Data Type | ulong |
Units | bytes |
Range | 0 to ULONG_MAX |
Default | 0 (default to larger of 1/64 of physical memory or 512 KiB) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
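
As a worked example of the default, the following sketch computes the larger of 1/64 of physical memory or 512 KiB. `arc_sys_free_default` is a hypothetical helper, and the physical memory size is passed in as an argument rather than probed.

```shell
# arc_sys_free_default PHYSMEM_BYTES
# Print the larger of PHYSMEM_BYTES / 64 and 512 KiB.
arc_sys_free_default() (
    frac=$(($1 / 64))
    floor=$((512 * 1024))
    echo $(( frac > floor ? frac : floor ))
)
```

For a system with 16 GiB of memory this yields 268,435,456 bytes (256 MiB); the 512 KiB floor only matters below 32 MiB of memory.
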
Disable reading zpool.cache file (see spa_config_path) when loading the zfs module.
zfs_autoimport_disable | Notes |
---|---|
Tags | import |
When to change | Leave as default so that zfs behaves as other Linux kernel modules |
Data Type | boolean |
Range | 0=read zpool.cache at module load, 1=do not read zpool.cache at module load |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_commit_timeout_pct
controls the amount of time that a log (ZIL) write
block (lwb) remains "open" when it isn't "full" and it has a thread waiting
to commit to stable storage.
The timeout is scaled based on a percentage of the last lwb
latency to avoid significantly impacting the latency of each individual
intent log transaction (itx).
zfs_commit_timeout_pct | Notes |
---|---|
Tags | ZIL |
When to change | TBD |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.8.0 |
Internally ZFS keeps a small log to facilitate debugging.
The contents of the log are in the /proc/spl/kstat/zfs/dbgmsg
file.
Writing 0 to /proc/spl/kstat/zfs/dbgmsg
file clears the log.
See also zfs_dbgmsg_maxsize
zfs_dbgmsg_enable | Notes |
---|---|
Tags | debug |
When to change | To view ZFS internal debug log |
Data Type | boolean |
Range | 0=do not log debug messages, 1=log debug messages |
Default | 0 (1 for debug builds) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
The /proc/spl/kstat/zfs/dbgmsg
file size limit is set by
zfs_dbgmsg_maxsize.
See also zfs_dbgmsg_enable
zfs_dbgmsg_maxsize | Notes |
---|---|
Tags | debug |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 4 MiB |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
The zfs_dbuf_state_index
feature is currently unused. It is normally used
for controlling values in the /proc/spl/kstat/zfs/dbufs
file.
zfs_dbuf_state_index | Notes |
---|---|
Tags | debug |
When to change | Do not change |
Data Type | int |
Units | TBD |
Range | TBD |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
When a pool sync operation takes longer than zfs_deadman_synctime_ms
milliseconds, a "slow spa_sync" message is logged to the debug log
(see zfs_dbgmsg_enable). If zfs_deadman_enabled
is
set to 1, then all pending IO operations are also checked and if any haven't
completed within zfs_deadman_synctime_ms milliseconds, a "SLOW IO" message
is logged to the debug log and a "deadman" system event (see zpool events
command) with the details of the hung IO is posted.
zfs_deadman_enabled | Notes |
---|---|
Tags | debug |
When to change | To disable logging of slow I/O |
Data Type | boolean |
Range | 0=do not log slow I/O, 1=log slow I/O |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.8.0 |
Once a pool sync operation has taken longer than zfs_deadman_synctime_ms milliseconds, continue to check for slow operations every zfs_deadman_checktime_ms milliseconds.
zfs_deadman_checktime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 60,000 (1 minute) |
Change | Dynamic |
Versions Affected | v0.8.0 |
When an individual I/O takes longer than zfs_deadman_ziotime_ms
milliseconds,
then the operation is considered to be "hung". If zfs_deadman_enabled
is set then the deadman behaviour is invoked as described by the
zfs_deadman_failmode option.
zfs_deadman_ziotime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to ULONG_MAX |
Default | 300,000 (5 minutes) |
Change | Dynamic |
Versions Affected | v0.8.0 |
The I/O deadman timer expiration time has two meanings:
- determines when the spa_deadman() logic should fire, indicating the txg sync has not completed in a timely manner
- determines if an I/O is considered "hung"
In version v0.8.0, any I/O that has not completed in zfs_deadman_synctime_ms
is considered "hung" resulting in one of three behaviors controlled by the
zfs_deadman_failmode parameter.
zfs_deadman_synctime_ms
takes effect if zfs_deadman_enabled = 1.
zfs_deadman_synctime_ms | Notes |
---|---|
Tags | debug |
When to change | When debugging slow I/O |
Data Type | ulong |
Units | milliseconds |
Range | 1 to |
Default | 600,000 (10 minutes) |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
zfs_deadman_failmode controls the behavior of the I/O deadman timer when it detects a "hung" I/O. Valid values are:
- wait - Wait for the "hung" I/O (default)
- continue - Attempt to recover from a "hung" I/O
- panic - Panic the system
zfs_deadman_failmode | Notes |
---|---|
Tags | debug |
When to change | In some cluster cases, panic can be appropriate |
Data Type | string |
Range | wait, continue, or panic |
Default | wait |
Change | Dynamic |
Versions Affected | v0.8.0 |
ZFS can prefetch deduplication table (DDT) entries. zfs_dedup_prefetch
allows
DDT prefetches to be enabled.
zfs_dedup_prefetch | Notes |
---|---|
Tags | prefetch, memory |
When to change | For systems with limited RAM using the dedup feature, disabling deduplication table prefetch can reduce memory pressure |
Data Type | boolean |
Range | 0=do not prefetch, 1=prefetch dedup table entries |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
zfs_delete_blocks
defines a large file for the purposes of delete.
Files containing more than zfs_delete_blocks
blocks will be deleted asynchronously
while smaller files are deleted synchronously.
Decreasing this value reduces the time spent in an unlink(2)
system call at
the expense of a longer delay before the freed space is available.
The zfs_delete_blocks
value is specified in blocks, not bytes. The size of
blocks can vary and is ultimately limited by the filesystem's recordsize
property.
zfs_delete_blocks | Notes |
---|---|
Tags | filesystem, delete |
When to change | If applications delete large files and blocking on unlink(2) is not desired |
Data Type | ulong |
Units | blocks |
Range | 1 to ULONG_MAX |
Default | 20,480 |
Change | Dynamic |
Versions Affected | all |
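
The threshold can be made concrete with a small sketch. Both helpers are hypothetical; `large_file_bytes` assumes every block is a full recordsize block, so it gives an upper estimate of the file size at which deletion becomes asynchronous.

```shell
# delete_mode FILE_BLOCKS [ZFS_DELETE_BLOCKS]
# Print "async" when the file has more blocks than the threshold, else "sync".
delete_mode() (
    limit=${2:-20480}
    if [ "$1" -gt "$limit" ]; then echo async; else echo sync; fi
)

# large_file_bytes RECORDSIZE [ZFS_DELETE_BLOCKS]
# Print the threshold in bytes, assuming every block is RECORDSIZE bytes.
large_file_bytes() (
    echo $(( $1 * ${2:-20480} ))
)
```

With the default of 20,480 blocks and a 128 KiB recordsize, the threshold corresponds to 2,684,354,560 bytes (2.5 GiB).
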
The ZFS write throttle begins to delay each transaction when the amount of
dirty data reaches the threshold zfs_delay_min_dirty_percent
of
zfs_dirty_data_max.
This value should be >= zfs_vdev_async_write_active_max_dirty_percent.
zfs_delay_min_dirty_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 60 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_delay_scale
controls how quickly the ZFS write throttle transaction
delay approaches infinity.
Larger values cause longer delays for a given amount of dirty data.
For the smoothest delay, this value should be about 1 billion divided
by the maximum number of write operations per second the pool can sustain.
The throttle will smoothly handle between 10x and 1/10th zfs_delay_scale
.
Note: zfs_delay_scale
* zfs_dirty_data_max must be < 2^64.
zfs_delay_scale | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | scalar (nanoseconds) |
Range | 0 to ULONG_MAX |
Default | 500,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
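
The rule of thumb above (about 1 billion divided by the maximum sustainable write operations per second) is simple arithmetic. A minimal sketch, with `delay_scale_for_iops` as a hypothetical helper name:

```shell
# delay_scale_for_iops MAX_WRITE_IOPS
# Print 1,000,000,000 / MAX_WRITE_IOPS as a candidate zfs_delay_scale.
delay_scale_for_iops() (
    echo $((1000000000 / $1))
)
```

By this rule the default of 500,000 corresponds to a pool that sustains about 2,000 write operations per second; a pool that sustains 100,000 would suggest a value of 10,000.
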
zfs_dirty_data_max
is the ZFS write throttle dirty space limit.
Once this limit is exceeded, new writes are delayed until space is freed by
writes being committed to the pool.
zfs_dirty_data_max takes precedence over zfs_dirty_data_max_percent.
zfs_dirty_data_max | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | bytes |
Range | 1 to zfs_dirty_data_max_max |
Default | 10% of physical RAM |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_percent
is an alternative method of specifying
zfs_dirty_data_max, the ZFS write throttle dirty space limit.
Once this limit is exceeded, new writes are delayed until space is freed by
writes being committed to the pool.
zfs_dirty_data_max takes precedence over zfs_dirty_data_max_percent
.
zfs_dirty_data_max_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 10% of physical RAM |
Change | Prior to zfs module load or a memory hot plug event |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_max
is the maximum allowable value of
zfs_dirty_data_max.
zfs_dirty_data_max_max
takes precedence over zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | ulong |
Units | bytes |
Range | 1 to physical RAM size |
Default | 25% of physical RAM |
Change | Prior to zfs module load |
Versions Affected | v0.6.4 and later |
zfs_dirty_data_max_max_percent
is an alternative to zfs_dirty_data_max_max
for setting the maximum allowable value of zfs_dirty_data_max.
zfs_dirty_data_max_max takes precedence over zfs_dirty_data_max_max_percent.
zfs_dirty_data_max_max_percent | Notes |
---|---|
Tags | write_throttle |
When to change | See section "ZFS TRANSACTION DELAY" |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 25% of physical RAM |
Change | Prior to zfs module load |
Versions Affected | v0.6.4 and later |
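
The interaction of the four dirty data limits can be sketched as follows. This is an illustration under stated assumptions, not the module's code: `effective_dirty_data_max` is a hypothetical helper, an explicit byte value of 0 stands for "not set", only the percentage form of the upper limit is modeled, and the result is assumed to be clamped to that upper limit.

```shell
# effective_dirty_data_max PHYSMEM [EXPLICIT_BYTES] [PERCENT] [MAX_PERCENT]
# Print the dirty data limit in bytes.
effective_dirty_data_max() (
    physmem=$1
    explicit=${2:-0}     # 0 means "not set" in this sketch
    pct=${3:-10}         # zfs_dirty_data_max_percent
    max_pct=${4:-25}     # zfs_dirty_data_max_max_percent
    cap=$((physmem * max_pct / 100))
    # An explicit byte value takes precedence over the percentage.
    want=$(( explicit > 0 ? explicit : physmem * pct / 100 ))
    echo $(( want < cap ? want : cap ))
)
```

On a 16 GiB system the defaults give a limit of about 1.6 GiB, while an explicit request of 8 GiB would be held to the 4 GiB cap.
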
When there is at least zfs_dirty_data_sync
dirty data, a transaction group
sync is started. This allows a transaction group sync to occur more frequently
than the transaction group timeout interval (see zfs_txg_timeout)
when there is dirty data to be written.
zfs_dirty_data_sync | Notes |
---|---|
Tags | write_throttle |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 1 to ULONG_MAX |
Default | 67,108,864 (64 MiB) |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Fletcher-4 is the default checksum algorithm for metadata and data.
When the zfs kernel module is loaded, a set of microbenchmarks are run to
determine the fastest algorithm for the current hardware. The
zfs_fletcher_4_impl
parameter allows a specific implementation to be
specified other than the default (fastest).
Selectors other than fastest and scalar require instruction
set extensions to be available and will only appear if ZFS detects their
presence. The scalar implementation works on all processors.
The results of the microbenchmark are visible in the
/proc/spl/kstat/zfs/fletcher_4_bench
file.
Larger numbers indicate better performance.
Since ZFS is processor endian-independent, the microbenchmark is run
against both big and little-endian transformation.
zfs_fletcher_4_impl | Notes |
---|---|
Tags | CPU, checksum |
When to change | Testing Fletcher-4 algorithms |
Data Type | string |
Range | fastest, scalar, superscalar, superscalar4, sse2, ssse3, avx2, avx512f, or aarch64_neon depending on hardware support |
Default | fastest |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The processing of the free_bpobj object can be enabled by
zfs_free_bpobj_enabled
zfs_free_bpobj_enabled | Notes |
---|---|
Tags | delete |
When to change | If there's a problem with processing free_bpobj (e.g. i/o error or bug) |
Data Type | boolean |
Range | 0=do not process free_bpobj objects, 1=process free_bpobj objects |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_free_max_blocks
sets the maximum number of blocks to be freed in a single
transaction group (txg). For workloads that delete (free) large numbers of
blocks in a short period of time, the processing of the frees can negatively
impact other operations, including txg commits. zfs_free_max_blocks
acts as a
limit to reduce the impact.
zfs_free_max_blocks | Notes |
---|---|
Tags | filesystem, delete |
When to change | For workloads that delete large files, zfs_free_max_blocks can be adjusted to meet performance requirements while reducing the impacts of deletion |
Data Type | ulong |
Units | blocks |
Range | 1 to ULONG_MAX |
Default | 100,000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
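
Because at most zfs_free_max_blocks blocks are freed per txg, a large delete needs at least a ceiling-division number of txgs to complete. A minimal sketch, with `min_txgs_to_free` as a hypothetical helper:

```shell
# min_txgs_to_free BLOCKS [ZFS_FREE_MAX_BLOCKS]
# Print the minimum number of txgs needed to free BLOCKS blocks.
min_txgs_to_free() (
    limit=${2:-100000}
    echo $(( ($1 + limit - 1) / limit ))   # ceiling division
)
```

Freeing 250,000 blocks with the default limit therefore spans at least 3 txgs.
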
Maximum asynchronous read I/Os active to each device.
zfs_vdev_async_read_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 3 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Minimum asynchronous read I/Os active to each device.
zfs_vdev_async_read_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to (zfs_vdev_async_read_max_active - 1) |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
When the amount of dirty data exceeds the threshold
zfs_vdev_async_write_active_max_dirty_percent
of zfs_dirty_data_max
dirty data, then zfs_vdev_async_write_max_active is used to
limit active async writes.
If the dirty data is between
zfs_vdev_async_write_active_min_dirty_percent
and zfs_vdev_async_write_active_max_dirty_percent
, the active I/O limit is
linearly interpolated between zfs_vdev_async_write_min_active
and zfs_vdev_async_write_max_active
zfs_vdev_async_write_active_max_dirty_percent | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | percent of zfs_dirty_data_max |
Range | 0 to 100 |
Default | 60 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
If the amount of dirty data is between
zfs_vdev_async_write_active_min_dirty_percent
and zfs_vdev_async_write_active_max_dirty_percent
of zfs_dirty_data_max,
the active I/O limit is linearly interpolated between
zfs_vdev_async_write_min_active and
zfs_vdev_async_write_max_active
zfs_vdev_async_write_active_min_dirty_percent | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | percent of zfs_dirty_data_max |
Range | 0 to (zfs_vdev_async_write_active_max_dirty_percent - 1) |
Default | 30 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
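
The linear interpolation of the async write limit can be sketched as below. `async_write_limit` is a hypothetical helper using integer arithmetic; it assumes that zfs_vdev_async_write_min_active applies at or below the lower threshold, and its defaults are the v0.7.0 values from the tables in this section.

```shell
# async_write_limit DIRTY_PCT [MIN_PCT] [MAX_PCT] [MIN_ACTIVE] [MAX_ACTIVE]
# Print the active async write limit for a given percentage of dirty data.
async_write_limit() (
    dirty=$1
    min_pct=${2:-30}
    max_pct=${3:-60}
    min_active=${4:-2}
    max_active=${5:-10}
    if [ "$dirty" -le "$min_pct" ]; then
        echo "$min_active"
    elif [ "$dirty" -ge "$max_pct" ]; then
        echo "$max_active"
    else
        # Linear interpolation between the two thresholds.
        echo $(( min_active + (dirty - min_pct) * (max_active - min_active) / (max_pct - min_pct) ))
    fi
)
```

With the defaults, 45% dirty data lies halfway between the thresholds and gives a limit of 6 active async writes per device.
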
zfs_vdev_async_write_max_active
sets the maximum asynchronous
write I/Os active to each device.
zfs_vdev_async_write_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_async_write_min_active
sets the minimum asynchronous write I/Os active to each device.
Lower values are associated with better latency on rotational media but poorer resilver performance. The default value of 2 was chosen as a compromise. A value of 3 has been shown to improve resilver performance further at a cost of further increasing latency.
zfs_vdev_async_write_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_async_write_max_active |
Default | 1 for v0.6.x, 2 for v0.7.0 and later |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
The maximum number of I/Os active to each device. Ideally,
zfs_vdev_max_active
>= the sum of each queue's max_active.
Once queued to the device, the ZFS I/O scheduler is no longer able to prioritize I/O operations. The underlying device drivers have their own scheduler and queue depth limits. Values larger than the device's maximum queue depth can have the effect of increased latency as the I/Os are queued in the intervening device driver layers.
zfs_vdev_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | sum of each queue's min_active to UINT32_MAX |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
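
To check the guideline that zfs_vdev_max_active should be at least the sum of each queue's max_active, the per-queue limits can simply be added up. `queue_max_sum` is a hypothetical helper; the example values are the defaults for the queues documented in this section only.

```shell
# queue_max_sum N...
# Print the sum of the given per-queue max_active values.
queue_max_sum() (
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$total"
)

# Example with async read (3), async write (10), scrub (2),
# sync read (10) and sync write (10):
# queue_max_sum 3 10 2 10 10
```

The example sum is 35, well below the default zfs_vdev_max_active of 1,000.
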
zfs_vdev_scrub_max_active
sets the maximum scrub or scan
read I/Os active to each device.
zfs_vdev_scrub_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler, scrub, resilver |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 2 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_scrub_min_active
sets the minimum scrub or scan read I/Os active
to each device.
zfs_vdev_scrub_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler, scrub, resilver |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_scrub_max_active |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum synchronous read I/Os active to each device.
zfs_vdev_sync_read_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_read_min_active
sets the minimum synchronous read I/Os
active to each device.
zfs_vdev_sync_read_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_sync_read_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_write_max_active
sets the maximum synchronous write I/Os active
to each device.
zfs_vdev_sync_write_max_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
zfs_vdev_sync_write_min_active
sets the minimum synchronous write I/Os
active to each device.
zfs_vdev_sync_write_min_active | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | I/O operations |
Range | 1 to zfs_vdev_sync_write_max_active |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Maximum number of queued allocations per top-level vdev expressed as
a percentage of zfs_vdev_async_write_max_active.
This allows the system to detect devices that are more capable of handling allocations
and to allocate more blocks to those devices. It also allows for dynamic
allocation distribution when devices are imbalanced as fuller devices
will tend to be slower than empty devices. Once the queue depth
reaches (zfs_vdev_queue_depth_pct
* zfs_vdev_async_write_max_active / 100)
then the allocator will stop allocating blocks on that top-level device and
switch to the next.
See also zio_dva_throttle_enabled
zfs_vdev_queue_depth_pct | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | See the section "ZFS I/O SCHEDULER" |
Data Type | uint32 |
Units | percent of zfs_vdev_async_write_max_active |
Range | 1 to UINT32_MAX |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
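
The queue depth at which the allocator moves on to the next top-level vdev follows directly from the formula above. A minimal sketch, with `alloc_queue_depth` as a hypothetical helper:

```shell
# alloc_queue_depth [QUEUE_DEPTH_PCT] [ASYNC_WRITE_MAX_ACTIVE]
# Print zfs_vdev_queue_depth_pct * zfs_vdev_async_write_max_active / 100.
alloc_queue_depth() (
    echo $(( ${1:-1000} * ${2:-10} / 100 ))
)
```

With the defaults (1,000 percent and 10 active async writes) the limit is 100 queued allocations per top-level vdev.
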
Disable duplicate buffer eviction from ARC.
zfs_disable_dup_eviction | Notes |
---|---|
Tags | ARC, dedup |
When to change | TBD |
Data Type | boolean |
Range | 0=duplicate buffers can be evicted, 1=do not evict duplicate buffers |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5, deprecated in v0.7.0 |
Snapshots of filesystems are normally automounted under the filesystem's
.zfs/snapshot
subdirectory. When not in use, snapshots are unmounted
after zfs_expire_snapshot seconds.
zfs_expire_snapshot | Notes |
---|---|
Tags | filesystem, snapshot |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 0 disables automatic unmounting, maximum time is INT_MAX |
Default | 300 |
Change | Dynamic |
Versions Affected | v0.6.1 and later |
Allow the creation, removal, or renaming of entries in the .zfs/snapshot
subdirectory to cause the creation, destruction, or renaming of snapshots.
When enabled this functionality works both locally and over NFS exports
which have the "no_root_squash" option set.
zfs_admin_snapshot | Notes |
---|---|
Tags | filesystem, snapshot |
When to change | TBD |
Data Type | boolean |
Range | 0=do not allow snapshot manipulation via the filesystem, 1=allow snapshot manipulation via the filesystem |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
Set additional debugging flags (see zfs_dbgmsg_enable)
flag value | symbolic name | description |
---|---|---|
0x1 | ZFS_DEBUG_DPRINTF | Enable dprintf entries in the debug log |
0x2 | ZFS_DEBUG_DBUF_VERIFY | Enable extra dbuf verifications |
0x4 | ZFS_DEBUG_DNODE_VERIFY | Enable extra dnode verifications |
0x8 | ZFS_DEBUG_SNAPNAMES | Enable snapshot name verification |
0x10 | ZFS_DEBUG_MODIFY | Check for illegally modified ARC buffers |
0x20 | ZFS_DEBUG_SPA | Enable spa_dbgmsg entries in the debug log |
0x40 | ZFS_DEBUG_ZIO_FREE | Enable verification of block frees |
0x80 | ZFS_DEBUG_HISTOGRAM_VERIFY | Enable extra spacemap histogram verifications |
0x100 | ZFS_DEBUG_METASLAB_VERIFY | Verify space accounting on disk matches in-core range_trees |
0x200 | ZFS_DEBUG_SET_ERROR | Enable SET_ERROR and dprintf entries in the debug log |
zfs_flags | Notes |
---|---|
Tags | debug |
When to change | When debugging ZFS |
Data Type | int |
Default | 0 no debug flags set, for debug builds: all except ZFS_DEBUG_DPRINTF and ZFS_DEBUG_SPA |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
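
Since zfs_flags is a bit mask, several flags from the table are combined with a bitwise OR. The following sketch shows the arithmetic; `zfs_flags_or` is a hypothetical helper and only computes the value, it does not write it anywhere.

```shell
# zfs_flags_or FLAG...
# OR together flag values from the table above and print the result in hex.
zfs_flags_or() (
    acc=0
    for f in "$@"; do
        acc=$((acc | $f))
    done
    printf '0x%x\n' "$acc"
)

# Example: ZFS_DEBUG_DPRINTF (0x1) together with ZFS_DEBUG_SET_ERROR (0x200)
# zfs_flags_or 0x1 0x200
```
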
If destroy encounters an I/O error (EIO) while reading metadata (eg indirect
blocks), space referenced by the missing metadata cannot be freed.
Normally, this causes the background destroy to become "stalled", as the
destroy is unable to make forward progress. While in this stalled state,
all remaining space to free from the error-encountering filesystem is
temporarily leaked. Set zfs_free_leak_on_eio = 1
to ignore the EIO,
permanently leak the space from indirect blocks that can not be read,
and continue to free everything else that it can.
The default, stalling behavior is useful if the storage partially fails (eg some but not all I/Os fail), and then later recovers. In this case, we will be able to continue pool operations while it is partially failed, and when it recovers, we can continue to free the space, with no leaks. However, note that this case is rare.
Typically pools either:
1. fail completely (but perhaps temporarily, e.g. a top-level vdev going offline)
2. have localized, permanent errors (e.g. disk returns the wrong data due to bit flip or firmware bug)
In case (1), the zfs_free_leak_on_eio
setting does not matter because the
pool will be suspended and the sync thread will not be able to make
forward progress. In case (2), because the error is
permanent, the best we can do is leak the minimum amount of space.
Therefore, it is reasonable for zfs_free_leak_on_eio
to be set, but by default
the more conservative approach is taken, so that there is no
possibility of leaking space in the "partial temporary" failure case.
zfs_free_leak_on_eio | Notes |
---|---|
Tags | debug |
When to change | When debugging I/O errors during destroy |
Data Type | boolean |
Range | 0=normal behavior, 1=ignore error and permanently leak space |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.5 and later |
During a zfs destroy
operation using feature@async_destroy
a
minimum of zfs_free_min_time_ms
time will be spent working on freeing blocks
per txg commit.
zfs_free_min_time_ms | Notes |
---|---|
Tags | delete |
When to change | TBD |
Data Type | int |
Units | milliseconds |
Range | 1 to (zfs_txg_timeout * 1000) |
Default | 1,000 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
If a pool does not have a log device, data blocks equal to or larger than
zfs_immediate_write_sz
are treated as if the dataset being written to had
the property setting logbias=throughput
Terminology note: logbias=throughput
writes the blocks in "indirect mode"
to the ZIL where the data is written to the pool and a pointer to the data
is written to the ZIL.
zfs_immediate_write_sz | Notes |
---|---|
Tags | ZIL |
When to change | TBD |
Data Type | long |
Units | bytes |
Range | 512 to 16,777,216 (valid block sizes) |
Default | 32,768 (32 KiB) |
Change | Dynamic |
Verification | Data blocks that exceed zfs_immediate_write_sz or are written as logbias=throughput increment the zil_itx_indirect_count entry in /proc/spl/kstat/zfs/zil
|
Versions Affected | all |
ZFS supports logical record (block) sizes from 512 bytes to 16 MiB.
The benefits of larger blocks, and thus larger average I/O sizes, can be
weighed against the cost of copy-on-write of a large block to modify one byte.
Additionally, very large blocks can have a negative impact on both I/O latency
at the device level and the memory allocator. The zfs_max_recordsize
parameter limits the upper bound of the dataset volblocksize and recordsize
properties.
Larger blocks can be created by enabling the zpool
large_blocks
feature and
changing zfs_max_recordsize
. Pools with larger blocks can always be
imported and used, regardless of the value of zfs_max_recordsize
.
For 32-bit systems, zfs_max_recordsize
also limits the size of kernel virtual
memory caches used in the ZFS I/O pipeline (zio_buf_*
and zio_data_buf_*
).
See also the zpool
large_blocks
feature.
zfs_max_recordsize | Notes |
---|---|
Tags | filesystem, memory, volume |
When to change | To create datasets with larger volblocksize or recordsize |
Data Type | int |
Units | bytes |
Range | 512 to 16,777,216 (valid block sizes) |
Default | 1,048,576 |
Change | Dynamic, set prior to creating volumes or changing filesystem recordsize |
Versions Affected | v0.6.5 and later |
zfs_mdcomp_disable
allows metadata compression to be disabled.
zfs_mdcomp_disable | Notes |
---|---|
Tags | CPU, metadata |
When to change | When CPU cycles cost less than I/O |
Data Type | boolean |
Range | 0=compress metadata, 1=do not compress metadata |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
Allow metaslabs to keep their active state as long as their fragmentation
percentage is less than or equal to this value. When writing, an active
metaslab whose fragmentation percentage exceeds
zfs_metaslab_fragmentation_threshold
is avoided allowing metaslabs with less
fragmentation to be preferred.
Metaslab fragmentation is used to calculate the overall pool fragmentation
property value. However, individual metaslab fragmentation levels are
observable using zdb
with the -mm
option.
zfs_metaslab_fragmentation_threshold
works at the metaslab level and each
top-level vdev has approximately metaslabs_per_vdev metaslabs.
See also zfs_mg_fragmentation_threshold
zfs_metaslab_fragmentation_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | Testing metaslab allocation |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 70 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
Metaslab groups (top-level vdevs) are considered eligible for allocations
if their fragmentation percentage metric is less than or equal to
zfs_mg_fragmentation_threshold
. If a metaslab group exceeds this threshold
then it will be skipped unless all metaslab groups within the metaslab class
have also crossed the zfs_mg_fragmentation_threshold
threshold.
zfs_mg_fragmentation_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | Testing metaslab allocation |
Data Type | int |
Units | percent |
Range | 1 to 100 |
Default | 85 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
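
The eligibility rule for metaslab groups can be sketched as a small filter. `eligible_groups` is a hypothetical helper that models only the rule described above: groups at or below the threshold are eligible, and if every group exceeds it, none is skipped.

```shell
# eligible_groups THRESHOLD FRAG...
# Print the 1-based positions of the metaslab groups that accept allocations.
eligible_groups() (
    threshold=$1
    shift
    picked=""
    all=""
    i=0
    for frag in "$@"; do
        i=$((i + 1))
        all="$all $i"
        if [ "$frag" -le "$threshold" ]; then
            picked="$picked $i"
        fi
    done
    # If every group is over the threshold, all of them stay eligible.
    if [ -z "$picked" ]; then
        picked=$all
    fi
    echo $picked    # unquoted on purpose to trim the leading space
)
```

For example, with the default threshold of 85 and three groups at 90%, 40% and 86% fragmentation, only the second group is eligible.
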
Metaslab groups (top-level vdevs) with free space percentage greater than
zfs_mg_noalloc_threshold
are eligible for new allocations.
If a metaslab group's free space is less than or equal to the
threshold, the allocator avoids allocating to that group
unless all groups in the pool have reached the threshold. Once all
metaslab groups have reached the threshold, all metaslab groups are allowed
to accept allocations. The default value of 0 disables the feature and causes
all metaslab groups to be eligible for allocations.
This parameter allows one to deal with pools having heavily imbalanced
vdevs such as would be the case when a new vdev has been added.
Setting the threshold to a non-zero percentage will stop allocations
from being made to vdevs that aren't filled to the specified percentage
and allow lesser filled vdevs to acquire more allocations than they
otherwise would under the older zfs_mg_alloc_failures
facility.
zfs_mg_noalloc_threshold | Notes |
---|---|
Tags | allocation, fragmentation, vdev |
When to change | To force rebalancing as top-level vdevs are added or expanded |
Data Type | int |
Units | percent |
Range | 0 to 100 |
Default | 0 (disabled) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The pool multihost
multimodifier protection (MMP) subsystem can record
historical updates in the /proc/spl/kstat/zfs/POOL_NAME/multihost
file
for debugging purposes.
The number of lines of history is determined by zfs_multihost_history.
zfs_multihost_history | Notes |
---|---|
Tags | MMP, import |
When to change | When testing multihost feature |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_interval
controls the frequency of multihost writes performed
by the pool multihost multimodifier protection (MMP) subsystem.
The multihost write period is (zfs_multihost_interval
/ number of leaf-vdevs)
milliseconds.
Thus on average a multihost write will be issued for each leaf vdev every
zfs_multihost_interval
milliseconds. In practice, the observed period can
vary with the I/O load and this observed value is the delay which is stored in
the uberblock.
On import the multihost activity check waits a minimum amount of time
determined by (zfs_multihost_interval
* zfs_multihost_import_intervals)
with a lower bound of 1 second.
The activity check time may be further extended if the value of mmp delay
found in the best uberblock indicates actual multihost updates happened at
longer intervals than zfs_multihost_interval.
Note: the multihost protection feature applies to storage devices that can be shared between multiple systems.
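As a worked example of the arithmetic above, the following Python sketch computes the multihost write period and the minimum activity-check wait. The function names are hypothetical, and the extension based on the mmp delay stored in the uberblock is deliberately left out.

```python
def mmp_write_period_ms(interval_ms, leaf_vdevs):
    """Period between multihost writes across the pool, in milliseconds."""
    return interval_ms / leaf_vdevs


def min_import_wait_ms(interval_ms, import_intervals):
    """Minimum activity-check wait on import, with the 1 second lower bound."""
    return max(interval_ms * import_intervals, 1000)


# Defaults from the tables: zfs_multihost_interval=1000,
# zfs_multihost_import_intervals=10; a pool with 4 leaf vdevs is assumed.
print(mmp_write_period_ms(1000, 4))
print(min_import_wait_ms(1000, 10))
```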
zfs_multihost_interval | Notes |
---|---|
Tags | MMP, import, vdev |
When to change | To optimize pool import time against possibility of simultaneous import by another system |
Data Type | ulong |
Units | milliseconds |
Range | 100 to ULONG_MAX |
Default | 1000 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_import_intervals
controls the duration of the activity test on
pool import for the multihost multimodifier protection (MMP) subsystem.
The activity test can be expected to take a minimum time of
(zfs_multihost_import_intervals * zfs_multihost_interval * random(25%))
milliseconds. The random period of up to 25% improves simultaneous import
detection. For example, if two hosts are rebooted at the same time and
automatically attempt to import the pool, then it is highly probable that
one host will win.
Smaller values of zfs_multihost_import_intervals
reduce the
import time but increase the risk of failing to detect an active pool.
The total activity check time is never allowed to drop below one second.
Note: the multihost protection feature applies to storage devices that can be shared between multiple systems.
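The effect of the random component can be illustrated with a short Python sketch. It assumes that "random(25%)" means the base time is extended by a random amount of up to 25%; that reading, the function name, and the use of a uniform distribution are assumptions made for illustration only.

```python
import random


def activity_test_ms(import_intervals, interval_ms, rng):
    """Sketch of the activity test duration in milliseconds.

    Assumes the base time is extended by a random 0-25% and that the
    result never drops below one second.
    """
    base = import_intervals * interval_ms
    jitter = base * rng.uniform(0.0, 0.25)
    return max(base + jitter, 1000.0)


# Two hosts rebooted together will usually pick different durations,
# so one of them finishes its check first.
host_a = activity_test_ms(10, 1000, random.Random())
host_b = activity_test_ms(10, 1000, random.Random())
print(host_a, host_b)
```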
zfs_multihost_import_intervals | Notes |
---|---|
Tags | MMP, import |
When to change | TBD |
Data Type | uint |
Units | intervals |
Range | 1 to UINT_MAX |
Default | 10 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_multihost_fail_intervals
controls the behavior of the pool when
write failures are detected in the multihost multimodifier protection (MMP)
subsystem.
If zfs_multihost_fail_intervals = 0
then multihost write failures are ignored.
The write failures are reported to the ZFS event daemon (zed),
which can take action such as suspending the pool or offlining a device.
If zfs_multihost_fail_intervals > 0
then sequential multihost write failures
will cause the pool to be suspended. This occurs when
(zfs_multihost_fail_intervals
* zfs_multihost_interval)
milliseconds have passed since the last successful multihost write.
This guarantees the activity test will see multihost writes if another
system attempts to import the pool.
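The suspend condition reduces to a single multiplication, shown here as a Python sketch. The function name is hypothetical, and returning `None` for the "failures are ignored" case is simply a convention of this example.

```python
def mmp_suspend_timeout_ms(fail_intervals, interval_ms):
    """Milliseconds without a successful multihost write before suspension.

    Returns None when zfs_multihost_fail_intervals is 0, meaning write
    failures do not suspend the pool.
    """
    if fail_intervals == 0:
        return None
    return fail_intervals * interval_ms


# Defaults from the tables: zfs_multihost_fail_intervals=5,
# zfs_multihost_interval=1000.
print(mmp_suspend_timeout_ms(5, 1000))
```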
zfs_multihost_fail_intervals | Notes |
---|---|
Tags | MMP, import |
When to change | TBD |
Data Type | uint |
Units | intervals |
Range | 0 to UINT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
overwhelmed by high rates of error reports which can be generated by failing,
high-performance devices. zfs_delays_per_second
limits the rate of
delay events reported to zed.
zfs_delays_per_second | Notes |
---|---|
Tags | zed, delay |
When to change | If processing delay events at a higher rate is desired |
Data Type | uint |
Units | events per second |
Range | 0 to UINT_MAX |
Default | 20 |
Change | Dynamic |
Versions Affected | v0.7.7 and later |
The ZFS Event Daemon (zed) processes events from ZFS. However, it can be
overwhelmed by high rates of error reports which can be generated by failing,
high-performance devices. zfs_checksums_per_second
limits the rate of
checksum events reported to zed.
Note: do not set this value lower than the SERD limit for checksum
in zed.
By default, checksum_N
= 10 and checksum_T
= 10 minutes, resulting in a
practical lower limit of 1.
zfs_checksums_per_second | Notes |
---|---|
Tags | zed, checksum |
When to change | If processing checksum error events at a higher rate is desired |
Data Type | uint |
Units | events per second |
Range | 0 to UINT_MAX |
Default | 20 |
Change | Dynamic |
Versions Affected | v0.7.7 and later |
When zfs_no_scrub_io = 1
scrubs do not actually scrub data and
simply perform a metadata crawl of the pool instead.
zfs_no_scrub_io | Notes |
---|---|
Tags | scrub |
When to change | Testing scrub feature |
Data Type | boolean |
Range | 0=perform scrub I/O, 1=do not perform scrub I/O |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
When zfs_no_scrub_prefetch = 1
, prefetch is disabled for scrub I/Os.
zfs_no_scrub_prefetch | Notes |
---|---|
Tags | prefetch, scrub |
When to change | Testing scrub feature |
Data Type | boolean |
Range | 0=prefetch scrub I/Os, 1=do not prefetch scrub I/Os |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.4 and later |
ZFS uses barriers (volatile cache flush commands) to ensure data is committed to permanent media by devices. This ensures consistent on-media state for devices where caches are volatile (eg HDDs).
For devices with nonvolatile caches, the cache flush operation can be a no-op. However, in some RAID arrays, cache flushes can cause the entire cache to be flushed to the backing devices.
To ensure on-media consistency, keep cache flush enabled.
zfs_nocacheflush | Notes |
---|---|
Tags | disks |
When to change | If the storage device has nonvolatile cache, then disabling cache flush can save the cost of occasional cache flush commands. |
Data Type | boolean |
Range | 0=send cache flush commands, 1=do not send cache flush commands |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
The NOP-write feature is enabled by default when a cryptographically secure
checksum algorithm is in use by the dataset. zfs_nopwrite_enabled
allows the
NOP-write feature to be completely disabled.
zfs_nopwrite_enabled | Notes |
---|---|
Tags | checksum, debug |
When to change | TBD |
Data Type | boolean |
Range | 0=disable NOP-write feature, 1=enable NOP-write feature |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
zfs_dmu_offset_next_sync
enables forcing txg sync to find holes.
This causes ZFS to act like older versions when the SEEK_HOLE
or SEEK_DATA
flags
are used: a dirty dnode causes txgs to be synced so that the previous data
can be found.
zfs_dmu_offset_next_sync | Notes |
---|---|
Tags | DMU |
When to change | TBD |
Data Type | boolean |
Range | 0=do not force txg sync to find holes, 1=force txg sync to find holes |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_pd_bytes_max
limits the number of bytes prefetched during a pool traversal
(eg zfs send
or other data crawling operations). These prefetches are
referred to as "prescient prefetches" and always have a 100% hit rate.
The traversal operations do not use the default data or metadata prefetcher.
zfs_pd_bytes_max | Notes |
---|---|
Tags | prefetch, send |
When to change | TBD |
Data Type | int32 |
Units | bytes |
Range | 0 to INT32_MAX |
Default | 52,428,800 (50 MiB) |
Change | Dynamic |
Versions Affected | TBD |
zfs_per_txg_dirty_frees_percent
as a percentage of zfs_dirty_data_max
controls the percentage of dirtied blocks from frees in one txg.
After the threshold is crossed, additional dirty blocks from frees
wait until the next txg.
Thus, when deleting large files, filling consecutive txgs with deletes/frees
does not throttle other, perhaps more important, writes.
A side effect of this throttle can impact zfs receive
workloads that contain a
large number of frees and have the ignore_hole_birth optimization
disabled. The symptom is that the receive workload causes an increase
in the frequency of txg commits. Since txg commits also flush data from volatile
caches in HDDs to media, HDD performance can be negatively impacted. Also, since
the frees do not consume much bandwidth over the pipe, the pipe can appear to stall.
Thus the overall progress of receives is slower than expected.
A value of zero will disable this throttle.
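To get a feel for the size of this threshold, the following Python sketch converts the percentage into bytes. The function name and the 4 GiB value for zfs_dirty_data_max are hypothetical; check the actual value on the system under /sys/module/zfs/parameters.

```python
def dirty_frees_limit_bytes(dirty_data_max, frees_percent):
    """Approximate per-txg limit on dirty data generated by frees.

    Returns None when the throttle is disabled (frees_percent == 0).
    """
    if frees_percent == 0:
        return None
    return dirty_data_max * frees_percent // 100


GIB = 2**30
# Assumed zfs_dirty_data_max of 4 GiB with the default of 30 percent.
print(dirty_frees_limit_bytes(4 * GIB, 30))
```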
zfs_per_txg_dirty_frees_percent | Notes |
---|---|
Tags | delete |
When to change | For zfs receive workloads, consider increasing or disabling. See section "ZFS I/O SCHEDULER" |
Data Type | ulong |
Units | percent |
Range | 0 to 100 |
Default | 30 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
zfs_prefetch_disable
controls the predictive prefetcher.
Note that it leaves "prescient" prefetch (eg prefetch for zfs send) intact
(see zfs_pd_bytes_max).
zfs_prefetch_disable | Notes |
---|---|
Tags | prefetch |
When to change | In some cases where the workload is completely random reads, overall performance can be better if prefetch is disabled |
Data Type | boolean |
Range | 0=prefetch enabled, 1=prefetch disabled |
Default | 0 |
Change | Dynamic |
Verification | prefetch efficacy is observed by arcstat, arc_summary, and the relevant entries in /proc/spl/kstat/zfs/arcstats |
Versions Affected | all |
zfs_read_chunk_size
is the limit for ZFS filesystem reads. If an application
issues a read()
larger than zfs_read_chunk_size
, then the read()
is divided
into multiple operations no larger than zfs_read_chunk_size.
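The splitting behavior can be pictured with a small Python sketch. It models only the arithmetic described above, not the ZFS read path, and `split_read` is a hypothetical helper.

```python
def split_read(offset, length, chunk_size=1_048_576):
    """Split one read into (offset, length) pieces no larger than chunk_size."""
    chunks = []
    end = offset + length
    while offset < end:
        n = min(chunk_size, end - offset)
        chunks.append((offset, n))
        offset += n
    return chunks


# A 2,500,000 byte read with the default 1 MiB chunk size.
for piece in split_read(0, 2_500_000):
    print(piece)
```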
zfs_read_chunk_size | Notes |
---|---|
Tags | filesystem |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 1,048,576 |
Change | Dynamic |
Versions Affected | all |
Historical statistics for the last zfs_read_history
reads are available in
/proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history | Notes |
---|---|
Tags | debug |
When to change | To observe read operation details |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
When zfs_read_history > 0
, zfs_read_history_hits controls whether ARC hits are
displayed in the read history file, /proc/spl/kstat/zfs/POOL_NAME/reads
zfs_read_history_hits | Notes |
---|---|
Tags | debug |
When to change | To observe read operation details with ARC hits |
Data Type | boolean |
Range | 0=do not include data for ARC hits, 1=include ARC hit data |
Default | 0 |
Change | Dynamic |
Versions Affected | all |
zfs_recover
can be set to true (1) to attempt to recover from
otherwise-fatal errors, typically caused by on-disk corruption.
When set, calls to zfs_panic_recover()
will turn into warning messages
rather than calling panic().
zfs_recover | Notes |
---|---|
Tags | import |
When to change | zfs_recover should only be used as a last resort, as it typically results in leaked space, or worse |
Data Type | boolean |
Range | 0=normal operation, 1=attempt recovery zpool import |
Default | 0 |
Change | Dynamic |
Verification | check output of dmesg and other logs for details |
Versions Affected | v0.6.4 and later |
Resilvers are processed by the sync thread in syncing context. While
resilvering, ZFS spends at least zfs_resilver_min_time_ms
time working on a
resilver between txg commits.
See also zfs_txg_timeout.
zfs_resilver_min_time_ms | Notes |
---|---|
Tags | resilver |
When to change | In some resilvering cases, increasing zfs_resilver_min_time_ms can result in faster completion |
Data Type | int |
Units | milliseconds |
Range | 1 to zfs_txg_timeout converted to milliseconds |
Default | 3,000 |
Change | Dynamic |
Versions Affected | all |
Scrubs are processed by the sync thread in syncing context. While
scrubbing, ZFS spends at least zfs_scrub_min_time_ms
time working on a
scrub between txg commits.
See also zfs_txg_timeout.
zfs_scrub_min_time_ms | Notes |
---|---|
Tags | scrub |
When to change | In some scrub cases, increasing zfs_scrub_min_time_ms can result in faster completion |
Data Type | int |
Units | milliseconds |
Range | 1 to zfs_txg_timeout converted to milliseconds |
Default | 1,000 |
Change | Dynamic |
Versions Affected | all |
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verification I/Os to disk
every zfs_scan_checkpoint_intval
seconds.
zfs_scan_checkpoint_intval | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | int |
Units | seconds |
Range | 1 to INT_MAX |
Default | 7,200 (2 hours) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
This tunable affects how scrub and resilver I/O segments are ordered. A higher number indicates that we care more about how filled in a segment is, while a lower number indicates we care more about the size of the extent without considering the gaps within a segment.
zfs_scan_fill_weight | Notes |
---|---|
Tags | resilver, scrub |
When to change | Testing sequential scrub and resilver |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 3 |
Change | Prior to zfs module load |
Versions Affected | v0.8.0 and later |
zfs_scan_issue_strategy
controls the order of data verification while scrubbing or
resilvering.
value | description |
---|---|
0 | ZFS will use strategy 1 during normal verification and strategy 2 while taking a checkpoint |
1 | data is verified as sequentially as possible, given the amount of memory reserved for scrubbing (see zfs_scan_mem_lim_fact). This can improve scrub performance if the pool's data is heavily fragmented. |
2 | the largest mostly-contiguous chunk of found data is verified first. By deferring scrubbing of small segments, we may later find adjacent data to coalesce and increase the segment size. |
zfs_scan_issue_strategy | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | enum |
Range | 0 to 2 |
Default | 0 |
Change | Dynamic |
Versions Affected | TBD |
Setting zfs_scan_legacy = 1
enables the legacy scan and scrub behavior
instead of the newer sequential behavior.
zfs_scan_legacy | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | TBD |
Units | TBD |
Range | 0=use new method: scrubs and resilvers gather metadata in memory before issuing sequential I/O, 1=use legacy algorithm: I/O is initiated as soon as it is discovered |
Default | 0 |
Change | Dynamic, however changing to 0 does not affect in-progress scrubs or resilvers |
Versions Affected | v0.8.0 and later |
zfs_scan_max_ext_gap
limits the largest gap in bytes between scrub and
resilver I/Os that will still be considered sequential for sorting purposes.
zfs_scan_max_ext_gap | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 2,097,152 (2 MiB) |
Change | Dynamic, however changing to 0 does not affect in-progress scrubs or resilvers |
Versions Affected | v0.8.0 and later |
zfs_scan_mem_lim_fact
limits the maximum fraction of RAM used for I/O sorting
by the sequential scan algorithm.
When the limit is reached, scanning metadata is stopped and
data verification I/O is started.
Data verification I/O continues until the memory used by the sorting
algorithm drops below zfs_scan_mem_lim_soft_fact.
zfs_scan_mem_lim_fact | Notes |
---|---|
Tags | memory, resilver, scrub |
When to change | TBD |
Data Type | int |
Units | divisor of physical RAM |
Range | TBD |
Default | 20 (physical RAM / 20 or 5%) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_scan_mem_lim_soft_fact
sets the fraction of the hard limit,
zfs_scan_mem_lim_fact, used to determine the RAM soft limit
for I/O sorting by the sequential scan algorithm.
After zfs_scan_mem_lim_fact has been reached, metadata scanning is stopped
until the RAM usage drops below zfs_scan_mem_lim_soft_fact.
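The two divisors can be turned into byte values with the Python sketch below. It follows the arithmetic given in the Units and Default rows of the tables (hard limit = RAM / zfs_scan_mem_lim_fact, soft limit = hard limit / zfs_scan_mem_lim_soft_fact). The function name is hypothetical, and the implementation in a given release may derive the soft limit differently, so treat the result as an estimate.

```python
def scan_mem_limits(physical_ram, lim_fact=20, lim_soft_fact=20):
    """Return (hard_limit, soft_limit) in bytes, per the documented divisors."""
    hard = physical_ram // lim_fact
    soft = hard // lim_soft_fact
    return hard, soft


# A machine with 32,000,000,000 bytes of RAM and default settings.
hard, soft = scan_mem_limits(32_000_000_000)
print(hard, soft)
```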
zfs_scan_mem_lim_soft_fact | Notes |
---|---|
Tags | resilver, scrub |
When to change | TBD |
Data Type | int |
Units | divisor of (physical RAM / zfs_scan_mem_lim_fact) |
Range | 1 to INT_MAX |
Default | 20 (for default zfs_scan_mem_lim_fact, 0.25% of physical RAM) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_scan_vdev_limit
is the maximum amount of data that can be issued concurrently
for scrubs and resilvers per leaf vdev.
zfs_scan_vdev_limit
attempts to strike a balance between keeping the leaf
vdev queues full of I/Os and not overflowing the queues, which would cause high
latency and long txg sync times.
While zfs_scan_vdev_limit
represents a bandwidth limit, the existing I/O
limit of zfs_vdev_scrub_max_active remains in effect, too.
zfs_scan_vdev_limit | Notes |
---|---|
Tags | resilver, scrub, vdev |
When to change | TBD |
Data Type | ulong |
Units | bytes |
Range | 512 to ULONG_MAX |
Default | 4,194,304 (4 MiB) |
Change | Dynamic |
Versions Affected | v0.8.0 and later |
zfs_send_corrupt_data
enables zfs send
to send corrupt data by
ignoring read and checksum errors. The corrupted or unreadable blocks are
replaced with the value 0x2f5baddb10c
(ZFS bad block).
zfs_send_corrupt_data | Notes |
---|---|
Tags | send |
When to change | When data corruption exists and an attempt to recover at least some data via zfs send is needed |
Data Type | boolean |
Range | 0=do not send corrupt data, 1=replace corrupt data with cookie |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.6.0 and later |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_deferred_free
, frees are no longer processed and must wait
for the next SPA sync.
The zfs_sync_pass_deferred_free
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
The zfs_sync_pass_deferred_free
pass must be greater than 1 to ensure that
regular blocks are not deferred.
zfs_sync_pass_deferred_free | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_dont_compress
, data block compression is no longer
processed and must wait for the next SPA sync.
The zfs_sync_pass_dont_compress
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
zfs_sync_pass_dont_compress | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | all |
The SPA sync process is performed in multiple passes. Once the pass number
reaches zfs_sync_pass_rewrite
, blocks can be split into gang blocks.
The zfs_sync_pass_rewrite
value is expected to be removed as a tunable
once the optimal value is determined during field testing.
zfs_sync_pass_rewrite | Notes |
---|---|
Tags | SPA |
When to change | Testing SPA sync process |
Data Type | int |
Units | SPA sync passes |
Range | 1 to INT_MAX |
Default | 2 |
Change | Dynamic |
Versions Affected | all |
zfs_sync_taskq_batch_pct
controls the number of threads used by the
DSL pool sync taskq, dp_sync_taskq
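Since the unit is a percentage of online CPUs, the resulting thread count can be estimated as in the Python sketch below. The rounding (integer division) and the minimum of one thread are assumptions of this example, not confirmed details of the taskq implementation, and the function name is hypothetical.

```python
def sync_taskq_threads(online_cpus, batch_pct=75):
    """Estimate the number of dp_sync_taskq threads.

    Assumes integer rounding down and at least one thread.
    """
    return max(1, online_cpus * batch_pct // 100)


for cpus in (1, 4, 16, 64):
    print(cpus, sync_taskq_threads(cpus))
```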
zfs_sync_taskq_batch_pct | Notes |
---|---|
Tags | SPA |
When to change | To adjust the number of dp_sync_taskq threads |
Data Type | int |
Units | percent of number of online CPUs |
Range | 1 to 100 |
Default | 75 |
Change | Prior to zfs module load |
Versions Affected | v0.7.0 and later |
Historical statistics for the last zfs_txg_history
txg commits are available
in /proc/spl/kstat/zfs/POOL_NAME/txgs
The work required to measure the txg commit (SPA statistics) is low. However, for debugging purposes, it can be useful to observe the SPA statistics.
zfs_txg_history | Notes |
---|---|
Tags | debug |
When to change | To observe details of SPA sync behavior. |
Data Type | int |
Units | lines |
Range | 0 to INT_MAX |
Default | 0 for version v0.6.0 to v0.7.6, 100 for version v0.8.0 |
Change | Dynamic |
Versions Affected | all |
The open txg is committed to the pool periodically (SPA sync) and
zfs_txg_timeout
represents the default target upper limit.
txg commits can occur more frequently and a rapid rate of txg commits often indicates a busy write workload, quota limits reached, or the free space is critically low.
txg commits can also take longer than zfs_txg_timeout
if the ZFS write throttle
is not properly tuned or the time to sync is otherwise delayed (eg slow device).
See also zfs_dirty_data_sync and zfs_txg_history.
zfs_txg_timeout | Notes |
---|---|
Tags | SPA, ZIO_scheduler |
When to change | To optimize the work done by txg commit relative to the pool requirements. See also section "ZFS I/O SCHEDULER" |
Data Type | int |
Units | seconds |
Range | 1 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | all |
To reduce IOPs, small, adjacent I/Os can be aggregated (coalesced) into a
large I/O.
For reads, aggregations occur across small adjacency gaps.
For writes, aggregation can occur at the ZFS or disk level.
zfs_vdev_aggregation_limit
is the upper bound on the size of the larger,
aggregated I/O.
zfs_vdev_aggregation_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | If the workload does not benefit from aggregation, the zfs_vdev_aggregation_limit can be reduced to avoid aggregation attempts |
Data Type | int |
Units | bytes |
Range | 0 to 131,072 (default) or 16,777,216 (if zpool large_blocks feature is enabled) |
Default | 131,072 (128 KiB) |
Change | Dynamic |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some
cases actually harmful. Thus it is disabled by setting
zfs_vdev_cache_size = 0.
zfs_vdev_cache_size
is the size of the vdev cache.
zfs_vdev_cache_size | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 0 (vdev cache is disabled) |
Change | Dynamic |
Verification | vdev cache statistics are available in the /proc/spl/kstat/zfs/vdev_cache_stats file |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some cases actually harmful. Thus it is disabled by setting the zfs_vdev_cache_size to zero. This related tunable is, by default, inoperative.
All read I/Os smaller than zfs_vdev_cache_max are turned into
(1 << zfs_vdev_cache_bshift) byte reads by the vdev cache. At most
zfs_vdev_cache_size bytes will be kept in each vdev's cache.
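The read inflation described above is a simple shift, illustrated by this Python sketch. The function is hypothetical, ignores offset alignment, and treats a zfs_vdev_cache_size of 0 as "cache disabled", which matches the default.

```python
def vdev_cache_read_size(io_size, cache_size=0, cache_max=16_384, cache_bshift=16):
    """Size of the read actually issued when the vdev cache is involved."""
    if cache_size == 0 or io_size >= cache_max:
        # Cache disabled, or the read is too large to be inflated.
        return io_size
    return 1 << cache_bshift


# With a (non-default) 10 MiB cache, a 4 KiB read becomes a 64 KiB read.
print(vdev_cache_read_size(4096, cache_size=10 << 20))
```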
zfs_vdev_cache_bshift | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | shift |
Range | 1 to INT_MAX |
Default | 16 (65,536 bytes) |
Change | Dynamic |
Versions Affected | all |
Note: with the current ZFS code, the vdev cache is not helpful and in some cases actually harmful. Thus it is disabled by setting the zfs_vdev_cache_size to zero. This related tunable is, by default, inoperative.
All read I/Os smaller than zfs_vdev_cache_max will be turned into
(1 << zfs_vdev_cache_bshift) byte reads by the vdev cache.
At most zfs_vdev_cache_size
bytes will be kept in each vdev's cache.
zfs_vdev_cache_max | Notes |
---|---|
Tags | vdev, vdev_cache |
When to change | Do not change |
Data Type | int |
Units | bytes |
Range | 512 to INT_MAX |
Default | 16,384 (16 KiB) |
Change | Dynamic |
Versions Affected | all |
The mirror read algorithm uses current load and an incremental weighting value
to determine the vdev to service a read operation. Lower values determine
the preferred vdev.
The weighting value is zfs_vdev_mirror_rotating_inc
for rotating media and
zfs_vdev_mirror_non_rotating_inc for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_inc | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | Increasing for mirrors with both rotating and nonrotating media more strongly favors the nonrotating media |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
The mirror read algorithm uses current load and an incremental weighting value
to determine the vdev to service a read operation. Lower values determine
the preferred vdev.
The weighting value is zfs_vdev_mirror_rotating_inc for rotating media and
zfs_vdev_mirror_non_rotating_inc
for nonrotating media.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_inc | Notes |
---|---|
Tags | vdev, mirror, SSD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 0 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For rotating media in a mirror, if the next I/O offset is within
zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is incremented by (zfs_vdev_mirror_rotating_seek_inc / 2).
Otherwise the weighting factor is increased by zfs_vdev_mirror_rotating_seek_inc.
This algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
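The seek weighting for rotating media can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names are hypothetical, "within" is taken to mean a strictly smaller distance, the division by 2 is treated as integer division, and the child selection simply picks the lowest sum of current load and seek penalty.

```python
def rotating_seek_penalty(last_offset, next_offset, seek_inc=5, seek_offset=1_048_576):
    """Weight added for a rotating mirror child, based on seek distance."""
    if abs(next_offset - last_offset) < seek_offset:
        return seek_inc // 2  # nearby I/O: half penalty (integer division assumed)
    return seek_inc  # distant I/O: full penalty


def pick_child(children, next_offset):
    """children: list of (name, current_load, last_offset); lowest weight wins."""
    return min(
        children,
        key=lambda c: c[1] + rotating_seek_penalty(c[2], next_offset),
    )[0]


# Both disks carry the same load, but sdb's head is already near the target.
disks = [("sda", 3, 0), ("sdb", 3, 10 << 20)]
print(pick_child(disks, (10 << 20) + 4096))
```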
zfs_vdev_mirror_rotating_seek_inc | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 5 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For rotating media in a mirror, if the next I/O offset is within
zfs_vdev_mirror_rotating_seek_offset
then the weighting factor is
incremented by (zfs_vdev_mirror_rotating_seek_inc / 2).
Otherwise the weighting factor is increased by zfs_vdev_mirror_rotating_seek_inc.
This algorithm prefers rotating media with lower seek distance.
Verify the rotational setting described by a block device in sysfs by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_rotating_seek_offset | Notes |
---|---|
Tags | vdev, mirror, HDD |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 1,048,576 (1 MiB) |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
For nonrotating media in a mirror, a seek penalty is applied as sequential I/Os can be aggregated into fewer operations, avoiding unnecessary per-command overhead, often boosting performance.
Verify the rotational setting described by a block device in SysFS by
observing /sys/block/DISK_NAME/queue/rotational
zfs_vdev_mirror_non_rotating_seek_inc | Notes |
---|---|
Tags | vdev, mirror, SSD |
When to change | TBD |
Data Type | int |
Units | scalar |
Range | 0 to INT_MAX |
Default | 1 |
Change | Dynamic |
Versions Affected | v0.7.0 and later |
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into a
large I/O.
For reads, aggregations occur across small adjacency gaps where
the gap is less than zfs_vdev_read_gap_limit
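Gap-based aggregation can be pictured with the Python sketch below. It is a simplified model, not the vdev queue code: I/Os are plain (offset, size) tuples, a gap equal to the limit is allowed to merge (an assumption about the boundary), and the combined size is capped by zfs_vdev_aggregation_limit.

```python
def aggregate_reads(ios, gap_limit=32_768, agg_limit=131_072):
    """Coalesce (offset, size) reads whose gaps fit within gap_limit."""
    merged = []
    for offset, size in sorted(ios):
        if merged:
            start, length = merged[-1]
            gap = offset - (start + length)
            new_len = offset + size - start
            if 0 <= gap <= gap_limit and new_len <= agg_limit:
                # Close enough: extend the previous I/O across the gap.
                merged[-1] = (start, new_len)
                continue
        merged.append((offset, size))
    return merged


# The first two reads are 4 KiB apart and merge; the third is too far away.
print(aggregate_reads([(0, 4096), (8192, 4096), (1_000_000, 4096)]))
```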
zfs_vdev_read_gap_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 32,768 (32 KiB) |
Change | Dynamic |
Versions Affected | all |
To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into a
large I/O.
For writes, aggregations occur across small adjacency gaps where
the gap is less than zfs_vdev_write_gap_limit
zfs_vdev_write_gap_limit | Notes |
---|---|
Tags | vdev, ZIO_scheduler |
When to change | TBD |
Data Type | int |
Units | bytes |
Range | 0 to INT_MAX |
Default | 4,096 (4 KiB) |
Change | Dynamic |
Versions Affected | all |
When the pool is imported, for whole disk vdevs, the block device I/O
scheduler is set to zfs_vdev_scheduler
.
The most common schedulers are: noop, cfq, bfq, an