[WIP] Adaptive Compression Refresh #9689
Closed
This PR rebases (and, if the rebase works out, supersedes) the currently stale PR:
Adaptive compression [was: auto compression]
#7560
which itself was the successor of:
auto compression
#5928
It also moves this PR to a "Draft", as it is not ready for prime time by a long shot.
I'm not currently able to make major changes to this PR, so please feel free to take it over and continue from this point onwards!
Note: the rebase might not have been smooth; I will check on the tests and make changes where required!
Motivation and Context
(From original: #5928)
Which compression algorithm is best for high throughput? The answer to this depends on the type of hardware in use.
If compression takes too long, the disk remains idle. If compression is faster than the writing speed of the disk, the CPU remains idle, as compression and writing to the disk happen in parallel.
Auto compression tries to keep both as busy as possible.
The disk load is observed through the vdev queue. If the queue is empty, a fast algorithm with a lower compression ratio such as lz4 is used; if the queue is full, gzip-[1-9] can spend more CPU time for a higher compression ratio.
The already existing zio_dva_throttle might conflict with the concept described above, so it is recommended to deactivate it.
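To make the queue-based selection concrete, here is a minimal sketch (with made-up names and thresholds, not the code in this PR) of how a vdev queue fill level could be mapped to a compression choice:

```c
#include <stdint.h>

/*
 * Illustrative only: map the observed vdev queue fill level to a
 * compression pick, so an idle disk gets a cheap algorithm and a
 * saturated disk gets a more expensive, higher-ratio one.
 */
enum compress_pick {
	PICK_LZ4,
	PICK_GZIP_1,
	PICK_GZIP_5,
	PICK_GZIP_9
};

static enum compress_pick
pick_compress(uint64_t queue_depth, uint64_t queue_max)
{
	/* Percentage of the vdev queue that is currently occupied. */
	uint64_t fill = (queue_max == 0) ? 0 : (queue_depth * 100 / queue_max);

	if (fill < 25)
		return (PICK_LZ4);	/* disk nearly idle: keep it fed */
	if (fill < 50)
		return (PICK_GZIP_1);
	if (fill < 75)
		return (PICK_GZIP_5);
	return (PICK_GZIP_9);		/* disk saturated: spend CPU on ratio */
}
```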
Description
(based on original from: #7560)
@RubenKelevra did some performance measurements and the results did not meet his expectations. So he tweaked the algorithm and excluded off (no compression) as an option, since the algorithm cannot determine the additional latency resulting from the larger data size when no compression is applied. He also added gzip-2 through gzip-9 as options for the algorithm to choose from.
The algorithm should adapt to different CPU load situations, since it measures the latency introduced over the last 1000 compression cycles (one cycle per block). If the system load changes over time, it may choose different compression algorithms.
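As a rough illustration of that measurement (hypothetical names, not the actual patch code), a rolling average over the last 1000 per-block compression latencies could look like this; a selector could then use the average when deciding whether to step the compression level up or down:

```c
#include <stdint.h>

#define	LAT_WINDOW	1000

/* Rolling window of per-block compression latencies (illustrative only). */
typedef struct lat_tracker {
	uint64_t samples[LAT_WINDOW];	/* per-block compression time, ns */
	uint64_t sum;			/* running sum of the window */
	uint32_t next;			/* next slot to overwrite */
	uint32_t count;			/* samples seen, capped at LAT_WINDOW */
} lat_tracker_t;

static void
lat_record(lat_tracker_t *lt, uint64_t ns)
{
	if (lt->count == LAT_WINDOW)
		lt->sum -= lt->samples[lt->next];	/* drop the oldest sample */
	else
		lt->count++;
	lt->samples[lt->next] = ns;
	lt->sum += ns;
	lt->next = (lt->next + 1) % LAT_WINDOW;
}

static uint64_t
lat_average(const lat_tracker_t *lt)
{
	return (lt->count == 0 ? 0 : lt->sum / lt->count);
}
```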
In light of zstd, the adaptive compression keyword might be a good fit for a future adaptive zstd mode that selects between different zstd compression levels and relies on the same selection mechanism.
This patch adds auto as a ZFS compression type.
Requested changes
If anyone (or the original author) wants to take this on, an incomplete list of TODOs:
How Has This Been Tested?
(based on original from: #7560)
@RubenKelevra ran a simple benchmark on a single HDD with different scenarios:
with and without load
with some common block sizes
with dva_throttle on and off
for xfs and ext4 on zvols
His corpus was /usr/lib from his system (5.9 G with 117,920 files in 17,777 folders), copied with cp -ax from an SSD to an HDD.
All ZFS settings were left at their defaults, except for checksum, which was set to edonr.
System specs:
Intel i3 2330M @ 2.20GHz (2 physical / 4 logical cores)
12 G DDR3 memory
2.5" 750 G Samsung HDD (as destination)
Intel SSD 320 (as source)
These test results might not be valid for a typical server application, but they should be a good measurement for an average notebook user, a ZFS use case where latency and throughput are important too.
The workload scenario was a synthetic, purely CPU/memory-bound user-space program with one thread per logical CPU core. The program used for this was BOINC, running seti@home work units.
The system load was measured 75 seconds into the copy (on runs that completed in less than 75 s the load value is somewhat inaccurate). Overall this value isn't hard proof that one test result is better than another; @RubenKelevra wanted to show that the system load doesn't skyrocket when using adaptive instead of lz4 or a fixed gzip level.
In #5928 the author explained that dva_throttle might interfere with this adaptive compression algorithm selection. @RubenKelevra can confirm this: it might result in a slightly lower compression ratio, but he could not find a distinct drop in I/O performance that would hinder inclusion into master. Furthermore, for all compression algorithms the performance impact was mixed with and without dva_throttle.
Overall, the performance numbers for adaptive compression often look pretty good. @RubenKelevra wasn't expecting performance better than plain LZ4 compression, but it performed better in some scenarios.
@RubenKelevra also points out that he created the filesystems on the zvols with default parameters. In his test the physical sector size is set by ZFS to the recordsize, so the filesystems are aware of this (ext4 at least) and might apply some automatic optimizations for those large physical sector sizes. This might lead to different results than in VMs, where the physical block size for the filesystems inside the VM is usually 512 or 4096 bytes.
adaptive compression stats (pdf)
adaptive compression stats zvol (pdf)
Types of changes
Branch overlapping changes (feature, compress values)
The patch has read-only backward compatibility by using the newly introduced SPA_FEATURE_COMPRESS_AUTO feature. The feature activation procedure is equivalent to the original author's other code branches.
Regarding the limited namespace of BP_GET_COMPRESS() (128 values): the first part of the zio_compress enum is for block pointer & dataset values, the second part is for dataset values only. This is an alternative suggestion to #3908.
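For illustration, a sketch of that split (names and values are made up here, not copied from the patch):

```c
/*
 * Illustrative only: BP_GET_COMPRESS() can store 128 distinct values in
 * a block pointer, so only the first part of the enum may appear on
 * disk.  Values at or above the boundary exist only as dataset property
 * settings and are resolved to a concrete on-disk algorithm before the
 * block is written.
 */
enum example_compress {
	/* Part 1: valid in block pointers and as dataset values (< 128). */
	EXAMPLE_COMPRESS_OFF = 0,
	EXAMPLE_COMPRESS_LZ4,
	EXAMPLE_COMPRESS_GZIP_1,
	/* ... */

	/* Part 2: dataset-only values (>= 128), never stored in a BP. */
	EXAMPLE_COMPRESS_AUTO = 128
};
```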
Checklist:
Signed-off-by