Cannot trim L2ARC on SSD #9713
Why would it?
If you have a long-running server with regular reboot intervals and lose your L2ARC, you encounter performance issues over time. From Wikipedia:
Therefore, in this case a trim of the L2ARC does increase L2ARC performance, which is a good thing. :)
Persistent L2ARC would only minimize the performance hit over time. Data in the L2ARC would still be overwritten from time to time, which would result in the same issue as with a reboot, only more slowly.
Lack of trim is of course drive-dependent. While not datacenter caliber, the Samsung 840, 850, etc. are very popular and all have issues. Simply filling a disk with dd and removing the file also works in cases where trim isn't available. It's something I've had to do with every Samsung I've ever encountered. Anyone who routinely runs a scrub may not notice it.
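The dd workaround mentioned above can be sketched as follows. Device and mount-point names are placeholders, and neither command should be run against a disk holding data you care about:

```shell
# Option 1: manual whole-device TRIM with blkdiscard (util-linux),
# for drives that support discard. /dev/sdX is a placeholder --
# this destroys all data on the device.
blkdiscard /dev/sdX

# Option 2: for drives where TRIM isn't available, fill the free space
# and delete the file, giving the firmware a chance to garbage-collect.
dd if=/dev/zero of=/mnt/scratch/fillfile bs=1M || true  # stops when full
rm /mnt/scratch/fillfile
sync
```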
The server has a routine scrub configured, and I still encountered an issue with a slow L2ARC anyway. After removing the SSDs from the pool and manually reformatting and trimming the disks, performance went back to normal. In our case the performance was so bad that it was actually better with the L2ARC removed compared to our default setup. So I still believe that the issue we encountered can be traced back to the missing trim support for L2ARC.
Sorry, it's not clear to me how lack of trim could inhibit read performance.
Sadly we don't have a choice in the SSDs used. :( We also do not have autotrim activated, as it can affect performance negatively, as @kpande mentioned. It did not corrupt any data; the general performance was just reduced so much that the whole system became nearly unusable. The issue was fixed when the L2ARC was removed. After trimming the disks manually and re-adding the SSDs as L2ARC, our performance did not degrade anymore. The initial reason I thought of trim was that I knew of an issue with MacBooks, which did not support trim in older versions; this resulted in SSDs that slowed down over time. In the end it comes down to these questions: does ZFS ever perform a "delete" operation (which would definitely require a trim), or is it always "just" overwriting data? And do some (maybe cheap) SSDs require trim even when you "overwrite" data? From what I have read so far about trim, I don't believe this can be answered easily, as trim seems to behave differently for every combination of controller, manufacturer, firmware, ...
@lukeb2e today the L2ARC device is always overwritten; it does not get trimmed. This optimization was left as follow-up work to the initial trim feature, but it is something we'd like to eventually implement.
I really wonder why trim on L2ARC SSDs is required. During normal operation I would expect the SSD to be fully (or close to fully) utilized. As a result there is nothing to trim, since the data is always overwritten and never deleted. After a reboot (and without persistent L2ARC), trimming the L2ARC SSD would be helpful to warm up the L2ARC faster, as you can write faster. If I remember correctly, the L2ARC is fed at 30 MB/s by default. This should be doable even without a prior trim.
True, but the performance of SSDs can degrade quite a bit. The worst case I found right now is someone mentioning degradation of performance to 8 MB/s. Sadly we worked around the issue for now, so I can't tell you what our performance was before the workaround.
The default value for l2arc_write_max is, coincidentally, 8 MB/sec.
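For reference, this fill rate is tunable at runtime. A minimal sketch, assuming a Linux system with the ZFS module loaded (values are in bytes, and 64 MB is only an illustrative choice):

```shell
# Current L2ARC fill rate per feed interval (default 8388608 = 8 MB)
cat /sys/module/zfs/parameters/l2arc_write_max

# Extra headroom applied while the cache device is still cold
cat /sys/module/zfs/parameters/l2arc_write_boost

# Raise the fill rate to 64 MB -- tune per hardware, needs root
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
```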
I don't think it is as easy as that. After doing some more digging on read/write performance for SSDs without trim, I found this:
One can now argue whether the P128 should be used in server applications/hardware. But since this is just an example, I think it supports the general point about trim on L2ARC.
I can't speak for everyone, of course, but it is a reality I see. I had a drive today do exactly what has been reported. These will degrade to sub-1 MB/s if not conditioned. While I absolutely agree I wouldn't put critical VMs on them, they are still useful once you understand the problem, and Samsung has no idea how to make firmware.
etc. Drive itself is fine
Eventually the firmware will complete whatever cleanup it needs to and we're back at "normal"
What does that have to do with the L2ARC, which will be constantly full anyway (assuming the headers fit in memory)? If the fill rate is still too slow, then make a partition for some overprovisioning.
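The overprovisioning idea can be sketched like this, assuming a hypothetical /dev/sdX and a pool named tank; leaving a slice of the drive unpartitioned gives the firmware spare area to garbage-collect into:

```shell
# Partition only 80% of the SSD and leave the rest untouched as
# overprovisioning (device, pool name, and sizes are examples only).
parted --script /dev/sdX mklabel gpt mkpart l2arc 1MiB 80%

# Use the partition, not the whole disk, as the cache device.
zpool add tank cache /dev/sdX1
```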
The entire drive slows to a crawl.
For anyone interested in working on this, there may be some relatively low-hanging fruit to be had. As @richardelling mentioned, currently it's only overwritten in
Thought everyone tweaked those already? :) There are a few defaults that could use revisiting. They are sane but not necessarily ideal.
I can confirm this issue can occur even on underprovisioned (i.e. not the full capacity of the SSD dedicated to the L2ARC, some left unformatted as an empty partition) but low-end SSDs, which tend to write data slower and slower as capacity utilization increases. The issue is most obvious if some of the L2ARC SSD's free capacity is also used for the SLOG and the L2ARC partition on the same SSD has at some point become fully occupied: if utilization of the L2ARC device afterward decreases for some reason, write performance of sync writes to the SLOG is not restored to the level it was at before the L2ARC partition became fully utilized. I noticed this in one of my ZFS boxes with two 120 GB SSDs, with ~70% of their space used for the L2ARC and 1 GB of each used for the mirrored SLOG device, but I'm able to reproduce it in almost all of my ZFS setups where I'm using the same SSDs for the L2ARC and SLOG mirror, with the following steps:
So it would be best to have the write latencies of the SLOG restored automatically whenever utilization of the L2ARC goes down (BTW, this isn't such a rare case in my workloads), without destroying the L2ARC and discarding all of its data.
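The layout described above (two 120 GB SSDs, ~1 GB each for a mirrored SLOG, roughly 70% for L2ARC, remainder left unpartitioned) could be set up roughly like this; device names, the pool name, and exact sizes are placeholders:

```shell
# Carve a small SLOG partition and a large L2ARC partition out of
# each SSD, leaving the tail of the disk as overprovisioning.
for dev in /dev/sda /dev/sdb; do
    parted --script "$dev" mklabel gpt \
        mkpart slog 1MiB 1GiB \
        mkpart l2arc 1GiB 71%
done

# Mirror the SLOG partitions; stripe the cache partitions.
zpool add tank log mirror /dev/sda1 /dev/sdb1
zpool add tank cache /dev/sda2 /dev/sdb2
```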
Depends on the workload; most SSDs are still orders of magnitude faster than a platter drive, though.
This ^
Well, most probably I will wait for persistent L2ARC to go live and then provide a PR for this.
I don't understand the analogy you're trying to make. Does it really matter if the hardware is to blame or the workload when the end result is the same?
There are different cases above. Regardless of what some think about sharing devices between SLOG and L2ARC, it is a valid deployment. The slowness comes from lack of trim on some drives, and yes, that happens on all hardware. As in the example I gave above, it's a fairly serious degradation with some drives. One case where trim does not make sense is security, but you're not arguing that?
For those interested, I created #9789.
Sure I will, thank you again for all your hard work making ZFS more suitable for cheap tiered storage setups!
The l2arc_evict() function is responsible for evicting buffers which reference the next bytes of the L2ARC device to be overwritten. Teach this function to additionally TRIM that vdev space before it is overwritten if the device has been filled with data. This is done by vdev_trim_simple(), which trims by issuing a new type of TRIM, TRIM_TYPE_SIMPLE.

We also implement a "trim ahead" feature. It is a ZFS module parameter, expressed in % of the current write size, which trims ahead of the current write. A minimum of 64 MB will be trimmed. The default is 0, which disables TRIM on L2ARC as it can put significant stress on the underlying storage devices. To enable TRIM on L2ARC we set l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a pool, pool creation, or when the header of the device is invalid upon importing a pool or onlining a cache device. This is dependent on l2arc_trim_ahead > 0. TRIM of the whole device is done with TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t. We save the TRIM state for the whole device and the time of completion on-disk in the header, and restore these upon L2ARC rebuild so that zpool status -t can correctly report them. Whole-device TRIM is done asynchronously so that the user can export the pool or remove the cache device while it is trimming (i.e. if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by l2arc_rebuild_enabled = 0, because we may not want to lose all cached buffers (e.g. we may want to import the pool with l2arc_rebuild_enabled = 0 only once because of memory pressure). If persistent L2ARC has been disabled by setting the module parameter l2arc_rebuild_blocks_min_l2size to a value greater than the size of the cache device, then the whole device is trimmed upon creation or import of a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789
Closes #10224
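The "trim ahead" sizing rule stated in the commit message (a percentage of the current write size, floored at 64 MB) can be illustrated with a quick calculation. This is a sketch of the stated rule in shell arithmetic, not the actual ZFS code:

```shell
# Bytes to TRIM ahead of a write: write_size * l2arc_trim_ahead / 100,
# but never less than 64 MB, per the commit message above.
trim_ahead_bytes() {
    local write_size=$1 trim_ahead_pct=$2
    local bytes=$(( write_size * trim_ahead_pct / 100 ))
    local min=$(( 64 * 1024 * 1024 ))
    if [ "$bytes" -lt "$min" ]; then
        echo "$min"
    else
        echo "$bytes"
    fi
}

trim_ahead_bytes 8388608 100    # 8 MB write -> floored to 67108864 (64 MB)
trim_ahead_bytes 134217728 200  # 128 MB write at 200% -> 268435456
```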
I still cannot get L2ARC devices to trim, using 2.1.11 on kernel 5.15. Only the log device is trimmed, while the cache device shows as
Is an error returned when you run
Yes. Meanwhile, my understanding from the further explanation given in #14488 (comment), #9789 (comment), and the manual page / online documentation is that only autotrim works and there is no manual trimming of cache devices.
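Given that manual zpool trim skips cache devices, the knobs that should engage L2ARC trimming on releases containing the feature are sketched below; the pool name is an example, and whether trim-ahead is appropriate depends on the drive:

```shell
# Enable trim-ahead for L2ARC, as a percentage of the current write
# size (0, the default, disables L2ARC TRIM entirely); needs root.
echo 100 > /sys/module/zfs/parameters/l2arc_trim_ahead

# Autotrim covers the regular vdevs; cache devices are handled by
# the l2arc_trim_ahead mechanism above.
zpool set autotrim=on rpool

# Whole-device cache TRIM (on device add / pool create) shows up here.
zpool status -t rpool
```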
System information
Describe the problem you're observing
I believe that trim currently does not work on L2ARC devices.
Describe how to reproduce the problem
zpool trim rpool
zpool status -t

As you can see in the output, the L2ARC is reported as "untrimmed". The HDDs are correctly showing "trim unsupported". Trying to force a trim on the L2ARC device itself directly does not work either:
Taking the device offline beforehand and issuing the trim command does not change the output.
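The attempts described in this report correspond roughly to the following sequence (the cache device name is an example):

```shell
zpool trim rpool           # log vdev trims; cache device stays "untrimmed"
zpool trim rpool sda1      # targeting the cache device directly fails
zpool offline rpool sda1   # offlining it first...
zpool trim rpool sda1      # ...does not change the output
zpool online rpool sda1
zpool status -t rpool
```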
Include any warning/errors/backtraces from the system logs
There are no warnings/errors/backtraces in the logs. I can only find trim_start and trim_finish events for the devices that are actually getting trimmed. The cache devices (sda1/sdb1) are not mentioned in the logs.