Cannot trim L2ARC on SSD #9713

Closed
lukeb2e opened this issue Dec 11, 2019 · 28 comments · Fixed by #9789
Labels
Type: Feature (Feature request or new feature), Type: Performance (Performance improvement or performance problem)

Comments

@lukeb2e

lukeb2e commented Dec 11, 2019

System information

Type                  Version/Name
Distribution Name     Debian (Proxmox VE 6.1)
Distribution Version  10
Linux Kernel          5.3.10-1-pve
Architecture          x86_64
ZFS Version           0.8.2-pve2
SPL Version           0.8.2-pve2

Describe the problem you're observing

I believe that TRIM currently does not work on L2ARC devices.

Describe how to reproduce the problem

  • Add an L2ARC on SSDs
  • Trim the pool (zpool trim rpool)
  • Check the trim status with zpool status -t
root@server:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 1 days 10:42:22 with 0 errors on Mon Dec  9 11:06:24 2019
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x5000cca22bdaa077-part1  ONLINE       0     0     0  (trim unsupported)
            wwn-0x5000cca22bd9d2e5-part1  ONLINE       0     0     0  (trim unsupported)
        logs
          mirror-1                        ONLINE       0     0     0
            wwn-0x500253850016023e-part2  ONLINE       0     0     0  (100% trimmed, completed at Sat 07 Dec 2019 04:48:15 PM CET)
            wwn-0x5002538c403f69a6-part2  ONLINE       0     0     0  (100% trimmed, completed at Sat 07 Dec 2019 04:48:15 PM CET)
        cache
          sda1                            ONLINE       0     0     0  (untrimmed)
          sdb1                            ONLINE       0     0     0  (untrimmed)

errors: No known data errors

As you can see in the output, the L2ARC devices are reported as untrimmed. The HDDs correctly show trim unsupported.

Trying to force a trim directly on the L2ARC device does not work either:

zpool trim rpool sda1
cannot trim 'sda1': device is in use as a cache

Taking the device offline beforehand and issuing the trim command does not change the output.

Include any warning/errors/backtraces from the system logs

There are no warnings/errors/backtraces in the logs. I can only find trim_start & trim_finish for the corresponding devices that are getting trimmed. The cache devices (sda1/sdb1) are not mentioned in the logs.

@scineram

Why would it?

@lukeb2e
Author

lukeb2e commented Dec 11, 2019

If you have a long-running server with regular reboot intervals and lose your L2ARC, you encounter performance issues over time.
This is caused by the missing trim for the cache devices.

From Wikipedia:

> SSD write performance is significantly impacted by the availability of free, programmable blocks. Previously written data blocks no longer in use can be reclaimed by TRIM; however, even with TRIM, fewer free blocks cause slower performance.

Therefore, in this case a trim of the L2ARC does increase L2ARC performance, which is a good thing. :)

@lukeb2e
Author

lukeb2e commented Dec 11, 2019

Persistent L2ARC would only minimize the performance hit over time. Data in the L2ARC would still be overwritten from time to time, which would result in the same issue as with a reboot, just more slowly.

@h1z1

h1z1 commented Dec 11, 2019

Lack of trim is of course drive dependent. While not datacenter caliber, the Samsung 840, 850, etc. are very popular and all have issues.

Simply filling a disk with dd and removing the file also works in cases where trim isn't available. It's something I've had to do with every Samsung I've ever encountered. Anyone who routinely runs a scrub may not notice it.
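
A minimal sketch of that fill-and-delete workaround, assuming the SSD-backed filesystem is mounted at a hypothetical /mnt/ssd (the device and mount point are placeholders, not from this thread):

dd if=/dev/zero of=/mnt/ssd/fill.delme bs=1M   # write zeros until the filesystem is full
sync                                           # make sure the data actually reaches the drive
rm /mnt/ssd/fill.delme                         # delete the filler to free the space again
sync

Whether this actually helps depends, as noted above, entirely on the drive's firmware.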

@lukeb2e
Author

lukeb2e commented Dec 12, 2019

The server has a routine scrub configured, but I still encountered an issue with a slow L2ARC anyway. After removing the SSDs from the pool and manually reformatting and trimming the disks, performance went back to normal.

In our case the performance was so bad that the system was actually faster with the L2ARC removed than with our default setup.

So I still believe that the issue we encountered can be traced back to the missing trim support for L2ARC.

@scineram

scineram commented Dec 12, 2019

Sorry, not clear to me how lack of trim could inhibit read performance.

behlendorf added the Type: Performance label on Dec 12, 2019
@lukeb2e
Author

lukeb2e commented Dec 12, 2019

Sadly, we don't have a choice in the SSDs used. :( We also do not have autotrim activated, as it can affect performance negatively, as @kpande mentioned.

It did not corrupt any data; the general performance was just reduced so much that the whole system became nearly unusable. The issue was fixed when the L2ARC was removed. After trimming the disks manually and re-adding the SSDs as L2ARC, our performance did not degrade anymore.

The initial reason I thought of trim was that I knew of an issue with MacBooks, which did not support trim in older versions; this resulted in SSDs that slowed down over time.

In the end it comes down to two questions: does ZFS ever perform a "delete" operation (which would definitely require a trim), or is it always "just" overwriting data? And do some (maybe cheap) SSDs require trim even if you "overwrite" data? From what I have read so far about trim, I don't believe this can be answered easily, as trim seems to behave differently for every combination of controller, manufacturer, firmware, ...

@behlendorf
Contributor

@lukeb2e today the L2ARC device is always overwritten; it does not get trimmed. This optimization was left as follow-up work to the initial trim feature, but it is something we'd like to eventually implement.

behlendorf added the Type: Feature label on Dec 12, 2019
@ronnyegner

ronnyegner commented Dec 14, 2019

I really wonder why TRIM on L2ARC SSDs is required. During normal operation I would expect the SSD to be fully (or close to fully) utilized. As a result there is nothing to trim, as the data is always overwritten and never deleted.

After a reboot (and without persistent L2ARC), trimming the L2ARC SSD would be helpful to warm up the L2ARC faster, since you can write faster to a trimmed device. If I remember correctly, the L2ARC is fed at 30 MB/s by default. This should be doable even without a prior trim.

@lukeb2e
Author

lukeb2e commented Dec 17, 2019

True, but the performance of SSDs can degrade quite a bit. The worst case I found so far is someone mentioning degradation of performance to 8MB/s. Sadly we have already worked around the issue, so I can't tell you what our performance was before the workaround.

@richardelling
Contributor

Default value for l2arc_write_max is, coincidentally, 8MB/sec.
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#l2arc_write_max
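
For reference, on a Linux system the current value can be read (and changed) through the module parameter under /sys; a quick check might look like this:

cat /sys/module/zfs/parameters/l2arc_write_max   # value in bytes; 8388608 = 8 MiB, the default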

@lukeb2e
Author

lukeb2e commented Dec 17, 2019

I don't think it is as easy as that.

After doing some more digging on read/write performance for SSDs without trim, I found this:
https://www.bit-tech.net/reviews/tech/storage/windows-7-ssd-performance-and-trim/13/

> It’s random write speeds where there’s real cause for concern though, with the P128 in its heavily used condition recording just 1.11MB/s random write speed alongside write latencies that peaked at a whopping 1410ms.

One can now argue whether the P128 should be used in "server applications/hardware". But since this is just an example, I think it supports the general point about trim on L2ARC.

@h1z1

h1z1 commented Dec 18, 2019

Can't speak for everyone, of course, but it is a reality I see. I had a drive today do exactly what has been reported. These will degrade to sub-1 MB/s if not conditioned. While I absolutely agree I wouldn't put critical VMs on them, they are still useful once you understand the problem and that Samsung has nfi how to make firmware.

# dd if=/dev/zero of=test.delme bs=4096 &                                       
[1] 2994                                                                        
# while kill -USR1 $! ; do sleep 1 ; done                                       
323716+0 records in                                                             
323716+0 records out                                                            
1325940736 bytes (1.3 GB) copied, 10.1787 s, 130 MB/s                           
408706+0 records in                                                             
408706+0 records out                                                            
1674059776 bytes (1.7 GB) copied, 13.7129 s, 122 MB/s                           
525787+0 records in                                                             
525787+0 records out                                                            
2153623552 bytes (2.2 GB) copied, 18.2957 s, 118 MB/s                           
606019+0 records in                                                             
606019+0 records out                                                            
2482253824 bytes (2.5 GB) copied, 18.5913 s, 134 MB/s                           
^C                                                                              
# fg                                                   
dd if=/dev/zero of=test.delme bs=4096                                           
^C^C                                                                            
645825+0 records in                                                             
645825+0 records out                                                            
2645299200 bytes (2.6 GB) copied, 22.1985 s, 119 MB/s                           
# zpool trim testvol2                                  
# zpool status testvol2 -t                             
  pool: testvol2                                                                
 state: ONLINE                                                                  
status: Some supported features are not enabled on the pool. The pool can       
  still be used, but some features are unavailable.                             
action: Enable all features using 'zpool upgrade'. Once this is done,           
  the pool may no longer be accessible by software that does not support        
  the features. See zpool-features(5) for details.                              
  scan: scrub repaired 0B in 0 days 00:04:11 with 0 errors on Sat Nov 16 04:30:01 2019
config:                                                                         
                                                                                
  NAME                      STATE     READ WRITE CKSUM                          
  testvol2                  ONLINE       0     0     0                          
    wwn-0x50025385a00XXXXX  ONLINE       0     0     0  (100% trimmed, completed at Wed 18 Dec 2019 02:42:21 AM EST)
                                                                                
errors: No known data errors                                                    
# zpool status testvol2 -t                             
  pool: testvol2                                                                
 state: ONLINE                                                                  
status: Some supported features are not enabled on the pool. The pool can       
  still be used, but some features are unavailable.                             
action: Enable all features using 'zpool upgrade'. Once this is done,       
#
# dd if=/dev/zero of=test.delme bs=4096 &              
[1] 11303                                                                       
# while kill -USR1 $! ; do sleep 1 ; done              
244673+0 records in                                                             
244673+0 records out                                                            
1002180608 bytes (1.0 GB) copied, 3.31499 s, 302 MB/s                           
368957+0 records in                                                             
368957+0 records out                                                            
1511247872 bytes (1.5 GB) copied, 5.02608 s, 301 MB/s                           
411965+0 records in                                                             
411964+0 records out                                                            
1687404544 bytes (1.7 GB) copied, 5.18055 s, 326 MB/s                           
494398+0 records in                                                             
494398+0 records out                                                            
2025054208 bytes (2.0 GB) copied, 6.73668 s, 301 MB/s                           
594850+0 records in                                                             
594850+0 records out                                                            
2436505600 bytes (2.4 GB) copied, 8.07323 s, 302 MB/s                           
625569+0 records in                                                             
625569+0 records out                                                            
2562330624 bytes (2.6 GB) copied, 8.18257 s, 313 MB/s                           
695330+0 records in                                                             
695330+0 records out                                                            
2848071680 bytes (2.8 GB) copied, 9.40134 s, 303 MB/s                           
795810+0 records in                                                             
795810+0 records out                                                            
3259637760 bytes (3.3 GB) copied, 10.7216 s, 304 MB/s                           
920545+0 records in                                                             
920545+0 records out                                                            
3770552320 bytes (3.8 GB) copied, 11.1921 s, 337 MB/s                           
921867+0 records in                                                             
921867+0 records out                                                            
3775967232 bytes (3.8 GB) copied, 12.4199 s, 304 MB/s                           
1006722+0 records in                                                            
1006722+0 records out                                                           
4123533312 bytes (4.1 GB) copied, 13.5586 s, 304 MB/s                           
1091554+0 records in                                                            
1091554+0 records out                                                           
4471005184 bytes (4.5 GB) copied, 14.6854 s, 304 MB/s                           

etc. The drive itself is fine:

# smartctl -a /dev/sdb
smartctl 6.2 2017-02-27 r4394 [x86_64-linux-4.20.0+] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 PRO Series
Serial Number:    S1ANNSXXXXXXXXX
LU WWN Device Id: 5 002538 5a00XXXXX
Firmware Version: DXM05B0Q
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec 18 02:59:55 2019 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  15) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       48575
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       147
177 Wear_Leveling_Count     0x0013   090   090   000    Pre-fail  Always       -       199
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   099   099   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   073   051   000    Old_age   Always       -       23
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       98053916085

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Eventually the firmware will complete whatever cleanup it needs to and we're back at "normal"

 while kill -USR1 $! ; do sleep 1 ; done
414301+0 records in
414300+0 records out
1696972800 bytes (1.7 GB) copied, 3.53948 s, 479 MB/s
533437+0 records in
533436+0 records out
2184953856 bytes (2.2 GB) copied, 4.53862 s, 481 MB/s
650397+0 records in
650396+0 records out
2664022016 bytes (2.7 GB) copied, 5.53923 s, 481 MB/s
766301+0 records in
766300+0 records out
3138764800 bytes (3.1 GB) copied, 6.53996 s, 480 MB/s
882257+0 records in
882257+0 records out
3613724672 bytes (3.6 GB) copied, 7.54068 s, 479 MB/s
996605+0 records in
996604+0 records out
4082089984 bytes (4.1 GB) copied, 8.54161 s, 478 MB/s
1111412+0 records in
1111411+0 records out
4552339456 bytes (4.6 GB) copied, 9.54211 s, 477 MB/s
1226173+0 records in
1226172+0 records out
5022400512 bytes (5.0 GB) copied, 10.5429 s, 476 MB/s
1339965+0 records in
1339964+0 records out
5488492544 bytes (5.5 GB) copied, 11.5437 s, 475 MB/s
1452893+0 records in
1452892+0 records out
5951045632 bytes (6.0 GB) copied, 12.5442 s, 474 MB/s
1567453+0 records in
1567452+0 records out
6420283392 bytes (6.4 GB) copied, 13.5451 s, 474 MB/s
^C
# fg
dd if=/dev/sdb of=/dev/null bs=4096
^C1824029+0 records in
1824028+0 records out
7471218688 bytes (7.5 GB) copied, 15.7939 s, 473 MB/s

@scineram

What does that have to do with the L2ARC, which will be constantly full anyway (assuming the headers fit in memory)? If the fill rate is still too slow, then make a partition for some overprovisioning.
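
A rough sketch of that overprovisioning idea, assuming a hypothetical empty SSD at /dev/sdX (the device name, partition size, and pool name are placeholders): partition only part of the drive for the cache and leave the rest unallocated so the controller always has spare erase blocks.

# WARNING: destructive to /dev/sdX.
blkdiscard /dev/sdX                        # start from a fully discarded drive
sgdisk --zap-all /dev/sdX                  # write a fresh GPT label
sgdisk -n 1:0:+100G -t 1:bf01 /dev/sdX     # use ~100G for L2ARC, leave the remainder unpartitioned
zpool add rpool cache /dev/sdX1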

@h1z1

h1z1 commented Dec 18, 2019

The entire drive slows to a crawl.

@behlendorf
Contributor

For anyone interested in working on this, there may be some relatively low-hanging fruit to be had. The l2arc_evict() function is responsible for evicting headers which reference the next N bytes of the L2ARC device to be overwritten. If this function were updated to additionally TRIM that vdev space before it's overwritten, that may help performance.

As @richardelling mentioned, currently the device is only overwritten in l2arc_write_max (8M) chunks. This default value hasn't changed since at least 2008; since today's SSDs are so much more capable than a decade ago, I'd be surprised if increasing the default wasn't beneficial. It really should be at least as large as the maximum block size (16M). 64M would retain the original scaling factor of 64 times the average block size, now that 1M blocks are very common. It would be interesting to do a scaling study with l2arc_evict() trimming ahead.
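
For anyone experimenting along these lines, the write rate can be changed at runtime through the module parameter (a sketch only; 64 MiB is used here just because it matches the scaling factor discussed above):

echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_max   # value in bytes, takes effect immediately

# To make the change persistent across reboots (path may vary by distribution):
echo "options zfs l2arc_write_max=67108864" >> /etc/modprobe.d/zfs.conf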

@h1z1

h1z1 commented Dec 18, 2019

Thought everyone tweaked those already? :) There are a few defaults that could use revisiting. They are sane but not necessarily ideal.

@Vlad1mir-D

Vlad1mir-D commented Dec 19, 2019

I can confirm that this issue can occur even on underprovisioned SSDs (i.e. not the full capacity of the SSD dedicated to the L2ARC, with some left unformatted as an empty partition), if they are low-end SSDs that write data more and more slowly as capacity utilization increases.

The issue is most obvious when some of the L2ARC SSD's free capacity is also used for the SLOG and the L2ARC partition on the same SSD has at some point become fully occupied: even if the utilization of the L2ARC device later decreases, the write performance of sync writes to the SLOG is not restored to the level it was at before the L2ARC partition became fully utilized.

I noticed this on one of my ZFS boxes with two 120GB SSDs, of which ~70% of the space is used for the L2ARC and 1GB of each for the mirrored SLOG device, but I'm able to reproduce it in almost all of my ZFS setups where the same SSDs are used for the L2ARC and the SLOG mirror, with the following steps:

  1. Run openssl enc -aes-128-ecb -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | dd iflag=fullblock of=/dev/sdb bs=2M oflag=direct, where /dev/sdb is one of my SLOG-backed ZVOLs, to put a highly random sync-write workload on it. (I'm using openssl instead of reading from /dev/urandom to make sure the random stream outperforms the write speed of the SLOG.)
  2. Run iostat to monitor write latencies for some time.
  3. Start some IO-intensive read workload on ZFS/ZVOL devices backed by an L2ARC on the same SSDs used for the SLOG in step 1.
  4. Watch the utilization of the L2ARC grow and the write latencies of the dd from step 1 increase as well.
  5. Wait until the utilization of the L2ARC is nearly 100% and note that the write latencies are much worse than at step 2.
  6. From this point on it doesn't matter what I do: the write latencies can only be restored if I destroy my L2ARC and then run blkdiscard on the SSD partitions dedicated to the L2ARC.

So it would be best to have the SLOG write latencies restored automatically whenever the utilization of the L2ARC goes down (which, by the way, is not such a rare case in my workloads), without destroying the L2ARC and discarding all of its data.
In light of #9582 this issue becomes more crucial: previously we lost the L2ARC on any reboot anyway, so I have a 'destroy L2ARC, discard partition data, create new L2ARC' procedure executed on each reboot (roughly as sketched below), but once persistent L2ARC goes live it will no longer be acceptable for me to trade write performance for read performance using this procedure.
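
For completeness, that reboot-time procedure might look roughly like the following, with mypool, /dev/sdb2 and /dev/sdc2 standing in for the real pool and L2ARC partitions (all placeholders, not taken from this thread):

zpool remove mypool /dev/sdb2 /dev/sdc2      # drop the cache vdevs from the pool
blkdiscard /dev/sdb2                         # discard the underlying partitions so the
blkdiscard /dev/sdc2                         # firmware gets its free blocks back
zpool add mypool cache /dev/sdb2 /dev/sdc2   # re-add them as a fresh (empty) L2ARC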

@h1z1

h1z1 commented Dec 19, 2019

Depends on the workload; most SSDs are still orders of magnitude faster than a platter drive though.

@Vlad1mir-D

> why are you sharing a device with L2ARC and SLOG? that runs contradictory to the purpose of a SLOG device, low latency I/O.

> Depends on the workload; most SSDs are still orders of magnitude faster than a platter drive though.

This ^

@Vlad1mir-D

Well, most probably I will wait for persistent L2ARC to land and then provide a PR for this.

@h1z1

h1z1 commented Dec 22, 2019

> well, no, latency introduced by sharing the device is not workload dependent, it is hardware dependent. optane might (might) not exhibit the problem, but all drives do.

I don't understand the analogy you're trying to make. Does it really matter whether the hardware is to blame or the workload, when the end result is the same?

> you're using a setup that is explicitly mentioned as a bad idea in documentation everywhere and you want others to put in effort to make sure you can continue using bad hardware in subpar configurations. please, can't you just fix your setup?

There are different cases above. Regardless of what some think about sharing a device between SLOG and L2ARC, it is a valid deployment. The slowness comes from lack of trim on some drives, and yes, that too happens on all hardware. As the example I gave above shows, it is a fairly serious degradation with some drives.

One case where trim does not make sense is security, but you're not arguing that?

gamanakis mentioned this issue Dec 30, 2019
@gamanakis
Contributor

For those interested I created #9789.
@Vlad1mir-D you already have the testing setup. Would you mind giving this a try?

@Vlad1mir-D

> For those interested I created #9789.
> @Vlad1mir-D you already have the testing setup. Would you mind giving this a try?

Sure I will; thank you again for all your hard work making ZFS more suitable for cheap tiered-storage setups!

behlendorf pushed a commit that referenced this issue Jun 9, 2020
The l2arc_evict() function is responsible for evicting buffers which reference the next bytes of the L2ARC device to be overwritten. Teach this function to additionally TRIM that vdev space before it is overwritten if the device has been filled with data. This is done by vdev_trim_simple(), which trims by issuing a new type of TRIM, TRIM_TYPE_SIMPLE.

We also implement a "Trim Ahead" feature. It is a ZFS module parameter, expressed in % of the current write size, and it trims ahead of the current write size. A minimum of 64MB will be trimmed. The default is 0, which disables TRIM on L2ARC as it can put significant stress on the underlying storage devices. To enable TRIM on L2ARC we set l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a pool, pool creation, or when the header of the device is invalid upon importing a pool or onlining a cache device. This is dependent on l2arc_trim_ahead > 0. TRIM of the whole device is done with TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t. We save the TRIM state for the whole device and the time of completion on-disk in the header, and restore these upon L2ARC rebuild so that zpool status -t can correctly report them. Whole-device TRIM is done asynchronously so that the user can export the pool or remove the cache device while it is trimming (i.e. if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by l2arc_rebuild_enabled = 0, because we may not want to lose all cached buffers (e.g. we may want to import the pool with l2arc_rebuild_enabled = 0 only once because of memory pressure). If persistent L2ARC has been disabled by setting the module parameter l2arc_rebuild_blocks_min_l2size to a value greater than the size of the cache device, then the whole device is trimmed upon creation or import of a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789
Closes #10224
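
With this change, TRIM on L2ARC is opt-in via the new module parameter. Based on the commit message above, enabling it on a running system might look like the following (100, i.e. trimming ahead by 100% of the current write size, is only an example value):

echo 100 > /sys/module/zfs/parameters/l2arc_trim_ahead   # 0 (the default) leaves L2ARC TRIM disabled

# Persistent variant, applied when the zfs module is loaded:
echo "options zfs l2arc_trim_ahead=100" >> /etc/modprobe.d/zfs.conf
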
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this issue Jun 10, 2020
lundman referenced this issue in openzfsonosx/openzfs Jun 12, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021
@mailinglists35

I still cannot get L2ARC devices to trim, using 2.1.11 on kernel 5.15. Only the log device is trimmed, while the cache device shows as untrimmed.

@mailinglists35

$ zpool status -t mypool
  pool: mypool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 21.6G in 00:07:39 with 0 errors on Thu May 25 09:08:04 2023
config:

        NAME            STATE     READ WRITE CKSUM
        mypool            DEGRADED     0     0     0
          mirror-0      DEGRADED     0     0     0
            hgst        ONLINE       0     0     0  (trim unsupported)
            enterprise  OFFLINE      0     0     0  (trim unsupported)
            iron        ONLINE       0     0     0  (trim unsupported)
        logs
          zlog.mypool     ONLINE       0     0     0  (100% trimmed, completed at Wed 07 Jun 2023 10:09:34 PM EEST)
        cache
          zcache.mypool   ONLINE       0     0     0  (untrimmed)

errors: No known data errors

@behlendorf
Contributor

Is an error returned when you run zpool trim mypool zcache.mypool?

@mailinglists35

yes, cannot trim 'zcache.mypool': device is in use as a cache

Meanwhile, my understanding from the further explanation given in #14488 (comment), #9789 (comment) and the manual page / online documentation is that only the automatic trimming controlled by l2arc_trim_ahead works, and there is no manual trimming of cache devices (see the sketch below).
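
Based on the commit message quoted earlier, a plausible way to get the cache device trimmed, and to have it show up in zpool status -t (which only reports the whole-device TRIM), is to enable l2arc_trim_ahead and then re-add the cache vdev. The names zcache.mypool and mypool come from the output above; the steps themselves are a sketch, not a verified procedure:

echo 100 > /sys/module/zfs/parameters/l2arc_trim_ahead   # enable L2ARC trimming; `zpool trim` does not apply to cache vdevs

# Re-adding the cache device triggers the whole-device TRIM that
# `zpool status -t` can report.
zpool remove mypool zcache.mypool
zpool add mypool cache zcache.mypool
zpool status -t mypool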
