
Broken L2ARC space accounting: 16.0E cache device #3114

Closed
jflandry opened this issue Feb 17, 2015 · 11 comments

@jflandry

I'm having an issue with some NFS servers: after the cache device fills up, the reported size jumps to 16 exabytes. If the cache device is removed and re-added, the correct size is shown.

logs                                                 -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1347095ABE09-part4    140K   160G      0      0    276    106
cache                                                -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1347095ABE0D-part4   2.41T   16.0E      0      9     28   237K
-----------------------------------------------  -----  -----  -----  -----  -----  -----
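For context on the 16.0E figure next to the cache device above: these values are unsigned 64-bit byte counts, so once the allocated-space counter has been over-charged past the device's real capacity, the derived free space wraps around to nearly 2^64 bytes, which prints as roughly 16 EiB. A minimal user-space sketch of that arithmetic (the ~200 GiB capacity is an assumption; only the 2.41T allocation mirrors the output above):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/*
	 * Illustrative numbers: a cache partition of ~200 GiB whose
	 * allocated-bytes counter has been over-charged to ~2.41 TiB.
	 */
	uint64_t space = 200ULL << 30;	/* assumed device capacity */
	uint64_t alloc = 2468ULL << 30;	/* over-counted allocation */
	uint64_t avail = space - alloc;	/* underflows and wraps */

	printf("alloc = %.2f TiB\n", (double)alloc / (1ULL << 40));
	printf("avail = %.1f EiB\n", (double)avail / (1ULL << 60));
	return (0);
}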

Running on 2.6.32-431.23.3.el6.x86_64, spl-0.6.3-53_ga3c1eb7, zfs-0.6.3-163_g9063f65.

These servers have 90 drives hanging off two Supermicro JBODs, 2x Xeon E5-2620, 64 GB RAM, and export NFS over IPoIB on QDR Mellanox InfiniBand.

For what it's worth, we also have a couple of OpenVZ hosts with ZFS and cache devices on partitions or whole SSDs. Only the NFS servers get the weird 16.0E cache devices, but they also get hammered a lot more.

I have found some references on the zfs-discuss and, just this morning, the OmniOS-discuss mailing lists; there seems to be an issue in the upstream ZFS code, and the problem is also present on FreeNAS.

Here are the mailing list posts:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/aMwHZrZa5J4/discussion
https://www.mail-archive.com/omnios-discuss@lists.omniti.com/msg03820.html

And a FreeNAS bug report and fix:

https://bugs.freenas.org/issues/6239
https://bugs.freenas.org/projects/freenas/repository/trueos/revisions/6ec48ebf5a1596ec7d2732e891fce3f116105ae5/diff/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c

One of the NFS servers is for internal use; I could easily test a patch if needed.

@prakashsurya
Member

Hm, it's not obvious why that fixes the problem. Quickly scanning the code, it appears to use the header's "b_asize" field when decrementing the size, so it makes sense to use the same value when incrementing the size accounting. I must be overlooking something..?
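A toy model of the asymmetry being discussed (paraphrased, not the actual arc.c code): if the evict/free path credits the cache vdev with each header's b_asize but the write path charges it a total computed some other way, the allocated-space counter drifts by the per-buffer difference and never settles back to zero. All sizes below are made up:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/*
 * Toy model of the L2ARC space ledger (not the real arc.c code).
 * "alloc" stands in for the cache vdev's allocated-bytes counter
 * that vdev_space_update() adjusts.
 */
static uint64_t alloc;

/* write path: charge the vdev for a whole write */
static void
charge_on_write(uint64_t bytes)
{
	alloc += bytes;
}

/* evict/free path: credit the vdev per buffer header */
static void
credit_on_free(uint64_t b_asize)
{
	alloc -= b_asize;
}

int
main(void)
{
	int i;

	for (i = 0; i < 1000; i++) {
		/* hypothetical buffer: charged at 128 KiB at write time... */
		charge_on_write(128 * 1024);
		/*
		 * ...but its header records 96 KiB, which is what gets
		 * credited back when the buffer is evicted.
		 */
		credit_on_free(96 * 1024);
	}

	/* the 32 KiB per-buffer mismatch never goes away */
	printf("residual alloc = %" PRIu64 " bytes\n", alloc);
	return (0);
}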

@jflandry
Author

@prakashsurya My only guess is that it could be related to the compressed L2ARC feature.

@jflandry
Author

@prakashsurya Some additional info: I've replaced the cache device on two of the four servers with a full-disk SSD, and so far everything is behaving normally. I'll keep monitoring for changes.

logs                                                -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1328095AA9D6-part4      0   160G      0      0      0      6
cache                                               -      -      -      -      -      -
  scsi-SATA_OCZ-VERTEX2_OCZ-DE9TLH9B1BOUU5VW     223G   793M      0     15  3.36K  1.21M
logs                                                -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1347095ABDD8-part4   128K   160G      0      0      2      2
cache                                               -      -      -      -      -      -
  scsi-SATA_OCZ-VERTEX2_OCZ-E6K1WP2N49Q19D0U     223G   787M      1     10   157K  1.04M

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue May 19, 2015
Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.

Obtained from:	FreeNAS (issue openzfs#6239)
MFC after:	2 weeks

fixes ZFS on Linux:

openzfs#3114
openzfs#3400
@kernelOfTruth
Contributor

@jflandry are you still using the L2ARC, and are you still affected?

Please give the following commit/patch a try: kernelOfTruth@d7e1fd0

@jflandry
Author

@kernelOfTruth That patch seems promising. We've switched the affected servers to whole-disk SSDs for now; we've yet to encounter the same issue with this configuration, but it may just be harder to trigger.

I'm afraid we can't reboot those servers at the moment, except maybe the one for internal use. I'll see if we can patch that one and generate some artificial load.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue May 21, 2015
Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.

Obtained from:	FreeNAS (issue openzfs#6239)
MFC after:	2 weeks

fixes ZFS on Linux:

openzfs#3114
openzfs#3400
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue May 21, 2015
adapted to openzfs#3216,

adaption to openzfs#2129 in
@ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr)

 		/*
 		 * Compression succeeded, we'll keep the cdata around for
 		 * writing and release it afterwards.
 		 */
+		if (rounded > csize) {
+			bzero((char *)cdata + csize, rounded - csize);
+			csize = rounded;
+		}

to

		/*
		 * Compression succeeded, we'll keep the cdata around for
		 * writing and release it afterwards.
		 */
		if (rounded > csize) {
			abd_zero_off(cdata, rounded - csize, csize);
			csize = rounded;
		}

ZFSonLinux:
openzfs#3114
openzfs#3400
openzfs#3433
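For context on the hunk being adapted here: after compression succeeds, csize is rounded up to a block-size boundary and the pad bytes are zeroed, presumably so the rounded buffer that actually gets written (and later read back and checksummed) has deterministic contents. A user-space analogue of that rounding and padding, with memset standing in for bzero/abd_zero_off and made-up sizes:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Round x up to the next multiple of align (a power of two), mirroring
 * what P2ROUNDUP does in the kernel sources.
 */
#define	ROUNDUP(x, align)	(((x) + (align) - 1) & ~((size_t)(align) - 1))

int
main(void)
{
	size_t payload = 1700;		/* hypothetical compressed size */
	size_t blocksize = 512;		/* hypothetical minimum block size */
	size_t csize = payload;
	size_t rounded = ROUNDUP(csize, blocksize);
	char *cdata = malloc(rounded);

	if (cdata == NULL)
		return (1);
	memset(cdata, 0xab, csize);	/* stand-in for compressed data */

	if (rounded > csize) {
		/* zero the tail so the padding is deterministic */
		memset(cdata + csize, 0, rounded - csize);
		csize = rounded;
	}

	printf("writing %zu bytes: %zu payload + %zu zero padding\n",
	    csize, payload, csize - payload);
	free(cdata);
	return (0);
}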
kernelOfTruth referenced this issue in kernelOfTruth/zfs May 22, 2015
adapted to abd_next (May 19th 2015)

 		/*
 		 * Compression succeeded, we'll keep the cdata around for
 		 * writing and release it afterwards.
 		 */
+		if (rounded > csize) {
+			bzero((char *)cdata + csize, rounded - csize);
+			csize = rounded;
+		}

to

		/*
		 * Compression succeeded, we'll keep the cdata around for
		 * writing and release it afterwards.
		 */
		if (rounded > csize) {
			abd_zero_off(cdata, rounded - csize, csize);
			csize = rounded;
		}

ZFSonLinux:
zfsonlinux#3114
zfsonlinux#3400
zfsonlinux#3433
@kernelOfTruth
Contributor

Bump.

Already somewhat tested patch: #3451
(from before the merge of the ARC mutex lock contention fixes)

Newly updated and rebased: #3491

@odoucet

odoucet commented Jun 21, 2015

I've just experienced this bug with a very large L2ARC (two 1.2 TB disks); it took several weeks to fill the whole disks.
Taking the devices offline did not seem to work (everything froze), so a reboot was the only option.
I can try to reproduce it with smaller drives on a second server if that helps.

To document the issue:
l2_io_error and l2_cksum_bad started growing very fast, and all access to the ZFS devices became very slow. I tried disabling the cache devices (zpool offline), but with no luck: the l2_size counter started decreasing slowly (I calculated that at that rate it would take 12 hours to reach l2_size=0).
Setting secondarycache=none on the whole pool did not fix it either. A hard reboot was necessary. I have kept the cache devices offline and secondarycache=none.

ARC stats when the bug triggered:

evict_l2_cached                 4    12935407407616
evict_l2_eligible               4    1298529380352
evict_l2_ineligible             4    2532005249024
l2_hits                         4    187427463
l2_misses                       4    971629639
l2_feeds                        4    3945185
l2_rw_clash                     4    12376
l2_read_bytes                   4    1375706628096
l2_write_bytes                  4    2988676811776
l2_writes_sent                  4    616145
l2_writes_done                  4    616145
l2_writes_error                 4    0
l2_writes_hdr_miss              4    2720
l2_evict_lock_retry             4    141
l2_evict_reading                4    0
l2_free_on_write                4    7529641
l2_cdata_free_on_write          4    357228
l2_abort_lowmem                 4    87411
l2_cksum_bad                    4    59004147
l2_io_error                     4    55709282
l2_size                         4    3807087760384
l2_asize                        4    2556936671744
l2_hdr_size                     4    48200121848
l2_compress_successes           4    158274896
l2_compress_zeros               4    0
l2_compress_failures            4    19776807

Now all values are at 0 (normal behaviour with no cache devices online).

@pzwahlen

@odoucet In my experience, removing a cache device is a blocking operation (from an I/O point of view, at least on zvols). When checked with strace, the 'zpool remove' command blocks on ioctl 0x5a0c. When my cache was growing toward infinity, this could take up to 5 minutes (with a reported L2 size of about 400G). Extrapolating, your 1.2 TB cache would need about 15 minutes to be removed.

Now that I run #3491, my cache doesn't grow that much anymore and removing a 60G cache device takes about 15 seconds (but is still a blocking operation).

@behlendorf
Contributor

@odoucet removing a cache device is a blocking operation. It doesn't strictly have to be, but that's how it was implemented.

Between the following two commits, this issue is believed to be resolved in master.

ef56b07 Account for ashift when gathering buffers to be written to l2arc device
d962d5d Illumos 5701 - zpool list reports incorrect "alloc" value for cache device
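For anyone mapping the first commit's subject onto the accounting discussed above: "accounting for ashift" means rounding each buffer's physical size up to the vdev's 1 << ashift allocation unit before charging it, so the charge matches what the device actually consumes and what is later credited back. A standalone sketch of that rounding (the helper name and the leaf-vdev behaviour are an approximation of what vdev_psize_to_asize() does in the real code):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/*
 * Standalone stand-in for the ashift rounding done in the kernel:
 * round a physical size up to the device's 1 << ashift allocation unit.
 */
static uint64_t
psize_to_asize(uint64_t psize, unsigned ashift)
{
	uint64_t unit = 1ULL << ashift;

	return ((psize + unit - 1) & ~(unit - 1));
}

int
main(void)
{
	/* e.g. a 2.5 KiB compressed buffer on a 4 KiB-sector (ashift=12) SSD */
	uint64_t psize = 2560;
	uint64_t asize = psize_to_asize(psize, 12);

	printf("psize = %" PRIu64 ", charged asize = %" PRIu64 "\n",
	    psize, asize);
	return (0);
}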

@odoucet

odoucet commented Jun 26, 2015

These two fixes were not merged into 0.6.4.2; was that expected?

@behlendorf
Contributor

@odoucet they were deemed too high-risk and intentionally skipped. We want to be very conservative regarding what we backport.
