
Broken L2ARC space accounting: 16.0E cache device #3114

Closed
jflandry opened this issue Feb 17, 2015 · 11 comments

@jflandry

I'm having an issue with some NFS servers: after the cache device fills up, the reported size jumps to 16 exabytes. If the cache device is removed and re-added, the correct size is shown.

logs                                                 -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1347095ABE09-part4    140K   160G      0      0    276    106
cache                                                -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1347095ABE0D-part4   2.41T   16.0E      0      9     28   237K
-----------------------------------------------  -----  -----  -----  -----  -----  -----
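For context on the 16.0E figure next to the cache device above: these values are unsigned 64-bit byte counts, so once the allocated-space counter has been over-charged past the device's real capacity, the derived free space wraps around to nearly 2^64 bytes, which prints as roughly 16 EiB. A minimal user-space sketch of that arithmetic (the ~200 GiB capacity is an assumption; only the 2.41T allocation mirrors the output above):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/*
	 * Illustrative numbers: a cache partition of ~200 GiB whose
	 * allocated-bytes counter has been over-charged to ~2.41 TiB.
	 */
	uint64_t space = 200ULL << 30;	/* assumed device capacity */
	uint64_t alloc = 2468ULL << 30;	/* over-counted allocation */
	uint64_t avail = space - alloc;	/* underflows and wraps */

	printf("alloc = %.2f TiB\n", (double)alloc / (1ULL << 40));
	printf("avail = %.1f EiB\n", (double)avail / (1ULL << 60));
	return (0);
}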

Running on 2.6.32-431.23.3.el6.x86_64, spl-0.6.3-53_ga3c1eb7, zfs-0.6.3-163_g9063f65.

These servers have 90 drives hanging off two Supermicro JBODs, 2x Xeon E5-2620, 64 GB RAM, and export NFS over IPoIB on QDR Mellanox InfiniBand.

For what it's worth, we also have a couple of OpenVZ hosts with ZFS and cache devices on partitions or whole SSDs. Only the NFS servers get the weird 16.0E cache devices, but they also get hammered a lot more.

I have found some references on the zfs-discuss and, just this morning, the OmniOS-discuss mailing lists; there seems to be an issue in the upstream ZFS code, and the problem is also present on FreeNAS.

Here are the mailing list posts:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/aMwHZrZa5J4/discussion
https://www.mail-archive.com/omnios-discuss@lists.omniti.com/msg03820.html

And a FreeNAS bug report and fix:

https://bugs.freenas.org/issues/6239
https://bugs.freenas.org/projects/freenas/repository/trueos/revisions/6ec48ebf5a1596ec7d2732e891fce3f116105ae5/diff/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c

One of the NFS servers is for internal use; I could easily test a patch if needed.

@prakashsurya
Member

Hm, it's not obvious why that fixes the problem. Quickly scanning the code, it appears to use the header's "b_asize" field when decrementing the size, so it makes sense to use the same value when incrementing the size accounting. I must be overlooking something..?
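A toy model of the asymmetry being discussed (paraphrased, not the actual arc.c code): if the evict/free path credits the cache vdev with each header's b_asize but the write path charges it a total computed some other way, the allocated-space counter drifts by the per-buffer difference and never settles back to zero. All sizes below are made up:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/*
 * Toy model of the L2ARC space ledger (not the real arc.c code).
 * "alloc" stands in for the cache vdev's allocated-bytes counter
 * that vdev_space_update() adjusts.
 */
static uint64_t alloc;

/* write path: charge the vdev for a whole write */
static void
charge_on_write(uint64_t bytes)
{
	alloc += bytes;
}

/* evict/free path: credit the vdev per buffer header */
static void
credit_on_free(uint64_t b_asize)
{
	alloc -= b_asize;
}

int
main(void)
{
	int i;

	for (i = 0; i < 1000; i++) {
		/* hypothetical buffer: charged at 128 KiB at write time... */
		charge_on_write(128 * 1024);
		/*
		 * ...but its header records 96 KiB, which is what gets
		 * credited back when the buffer is evicted.
		 */
		credit_on_free(96 * 1024);
	}

	/* the 32 KiB per-buffer mismatch never goes away */
	printf("residual alloc = %" PRIu64 " bytes\n", alloc);
	return (0);
}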

@jflandry
Author

@prakashsurya My only guess is that it could be related to the compressed L2ARC feature.

@jflandry
Author

@prakashsurya Some additional info: I've replaced the cache device on two of the four servers with a full-disk SSD, and so far everything is behaving normally. I'll keep monitoring for changes.

logs                                                -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1328095AA9D6-part4      0   160G      0      0      0      6
cache                                               -      -      -      -      -      -
  scsi-SATA_OCZ-VERTEX2_OCZ-DE9TLH9B1BOUU5VW     223G   793M      0     15  3.36K  1.21M
logs                                                -      -      -      -      -      -
  scsi-SATA_Crucial_CT240M5_1347095ABDD8-part4   128K   160G      0      0      2      2
cache                                               -      -      -      -      -      -
  scsi-SATA_OCZ-VERTEX2_OCZ-E6K1WP2N49Q19D0U     223G   787M      1     10   157K  1.04M

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue May 19, 2015
Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.

Obtained from:	FreeNAS (issue openzfs#6239)
MFC after:	2 weeks

fixes ZFS on Linux:

openzfs#3114
openzfs#3400
@kernelOfTruth
Contributor

@jflandry are you still using the L2ARC, and are you still affected?

Please give the following commit/patch a try: kernelOfTruth@d7e1fd0

@jflandry
Author

@kernelOfTruth That patch seems promising. We've switched the affected servers to whole-disk SSDs for now; we've yet to encounter the same issue with this configuration, but it may just be harder to trigger.

I'm afraid we can't reboot those servers at the moment, except maybe the one for internal use. I'll see if we can patch that one and generate some artificial load.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue May 21, 2015
Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.

Obtained from:	FreeNAS (issue openzfs#6239)
MFC after:	2 weeks

fixes ZFS on Linux:

openzfs#3114
openzfs#3400
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue May 21, 2015
adapted to openzfs#3216,

adaption to openzfs#2129 in
@ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr)

 		/*
 		 * Compression succeeded, we'll keep the cdata around for
 		 * writing and release it afterwards.
 		 */
+		if (rounded > csize) {
+			bzero((char *)cdata + csize, rounded - csize);
+			csize = rounded;
+		}

to

		/*
		 * Compression succeeded, we'll keep the cdata around for
		 * writing and release it afterwards.
		 */
		if (rounded > csize) {
			abd_zero_off(cdata, rounded - csize, csize);
			csize = rounded;
		}

ZFSonLinux:
openzfs#3114
openzfs#3400
openzfs#3433
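For context on the hunk being adapted here: after compression succeeds, csize is rounded up to a block-size boundary and the pad bytes are zeroed, presumably so the rounded buffer that actually gets written (and later read back and checksummed) has deterministic contents. A user-space analogue of that rounding and padding, with memset standing in for bzero/abd_zero_off and made-up sizes:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Round x up to the next multiple of align (a power of two), mirroring
 * what P2ROUNDUP does in the kernel sources.
 */
#define	ROUNDUP(x, align)	(((x) + (align) - 1) & ~((size_t)(align) - 1))

int
main(void)
{
	size_t payload = 1700;		/* hypothetical compressed size */
	size_t blocksize = 512;		/* hypothetical minimum block size */
	size_t csize = payload;
	size_t rounded = ROUNDUP(csize, blocksize);
	char *cdata = malloc(rounded);

	if (cdata == NULL)
		return (1);
	memset(cdata, 0xab, csize);	/* stand-in for compressed data */

	if (rounded > csize) {
		/* zero the tail so the padding is deterministic */
		memset(cdata + csize, 0, rounded - csize);
		csize = rounded;
	}

	printf("writing %zu bytes: %zu payload + %zu zero padding\n",
	    csize, payload, csize - payload);
	free(cdata);
	return (0);
}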
kernelOfTruth referenced this issue in kernelOfTruth/zfs May 22, 2015
adapted to abd_next (May 19th 2015)

 		/*
 		 * Compression succeeded, we'll keep the cdata around for
 		 * writing and release it afterwards.
 		 */
+		if (rounded > csize) {
+			bzero((char *)cdata + csize, rounded - csize);
+			csize = rounded;
+		}

to

		/*
		 * Compression succeeded, we'll keep the cdata around for
		 * writing and release it afterwards.
		 */
		if (rounded > csize) {
			abd_zero_off(cdata, rounded - csize, csize);
			csize = rounded;
		}

ZFSonLinux:
zfsonlinux#3114
zfsonlinux#3400
zfsonlinux#3433
@kernelOfTruth
Contributor

Bump.

Already somewhat tested patch: #3451
(from before the merge of the ARC mutex lock contention fixes)

Newly updated and rebased: #3491

@odoucet

odoucet commented Jun 21, 2015

I've just experienced this bug with a very large L2ARC (two 1.2 TB disks); it took several weeks to fill the whole disks.
Taking the devices offline did not seem to work (everything froze), so a reboot was the only option.
I can try to reproduce it with smaller drives on a second server if that helps.

To document the issue:
l2_io_error and l2_cksum_bad started growing very fast, and all access to the ZFS devices became very slow. I tried disabling the cache devices (zpool offline), but with no luck: the l2_size counter started decreasing slowly (I calculated that at that rate it would take 12 hours to reach l2_size=0).
Setting secondarycache=none on the whole pool did not fix it either. A hard reboot was necessary. I have kept the cache devices offline and secondarycache=none.

ARC stats when the bug triggered:

evict_l2_cached                 4    12935407407616
evict_l2_eligible               4    1298529380352
evict_l2_ineligible             4    2532005249024
l2_hits                         4    187427463
l2_misses                       4    971629639
l2_feeds                        4    3945185
l2_rw_clash                     4    12376
l2_read_bytes                   4    1375706628096
l2_write_bytes                  4    2988676811776
l2_writes_sent                  4    616145
l2_writes_done                  4    616145
l2_writes_error                 4    0
l2_writes_hdr_miss              4    2720
l2_evict_lock_retry             4    141
l2_evict_reading                4    0
l2_free_on_write                4    7529641
l2_cdata_free_on_write          4    357228
l2_abort_lowmem                 4    87411
l2_cksum_bad                    4    59004147
l2_io_error                     4    55709282
l2_size                         4    3807087760384
l2_asize                        4    2556936671744
l2_hdr_size                     4    48200121848
l2_compress_successes           4    158274896
l2_compress_zeros               4    0
l2_compress_failures            4    19776807

Now all values are at 0 (normal behaviour with no cache devices online).

@pzwahlen

@odoucet In my experience, removing a cache device is a blocking operation (from an I/O point of view, at least on zvols). When checked with strace, the 'zpool remove' command blocks on ioctl 0x5a0c. When my cache was growing toward infinity, this could take up to 5 minutes (with a reported L2 size of about 400G). Extrapolating, your 1.2 TB cache would need about 15 minutes to be removed.

Now that I run #3491, my cache doesn't grow that much anymore and removing a 60G cache device takes about 15 seconds (but is still a blocking operation).

@behlendorf
Contributor

@odoucet removing a cache device is a blocking operation. It doesn't strictly have to be, but that's how it was implemented.

Between the following two commits, this issue is believed to be resolved in master.

ef56b07 Account for ashift when gathering buffers to be written to l2arc device
d962d5d Illumos 5701 - zpool list reports incorrect "alloc" value for cache device
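For anyone mapping the first commit's subject onto the accounting discussed above: "accounting for ashift" means rounding each buffer's physical size up to the vdev's 1 << ashift allocation unit before charging it, so the charge matches what the device actually consumes and what is later credited back. A standalone sketch of that rounding (the helper name and the leaf-vdev behaviour are an approximation of what vdev_psize_to_asize() does in the real code):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/*
 * Standalone stand-in for the ashift rounding done in the kernel:
 * round a physical size up to the device's 1 << ashift allocation unit.
 */
static uint64_t
psize_to_asize(uint64_t psize, unsigned ashift)
{
	uint64_t unit = 1ULL << ashift;

	return ((psize + unit - 1) & ~(unit - 1));
}

int
main(void)
{
	/* e.g. a 2.5 KiB compressed buffer on a 4 KiB-sector (ashift=12) SSD */
	uint64_t psize = 2560;
	uint64_t asize = psize_to_asize(psize, 12);

	printf("psize = %" PRIu64 ", charged asize = %" PRIu64 "\n",
	    psize, asize);
	return (0);
}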

@odoucet

odoucet commented Jun 26, 2015

These two fixes were not merged into 0.6.4.2; was that expected?

@behlendorf
Contributor

@odoucet they were deemed too high-risk and intentionally skipped. We want to be very conservative regarding what we backport.
