0.6.4 Slow NFS Performance Corrected? #2899

Closed
xflou opened this issue Nov 14, 2014 · 16 comments
Labels
Type: Performance (Performance improvement or performance problem)

@xflou

xflou commented Nov 14, 2014

Hello, not sure if this is the correct place to ask this, but while browsing through and reading one of the issues related to slow NFS performance, I noticed that the problem was corrected in release 0.6.4.

I am having a very similar issue and wanted to know if anyone can confirm that this release has indeed corrected the problem. I would like to apply this release but would like confirmation before removing 0.6.3 and installing 0.6.4.

Frank

@FransUrbo
Contributor

I noticed that the problem was corrected in release 0.6.4.

There is no 0.6.4 release. Yet. There are quite a number of issues left.

You COULD try building your own packages from the GIT master repository.

Where did you find the part that said it was fixed [in 0.6.4]?

@xflou
Author

xflou commented Nov 14, 2014

I ran across this here --> "#2373"

My issue was not exactly the same, but I had been experiencing the same symptoms: extremely slow copies between client and file-server NFS mounts, and even slower copy rates between zpool-exported mounts.

Lame question, but is there any information you can point me to on how to build your own package?

I've set some zfs parameters, which seem to be working so far, but I wanted to try correcting the NFS issue permanently with the 0.6.4 release, if possible.
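
(For reference, ZFS module parameters of this sort are typically set via modprobe options or sysfs. The tunable below is only an illustrative example, not necessarily one of the parameters actually used here.)

    # persistent: applied when the zfs module loads (example tunable only)
    # /etc/modprobe.d/zfs.conf
    #   options zfs zfs_arc_max=8589934592

    # or changed at runtime through sysfs:
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
    cat /sys/module/zfs/parameters/zfs_arc_max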

@behlendorf
Contributor

@xflou It should be considerably improved in the next tag, which will be 0.6.4. There are directions for building generic rpm and deb packages at the links below. Alternatively, there may be testing/development packages available for your distribution which contain this improvement.

http://zfsonlinux.org/generic-rpm.html
http://zfsonlinux.org/generic-deb.html
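
In rough outline, the generic-rpm route from those pages looks like the following. This is only a sketch; the linked instructions are authoritative for the prerequisites and the exact make targets.

    # build prerequisites (approximate; the linked pages have the full list)
    yum install gcc make autoconf automake libtool kernel-devel \
        zlib-devel libuuid-devel libblkid-devel

    git clone https://github.com/zfsonlinux/spl.git
    git clone https://github.com/zfsonlinux/zfs.git

    # SPL first, then ZFS; "make rpm" leaves the packages in the source tree
    cd spl && ./autogen.sh && ./configure && make rpm
    rpm -Uvh *.x86_64.rpm *.noarch.rpm
    cd ../zfs && ./autogen.sh && ./configure && make rpm
    rpm -Uvh *.x86_64.rpm *.noarch.rpm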

behlendorf added this to the 0.6.4 milestone Nov 14, 2014
behlendorf added the Type: Performance label Nov 14, 2014
@xflou
Author

xflou commented Nov 14, 2014

@behlendorf Thank you for the information. I have a non-production system I'm putting together to try this on before patching my production server. I'll give the generic-rpm a shot. Hopefully it will work for CentOS 6.4.

@FransUrbo
Contributor

Lame question, but is there any information you can point me to on how to build your own package?

Depending on your distribution, it's either

http://zfsonlinux.org/generic-deb.html

or

http://zfsonlinux.org/generic-rpm.html

@behlendorf
Contributor

@xflou For CentOS 6.4 you can just install from the zfs-testing repository. Just enable it in /etc/yum.repos.d/zfs.repo and disable the default zfs repository. By default the zfs repository tracks the stable tag and zfs-testing tracks master.
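
In practice that switch is just flipping enabled=1/enabled=0 between the two sections of /etc/yum.repos.d/zfs.repo, or it can be done per transaction without editing the file (a sketch; section names assumed to match the stock zfs.repo shipped by the zfs-release package):

    # pull zfs/spl from zfs-testing for this transaction only
    yum --disablerepo=zfs --enablerepo=zfs-testing upgrade zfs spl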

@xflou
Author

xflou commented Nov 14, 2014

@behlendorf Thanks!! Using the zfs-testing repository will save me lots of time.

@deajan

deajan commented Nov 28, 2014

@xflou Did you test the current master of zfs against NFS performance yet?
I'm having serious NFS write performance trouble here with 0.6.3.

@xflou
Author

xflou commented Dec 14, 2014

Finally found a window to apply the upgrade (today). I have several problems I need serious help with:

First, the sequence of events:

  1. I first tested the same upgrade on an identical non-production system; everything upgraded and seemed to be fine, and the zpool status command came back with all pools online.
  2. I then applied the same upgrade to my production system ("yum upgrade zfs") using the same zfs-testing repo, and it seemed to go fine - no errors during the yum upgrade.

Now, when I check the status of the pools on my production system after the upgrade, I have a bunch of "UNAVAIL" disks with a "DEGRADED" state in almost every pool, and one pool cannot be mounted at all because its raidz2-0 vdev is unavailable. The output of my zpool status command follows after my questions below.

  1. I upgraded zfs, but did it with the zfs file systems "unmounted". Could this have caused this issue?

  2. I attempted to place the disk that was "UNAVAIL" back online, but it returned the message below indicating that I should replace the disk.


    warning: device 'sdao' onlined, but remains in faulted state
    use 'zpool replace' to replace devices that are no longer present


    Does the above message indicate that I need to replace the disk or is there another way to save this?

  3. I also have a faulted disk and a degraded raidz2-0 array; does this disk have to be replaced?

  4. If I am forced to replace the disk with a new one, will it automatically resilver, and can the system be used while that is being done? (A rough sketch of that workflow is included after the pool output below.)

  5. In general, it seems that a lot of disks went bad at once; could this be possible or accurate, given the information above?

  6. Could "downgrading" help in this situation? I would not think so, but have to ask.

  7. zfs version output below:
OLD VERSION
libnvpair1.x86_64 0.6.3-1.1.el6 @zfs
libuutil1.x86_64 0.6.3-1.1.el6 @zfs
libzfs2.x86_64 0.6.3-1.1.el6 @zfs
libzpool2.x86_64 0.6.3-1.1.el6 @zfs
spl.x86_64 0.6.3-1.1.el6 @zfs
spl-dkms.noarch 0.6.3-1.1.el6 @zfs
zfs.x86_64 0.6.3-1.1.el6 @zfs
zfs-dkms.noarch 0.6.3-1.1.el6 @zfs
zfs-release.noarch 1-4.el6 @/zfs-release.el6.noarch
libzfs2-devel.x86_64 0.6.3-1.1.el6 zfs
lustre.x86_64 2.4.2-1dkms.el6 zfs
lustre-debuginfo.x86_64 2.4.2-1dkms.el6 zfs
lustre-dkms.noarch 2.4.2-1dkms.el6 zfs
lustre-source.x86_64 2.4.2-1dkms.el6 zfs
lustre-tests.x86_64 2.4.2-1dkms.el6 zfs
spl-debuginfo.x86_64 0.6.3-1.1.el6 zfs
zfs-debuginfo.x86_64 0.6.3-1.1.el6 zfs
zfs-devel.x86_64 0.6.2-1.el6 zfs
zfs-dracut.x86_64 0.6.3-1.1.el6 zfs
zfs-fuse.x86_64 0.6.9-6.20100709git.el6 epel
zfs-test.x86_64 0.6.3-1.1.el6 zfs

NEW VERSION

zfs-dkms-0.6.3-1.el6.noarch
zfs-release-1-4.el6.noarch
libzfs2-0.6.3-1.el6.x86_64
zfs-0.6.3-159_gc944be5.el6.x86_64

nspluginwrapper-1.4.4-1.el6_3.x86_64
spl-dkms-0.6.3-1.el6.noarch
spl-0.6.3-1.el6.x86_64

libnvpair1.x86_64 0.6.3-1.el6 @zfs
libuutil1.x86_64 0.6.3-1.el6 @zfs
libzfs2.x86_64 0.6.3-1.el6 @zfs
libzpool2.x86_64 0.6.3-1.el6 @zfs
spl.x86_64 0.6.3-1.el6 @zfs
spl-dkms.noarch 0.6.3-1.el6 @zfs
zfs.x86_64 0.6.3-159_gc944be5.el6 @zfs-testing
zfs-dkms.noarch 0.6.3-1.el6 @zfs
zfs-release.noarch 1-4.el6 @/zfs-release.el6.noarch
libnvpair1.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
libuutil1.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
libzfs2.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
libzfs2-devel.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
libzpool2.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
lustre.x86_64 2.4.2-1dkms.el6 zfs-testing
lustre-debuginfo.x86_64 2.4.2-1dkms.el6 zfs-testing
lustre-dkms.noarch 2.4.2-1dkms.el6 zfs-testing
lustre-source.x86_64 2.4.2-1dkms.el6 zfs-testing
lustre-tests.x86_64 2.4.2-1dkms.el6 zfs-testing
spl.x86_64 0.6.3-52_g52479ec.el6 zfs-testing
spl-debuginfo.x86_64 0.6.3-52_g52479ec.el6 zfs-testing
spl-dkms.noarch 0.6.3-52_g52479ec.el6 zfs-testing
zfs-debuginfo.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
zfs-devel.x86_64 0.6.2-287_g2024041.el6 zfs-testing
zfs-dkms.noarch 0.6.3-159_gc944be5.el6 zfs-testing
zfs-dracut.x86_64 0.6.3-159_gc944be5.el6 zfs-testing
zfs-test.x86_64 0.6.3-159_gc944be5.el6 zfs-testing

Below is the output for the three different errors with my production pools.

This particular pool will not mount, since it looks like 3 disks failed (will I need to restore from an alternate backup?):


pool: tools
state: UNAVAIL
status: One or more devices could not be used because the label is missing
or invalid. There are insufficient replicas for the pool to continue
functioning.
action: Destroy and re-create the pool from
a backup source.
see: http://zfsonlinux.org/msg/ZFS-8000-5E
scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
tools       UNAVAIL      0     0     0  insufficient replicas
  raidz2-0  UNAVAIL      0     0     0  insufficient replicas
    sdar    UNAVAIL      0     0     0
    sdav    ONLINE       0     0     0
    sdaz    ONLINE       0     0     0
    sdas    ONLINE       0     0     0
    sdaw    UNAVAIL      0     0     0
    sdba    UNAVAIL      0     0     0
    sdat    UNAVAIL      0     0     0
    sdbb    ONLINE       0     0     0
    sdau    ONLINE       0     0     0
    sdbc    ONLINE       0     0     0

This pool has one disk UNAVAILABLE, and I attempted to place it back online, but it indicated that I must replace the disk. (Do I need to replace the disk in this case, or can I use the same disk?)


pool: pub
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 2h59m with 0 errors on Sun Sep 21 17:30:30 2014
config:

NAME        STATE     READ WRITE CKSUM
publish     DEGRADED     0     0     0
  raidz2-0  DEGRADED     0     0     0
    sdd     ONLINE       0     0     0
    sdk     ONLINE       0     0     0
    sdl     ONLINE       0     0     0
    sds     ONLINE       0     0     0
    sdt     ONLINE       0     0     0
    sdaa    ONLINE       0     0     0
    sdab    ONLINE       0     0     0
    sdai    ONLINE       0     0     0
    sdaj    ONLINE       0     0     0
    sdao    UNAVAIL      0     0     0

errors: No known data errors


Last pool: same question as before (will I need to replace this disk, or can I reuse it?)


pool: data
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 7h49m with 0 errors on Sun Sep 21 22:20:38 2014
config:

NAME        STATE     READ WRITE CKSUM
proj        DEGRADED     0     0     0
  raidz2-0  DEGRADED     0     0     0
    sdg     ONLINE       0     0     0
    sdh     ONLINE       0     0     0
    sdo     ONLINE       0     0     0
    sdp     ONLINE       0     0     0
    sdw     ONLINE       0     0     0
    sdx     ONLINE       0     0     0
    sdae    ONLINE       0     0     0
    sdaf    ONLINE       0     0     0
    sdam    ONLINE       0     0     0
    sdaq    FAULTED      0     0     0  corrupted data

errors: No known data errors
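
(For reference, a minimal sketch of the replace/resilver workflow asked about in questions 2-4. The replacement device name below is hypothetical, and this is not a statement that the disks above actually need replacing.)

    zpool replace pub sdao /dev/sdbz   # swap the UNAVAIL device for a new (hypothetical) disk
    zpool status pub                   # shows resilver progress; the pool stays usable meanwhile
    zpool clear pub                    # clear error counters once the resilver completes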


@xflou
Author

xflou commented Dec 15, 2014

Please disregard. A quick reboot did the trick. Now I need to load my production system and see if NFS handles things better.

@deajan

deajan commented Dec 15, 2014

Please keep us up to date with benchmarks whenever you can.

@dswartz
Contributor

dswartz commented Dec 15, 2014

As far as I can tell, the slow NFS writes with an SSD SLOG seem to have been addressed by the AIO changes. I put a good 200GB over-provisioned and freshly erased SSD on my 3x2 raid10 pool as a SLOG. I added a VHD from that pool to my virtual Win7 guest (vSphere) and re-ran CrystalDiskMark: sequential reads 106 MB/sec, writes 88 MB/sec (over a gigabit link).
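
For anyone wanting to try a similar setup, adding a dedicated SSD log device (SLOG) to an existing pool looks roughly like this; pool and device names are examples only, not the ones used above.

    zpool add tank log /dev/disk/by-id/ata-EXAMPLE_SSD   # attach the SSD as a separate intent log
    zpool status tank                                    # the device shows up under a separate "logs" section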

@deajan

deajan commented Dec 30, 2014

How safe would it be to update to zfs testing in a production environment that has big NFS issues?

@FransUrbo
Contributor

How safe would it be to update to zfs testing in a production environment that has big NFS issues?

Generally 'very [safe]'. Your mileage may vary, but I run latest GIT on my primary storage and all my machines. Usually 'we' recommend running latest...

There HAVE been issues and problems introduced in GIT master/latest, but they are rare (I can only remember one, actually), and steps have been taken to avoid this in the future…

DO NOTE that if you're unlucky, features in the pool can be enabled when importing it with the new version, and some of these [features] don't exist in 0.6.3/tagged. If that happens, you won't be able to import the pool on an older version and will have to stick with latest…

The next tag (0.6.4) is probably a couple of months away; there are still 61 issues left (many of those are finished, they just need to be tested, verified and accepted).

I say 'tagged' because we shouldn't really talk about 'stable'. The latest/GIT master is usually more stable than the tagged release (because of the sheer number of issues/bugs fixed).
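
A quick way to see where a pool stands feature-wise before and after moving to a newer build (a sketch; the exact feature list varies by release):

    zpool upgrade -v                     # features this zfs build knows about
    zpool get all tank | grep feature@   # per-pool feature state: disabled / enabled / active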

@deajan

deajan commented Dec 30, 2014

Thanks a lot for your explanation. I'll dive into ZFS testing first on my home server, then my backup machine, and later on bigger backup machines :)

@behlendorf
Contributor

Since this has been confirmed fixed in master by several people, I'm closing this issue. As mentioned above, for those who need this fix now, it's available from the zfs-testing repository and will be part of the 0.6.4 tag.
