Integrate ARC more tightly with Linux #618
Conversation
Per commentary on commit 201e2b2bfb1a017c378f6be8a590bd975f82516e, I started bisecting kernel and ZoL versions, but the panic is now happening on all recent combinations. Something in the affected pool is broken. After recompiling SPL and ZFS with
Plus the dump files are now written out and readable. First
Second
The pool becomes importable again using
The tail of the
Thanks for following up on this, but the more I look at it the more it seems unrelated to the VM change above. Probably just bad timing, and we're now seeing some previous damage to the pool. The first set of assertions indicates some damage to the zil log which was written out. Unfortunately, the assertions don't provide enough data to see exactly what's wrong. Changing the ASSERTs to the ASSERT3x variants would be helpful to see exactly which condition is failing, but I'd also probably need a hex dump of those blocks to see exactly what's wrong. The second assertion, after you disabled zil replay, is a little more troubling, although it might be related. That's going to take some digging as well. I think we should probably move both of these issues to new bugs so as not to cloud any issues with this VM patch. The spl log files: /tmp/spl-log.1332715670.7780 /tmp/spl-log.1332715670.7780.txt
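For context on the ASSERT3x suggestion: those macro variants take the two operands and the comparison separately, so a failure report includes the actual values rather than just the expression text. A minimal sketch follows; `lr_seq` and `last_seq` are made-up operands for illustration, not taken from the failing code.

```c
/* Illustration only; lr_seq and last_seq are hypothetical variables. */
static void
assert_example(uint64_t lr_seq, uint64_t last_seq)
{
	/* A plain ASSERT only reports that the expression failed. */
	ASSERT(lr_seq > last_seq);

	/*
	 * The ASSERT3x variants (ASSERT3U for unsigned, ASSERT3S for signed,
	 * ASSERT3P for pointers) also print both operand values on failure,
	 * which is what makes the failing condition diagnosable from a
	 * console log alone.
	 */
	ASSERT3U(lr_seq, >, last_seq);
}
```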
Under Solaris the ARC was designed to stay one step ahead of the VM subsystem. It would attempt to recognize low memory situations before they occurred and evict data from the cache. It would also make assessments about whether there was enough free memory to perform a specific operation. This was all possible because Solaris exposes a fairly decent view of the memory state of the system to other kernel threads. Linux, on the other hand, does not make this information easily available. To avoid extensive modifications to the ARC, the SPL attempts to provide these same interfaces. While this works, it is not ideal, and problems can arise when the ARC and Linux have different ideas about when you're out of memory. This has manifested itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of the preferred Linux interface. That means moving the bulk of the memory reclaim logic out of the arc_reclaim_thread and into the eviction-driven shrinker callback. The Linux VM will call this function when it needs memory. The ARC is then responsible for attempting to free the requested amount of memory if possible. Several interfaces have been modified to accommodate this approach; however, the basic user space implementation remains the same. The following changes almost exclusively apply to the kernel implementation.

* Removed the hdr_recl() reclaim callback, which is redundant with the broader arc_shrinker_func().
* Reduced arc_grow_retry to 5 seconds from 60. This is now used internally in the ARC with arc_no_grow to indicate that direct reclaim was recently performed. This typically indicates a rapid change in memory demands which the kswapd threads were unable to keep ahead of. As long as direct reclaim is happening once every 5 seconds, ARC growth will be paused to avoid further contributing to the existing memory pressure. The more common indirect reclaim paths will not set arc_no_grow.
* arc_shrink() has been extended to take the number of bytes by which arc_c should be reduced. This allows for a more granular reduction of the ARC target. Since the kernel provides a reclaim value to arc_shrinker_func(), this value is used instead of 1<<arc_shrink_shift.
* arc_reclaim_needed() has been removed. It was used to determine if the system was under memory pressure and relied extensively on Solaris-specific VM interfaces. In most cases the new code just checks arc_no_grow, which indicates that direct memory reclaim occurred within the last arc_grow_retry seconds.
* arc_memory_throttle() has been updated to always include the amount of evictable memory (ARC and page cache) in its free space calculations. This space is largely available in most call paths due to direct memory reclaim.
* The Solaris pageout code was also removed to avoid confusion. It has always been disabled due to proc_pageout being defined as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
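For readers less familiar with the Linux side being targeted here, the shrinker interface on 3.x-era kernels looks roughly like the sketch below. This is not the actual arc_shrinker_func(); arc_freeable_pages() and arc_evict_bytes() are hypothetical stand-ins for the ARC internals.

```c
#include <linux/mm.h>       /* struct shrinker, struct shrink_control */

/* Hypothetical helpers standing in for the real ARC internals. */
static unsigned long arc_freeable_pages(void);
static void arc_evict_bytes(unsigned long bytes);

/*
 * Old-style (pre-3.12) shrinker callback.  When nr_to_scan is zero the
 * VM is only asking how much could be freed; otherwise it is reclaim
 * asking for that many pages to be scanned and freed if possible.
 */
static int
arc_shrinker_sketch_func(struct shrinker *shrink, struct shrink_control *sc)
{
	if (sc->nr_to_scan == 0)
		return (arc_freeable_pages());

	arc_evict_bytes(sc->nr_to_scan * PAGE_SIZE);

	return (arc_freeable_pages());
}

static struct shrinker arc_shrinker_sketch = {
	.shrink = arc_shrinker_sketch_func,
	.seeks  = DEFAULT_SEEKS,
};

/* Registered once at module load: register_shrinker(&arc_shrinker_sketch); */
```

With a callback of this shape, the point about arc_shrink() taking a byte count follows naturally: the nr_to_scan value supplied by the kernel, rather than a fixed 1<<arc_shrink_shift step, decides how far arc_c is reduced.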
Hi Brian, I really like this ARC patch. Memory usage is now smooth and no longer erratic as it was before. I have run it for some days on Suse 12.1 with kernel 3.1.9 without any problems. It also improves zvol performance by magnitudes. Thanks!
Brian, I decided to try your patch on my server. The following happened:

[ 7109.054462] [] ? nf_hook_slow+0x6f/0x150
Just a question: did you check if there is a swap device with some space available? In my case, some 200 MB of swap space was needed while using 20 GB RAM with 2 GB of swap configured on openSUSE. It may be the case that your server can't use swap: "Free swap = 0kB, Total swap = 0kB".
I have no swap device. My plan is to use a zvol for swap when the issues involving that have been solved.
Ok, I need to try that too.
Ok, within openSUSE (kernel 3.1.9) I created a swap zvol: zfs create -o volblocksize=64k -V 10G /stor2/zfsswap. I then created 100,000 random data files using 100 GB on another zpool, which uses up to 12 GB RAM. It is using
First off, thanks for trying the patch. This is the kind of feedback I need. The memory allocation failed because it was GFP_ATOMIC. These types of allocations are typically only done in interrupt context and thus are not allowed to sleep or invoke any of the memory reclaim logic. The free memory must be available right now on the system. There's really not much we can do about that other than use less memory or increase min_free_kbytes to leave a larger reserve. A swap device might help too; in my experience Linux still can't really run safely without swap for a wide variety of workloads. As for your success with the swap device, that's good to hear, although I haven't fixed anything there yet, so I'm not sure how stable that will be.
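A minimal sketch of the distinction being described (not the actual allocation site from the stack trace):

```c
#include <linux/slab.h>

static void
gfp_flags_example(void)
{
	void *a, *b;

	/*
	 * GFP_KERNEL: the caller may sleep, so on a shortfall the kernel can
	 * run direct reclaim -- including shrinker callbacks like the ARC's --
	 * before failing the request.
	 */
	a = kmalloc(4096, GFP_KERNEL);

	/*
	 * GFP_ATOMIC: the caller cannot sleep (e.g. interrupt context), so the
	 * request must be satisfied from memory that is free right now,
	 * essentially the reserve sized by vm.min_free_kbytes.  When that
	 * reserve is exhausted or fragmented, the allocation simply fails.
	 */
	b = kmalloc(4096, GFP_ATOMIC);

	kfree(a);
	kfree(b);
}
```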
@pyavdr You might be interested in issue #342. As far as testing swap goes, you probably should fill the swap to half full by filling a tmpfs or some other means. If your system doesn't lock up, then that would be very good news. @behlendorf I am not sure if increasing min_free_kbytes is the right way to address this. This issue looks like a mix of internal and external fragmentation in the SLUB allocator. It is internal because SLUB has wasted up to 50% of our allocated space, and some of that wasted free space is likely adjacent to unallocated free space that might have been big enough to satisfy these allocation requests had the wasted space been available. It is also external because if the free space chunks were more contiguous, these allocations would have succeeded. I imagine that we could patch the ZFS code to do allocations from a heap in a virtual address space, which should permit us to fight this kind of fragmentation. Another possibility is to patch the kernel to provide better guarantees on the buckets available when entering an ATOMIC section, but that does not seem to be as good of a solution to me. What do you think?
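To make the "heap in a virtual address space" idea concrete, here is a hedged sketch of the usual kernel pattern: fall back from a physically contiguous kmalloc() to a virtually contiguous vmalloc() when fragmentation defeats the former. This only illustrates the idea, not where ZFS should do it, and it does not help the GFP_ATOMIC case since vmalloc() may sleep.

```c
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Returns a buffer of len bytes.  The caller must remember which
 * allocator succeeded and free with kfree() or vfree() accordingly.
 */
static void *
alloc_possibly_fragmented(size_t len, int *is_vmalloc)
{
	void *buf;

	/* Physically contiguous; multi-page requests fail under fragmentation. */
	buf = kmalloc(len, GFP_KERNEL | __GFP_NOWARN);
	if (buf != NULL) {
		*is_vmalloc = 0;
		return (buf);
	}

	/*
	 * Virtually contiguous: built from individual pages, so largely
	 * immune to external fragmentation, at the cost of TLB pressure.
	 */
	*is_vmalloc = 1;
	return (vmalloc(len));
}
```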
Actually, this issue is more complicated than I first thought. I suspect that it would not happen if CONFIG_PREEMPT_VOLUNTARY=y was not set, although this is not a proper fix. I will test that and post my results.
@behlendorf, I seem to have overreacted earlier. Disabling CONFIG_PREEMPT_VOLUNTARY made the system stable. ZFS is freeing memory in low memory situations. My only complaint is that it appears to free about 6 gigabytes on a 16GB system, which seems to be a little extreme. This would appear to be safe as long as you patch the ./configure script to fail when CONFIG_PREEMPT_VOLUNTARY is enabled.
@gentoofan Thanks for the update. Adding support for preempt is getting overdue at this point; we have a long standing issue open, #83. However, I only fail configure on CONFIG_PREEMPT; I should probably add CONFIG_PREEMPT_VOLUNTARY. Getting full preempt support going isn't a ton of work, it just requires some care and a whole lot of testing. As for the 6GB freed, that does seem a bit much. This is likely due to how the shrinker reclaim functions are implemented: they always err on the side of freeing too much. They could/should be updated simply to scan the requested number of objects and free what they can. We'll certainly look into that too, but one step at a time.
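For illustration, a build-time guard along these lines is one way to picture what "failing configure" on an unsupported preemption model amounts to; the real check lives in the SPL's configure machinery rather than a header, so treat this purely as a sketch.

```c
/*
 * Sketch of rejecting unsupported kernel preemption models at build time.
 * CONFIG_PREEMPT (and, per the discussion above, CONFIG_PREEMPT_VOLUNTARY)
 * come from the kernel's generated config headers when enabled.
 */
#if defined(CONFIG_PREEMPT) || defined(CONFIG_PREEMPT_VOLUNTARY)
#error "Preemptible kernels are not supported by this SPL/ZFS release"
#endif
```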
@gentoofan Testing zfsswap on openSUSE 12.1 (SMP kernel 3.1.9) with memtester and the ARC patch on 0.56-rc8. As soon as the usage of zfsswap increases, the system locks up. I need to reset the system; no chance to get any traces. @behlendorf It looks like every new feature raises new problems, which are sometimes caused by older, unsolved issues. Maybe you guys should concentrate on solving the essential old issues, thus making ZOL reliable. This should help to implement new features much faster and with fewer errors, because you don't need to search for problems caused by known older issues and you can incorporate known design rules for new features. Not solving the old issues may cause workarounds which cost double the work and make ZOL unreliable. In terms of reliability, error handling on vdevs and making memory usage waterproof are essential. Features for usage (Samba support, iSCSI support, zvols, snapshots) make ZOL usable. So implementing this ARC patch is the right way to solve well-known design problems from old issues. Please keep going on the essential old issues. In today's state ZOL is more fun than ever. I'm prepared to assist with testing as far as my knowledge allows. Just my 2 cents.
@pyavdr Does this patch break anything that was working previously?
@gentoofan As far as I have tested, everything works fine with openSUSE 12.1 (kernel 3.1.9) and this patch. (But openSUSE uses systemd, which is not supported yet, so no automount or installing with the PPA; zvols are not creating the /dev/zvol directory; and more.) I also checked the vanilla kernel 3.3.1 on openSUSE, but not completely. To fully answer your question: the ZOL and SPL test suites should become reliable for all supported platforms, which is, as far as I can see, not the case. In detail, we are free to add some more test cases (rsync? creating & deleting large files? access to xattrs? the .zfs directory? zfs send & receive? snapshots? ...). I'm sure that Brian uses lots of test cases. Maybe some of them should be made available for testing.
@pyavdr This is a pull request for code that improves memory management. Your suggestions are unrelated to it, so they are unlikely to receive much attention here. You should file a new issue for them.
@pyavdr Correctness and stability have always been my first priority. That's one of the major reasons I have yet to tag a stable release... we're not there yet. Yes, lots of people are running the release candidates successfully, but it's not to the point yet where I feel comfortable calling it stable. We are talking about a filesystem here, and that means we need to set an exceptionally high bar. People are slow to forgive data loss, and often first impressions last a long time. As for pulling in new features, I've tried to keep that to a minimum. However, there are certain features I want to be there in the production release, so they have been merged after some decent testing, for example snapshots. Other features such as the pending smbfs support are relatively low risk and isolated, so I may bring in that code. I'm also all for additional testing. I do quite a bit of testing for every commit in an automated build farm, but more is always better. What I would like to see happen, but haven't found anyone to do, is for the upstream ZFS test suite to get ported to Linux (issue #6). It has a whole mess of existing tests to verify the basic ZFS functionality. If you have the ability and time, getting this ported would certainly help improve stability and prevent regressions. As for this VM patch, I'm currently feeling pretty good about it. The reported issues thus far have been unrelated to this change as near as I can tell. There's still certainly room for improvement here, but this should help.
Brian, I started some rsyncs while running 3 virtual machines and the system locked up, although it did so gradually. I first lost access to root, but I was able to

I had a few levels of virtualization running on the system. First, I had 3 virtual machines running in KVM. In one was OpenIndiana, in which I was installing Gentoo Prefix inside a Solaris Zone. Interestingly, my SSH connection to the host system and the KVM virtual machine died around the same time, while the Solaris Zone kept operating for about a minute. I was able to log into the system and get a shell, but attempting to sudo locked up. I then started piping information to my desktop until the machine had locked up entirely. I was only able to run a few commands. You can see the output of most of them below.

Here is an excerpt of dmesg:

[17914.034994] failure to allocate a tage (552)

It had several hundred of those. That was the only thing printed in the log in relation to this event. Free memory was at about 4GB, although I did not try to pipe that to a file until after SSH had stopped functioning, which prevented me from recording it.

Here is /proc/spl/kstat/zfs/arcstats:

4 1 0x01 77 3696 2006449653 18313977911159

The system still responds to pings and port 22 is still open, although ssh blocks upon "debug1: Entering interactive session." when running it with -vv. Looking at network utilization from the router, the rsync processes do not appear to be downloading anything. Lastly, the system software watchdog did not trigger.
Hi Brian,

[22703.082949] BUG: soft lockup - CPU#7 stuck for 22s! [splat_kmem_cach:22555]

These lines are repeated for each CPU. I gave it a second run; the system again gives a soft lockup message at kmem:slab_overcommit:

[23341.132679] BUG: soft lockup - CPU#7 stuck for 23s! [splat_kmem_cach:25351]

After that the system runs the rest of the splat tests fine. There was no reset needed.
@gentoofan: Were you using the latest version of this patch when you encountered the deadlock? There was a change made last week to more correctly account for the ghost list in the reclaim logic. We observed a similar issue.
@pyavdr Thanks for the feedback. Those particular splat tests are designed to push the spl slab implementation as hard as possible, likely much harder than zfs ever will. The slab_overcommit test is particularly nasty. Since the system just issued a few warnings about being slow under the pressure, that's actually pretty encouraging. Still, there's always room for improvement.
@behlendorf I was using commit 36a0e0d6ce2b18d9a5fca266fc3e8c7fd9baab4e. If I recall correctly, I cherry picked it two days ago. I take it that is the old version. I will retest with 525c13f. |
@gentoofan Actually, commit 36a0e0d6ce2b18d9a5fca266fc3e8c7fd9baab4e does have the fix I was referring to, so I'll look a bit more carefully at the arc stats.
This just happened to me again, although I was not able to get any output. I was recompiling Gentoo inside a virtual machine as part of a Gentoo install. The purpose had been to test booting Gentoo Linux off a raidz2 vdev using GRUB2. There was a brief lag a few minutes before it happened, after which I had spotted a message in the host's dmesg stating that "hpet increased min_delta_ns". This is with Linux 3.3.1 in the host and Linux 3.3.0 from the Gentoo 12.1 LiveDVD in the guest. After it happened, I could still interact with the host using screen over SSH, but running dmesg locked the screen window and new ones would not open. The system still responded to pings and port 22 was still open, but I could not open new SSH connections. Additionally, the system's CPU fan was running at an elevated RPM, which would suggest that there was some sort of CPU activity, although I do not know what that was for certain.
FYI, I've now had 2 lockups whilst running with openzfs/spl@8920c69 and behlendorf/zfs@525c13f, the first on linux-3.3.0 and the second on linux-3.3.1. The machine would respond to pings, but I couldn't get any action from ssh windows open to the machine nor log in anew, there was no response on the console, and kern.log ("locally" on an NFS-mounted root and to a remote syslog server) didn't show anything. The workload in both cases was a large tar to zfs from an md/lvm/ext4 partition on different disks in the same machine. Now continuing on with the same workload to see if I get lockup number 3, and if so I will then try without "Integrate ARC more tightly with Linux".
@behlendorf Is it possible that you have been testing this with behlendorf/zfs@c475167 included in your build? |
@gentoofan Yes, this commit, c475167, has been in our master branch for quite some time now. You should be using it as well. Why?
I had managed to import this fix from Illumos in the process of importing fixes for issue #644 and mistakenly thought that we were missing that patch. If I recall, this one was odd in that it required manual editing to import, and until now I had not had a chance to follow up on my mental note as to why. Looking at it more closely, the reason is that we had already merged it. Please disregard my previous comment.
I have been testing this on my desktop. When compiling things with many X processes running, the system will lag. There is a correlation between increases in memory_direct_count (and to a lesser extent, memory_indirect_count) and these lags. I managed to get a split-second view of htop during a lag. All CPU cores were spending >90% of their time in the kernel. Previously, I had tested this on my server and my SSH sessions would appear to freeze temporarily when the system was under load. I assume that the issue I observed on my server is the same as the one I observe on my desktop.
The issues I mentioned in my previous comment appear to have been the result of memory pressure caused by the limits set in 23bdb07. I have written a patch to address that. It is a bit early to say that this has resolved the issue that I described in my previous comment, but my initial results are promising. I have opened pull request #660 with my patch.
While my patch helped, the original lag issue is not solved. It just happened again when compiling GCC 4.6.2.
@gentoofan I suspect the lag you're seeing is being caused by zfs attempting to free too much in the direct reclaim path. The Linux shrinkers request that a specific number of objects in a cache be scanned, and any objects in that group which can be freed are freed. This is done so the kernel can load balance freeing from all the various caches on the system. The rub is that the Solaris APIs which are used by the kmem slab don't take a scan count. They assume that once called the entire cache will be scanned and all eligible objects freed. That's obviously going to take a little longer for a large cache. Prior to this patch the reclaim was done in the arc_reclaim thread, so it was probably harder to observe this issue; now any process may encounter it as part of direct reclaim. Hence the lag. This isn't a big deal for a server, but I can see it being more annoying on a desktop system. The fix is going to have to be to update the shrinkers to only do partial scans rather than the Solaris-style full scans. But that change should probably be done in a separate patch.
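A sketch of the contrast being described, using a hypothetical cache type and helpers rather than the real SPL kmem code:

```c
#include <linux/mm.h>   /* struct shrink_control */

/* Hypothetical cache and helpers, for illustration only. */
struct example_cache;
static unsigned long cache_object_count(struct example_cache *c);
static unsigned long cache_free_idle_objects(struct example_cache *c,
    unsigned long max_to_scan);

/*
 * Solaris-style full reap: once invoked, walk the whole cache and free
 * every eligible object, no matter how little the kernel actually asked
 * for.  On a large cache this is the multi-second stall now seen in the
 * direct reclaim path.
 */
static void
reclaim_full(struct example_cache *c)
{
	cache_free_idle_objects(c, cache_object_count(c));
}

/*
 * Linux-style partial scan: examine only the number of objects the
 * shrinker requested, then return, so the VM can balance reclaim across
 * all caches and no single caller stalls for long.
 */
static int
reclaim_partial(struct example_cache *c, struct shrink_control *sc)
{
	cache_free_idle_objects(c, sc->nr_to_scan);
	return ((int)cache_object_count(c));
}
```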
Oh, and we really do need to keep the shrinker at least until we can completely map the ARC into the Linux page cache. I've planned that work for 0.7.0 since it should solve most/all of our memory-related problems. But it's a big change and will be destabilizing.
While testing performance, this patch (525c13f) introduces a performance regression of about 10% versus the plain rc8 code. My test case (openSUSE 12.1, kernel 3.1.9 SMP, 0.6.59.rc8 in VMware, 2 striped RAIDZ1 groups of 4 vdevs each, and 12 GB RAM) is creating 100,000 randomly filled files with a total size of 95 GB and deleting them afterwards. With rc8 it takes 29 min; applying this patch to rc8, it takes 33 min.
@pyavdr I imagine that the slowdown is due to memory pressure. Try setting zfs_arc_max to 2GB and repeating your tests.
Playing around with zfs_arc_max (first the default, then 2 GB, then 10 GB) and total system memory (first 12 GB, then 16 GB): the system is not under memory pressure; there is always about 1-2 GB free. The difference between rc8 and rc8 with this patch remains. But Brian wrote that he wants to go on with full integration into standard Linux memory handling, so at this time those performance numbers are of little relevance; much more will come, but it remains strange. I expected the reverse situation. Besides that: zfs rollback for a 100 GB case is really slow, and so is deleting 100,000 files. In both cases the 8 CPUs are idling at 1-3% and zpool iostat reports about 100 IOPS, while on write there are 2200 IOPS.
Thanks for the performance feedback. I'd also expect the slowdown to be caused by more direct reclaim occurring in your test case rather than this being done asynchronously. Anyway, this is just one step towards better Linux integration, so for now it seems like the price we need to pay.
Merged as commit 302f753 |