
Improve ARC hit rate with metadata heavy workloads #2110

Closed
wants to merge 11 commits

Conversation

prakashsurya (Member)

This stack of patches has been empirically shown to drastically improve         
the hit rate of the ARC for certain workloads. As a result, fewer reads         
to disk are required, which is generally a good thing and can                   
drastically improve performance if the workload is disk limited.                

For the impatient, I'll summarize the results of the tests performed:           

    * Test 1 - Creating many empty directories. This test saw 99.9%             
               fewer reads and 12.8% more inodes created when running           
               *with* these changes.                                            

    * Test 2 - Creating many empty files. This test saw 4% fewer reads          
               and 0% more inodes created when running *with* these             
               changes.                                                         

    * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer             
               reads and 4.9% more inodes created when running *with*           
               these changes.                                                   

    * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer          
               reads and 0% more inodes created (but took 6.9% fewer            
               seconds to complete) when running *with* these changes.                                              

    * Test 5 - Rsync'ing a dataset with many empty directories. This            
               test saw 36.2% fewer reads and 66.2% more inodes created         
               when running *with* these changes.                               

    * Test 6 - Rsync'ing a dataset with many empty files. This test saw         
               30.9% fewer reads and 0% more inodes created (but took           
               24.3% fewer seconds to complete) when running *with*             
               these changes.                                                   

    * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw         
               30.8% fewer reads and 173.3% more inodes created when            
               running *with* these changes.                                    

For the patient, what follows is a more detailed
description of the tests performed and the results gathered.

All the tests were run using identical machines, each with a pool               
consisting of 5 mirror pairs with 2TB 7200 RPM disks. Each test was run         
twice, once *without* this set of patches and again *with* this set of          
patches to highlight the performance changes introduced. The first four         
workloads tested were:                                                          

    ** NOTE: None of these tests were run to completion. They ran for a         
             set amount of time and then were terminated or hit ENOSPC.         

    1. Creating many empty directories:                                         

       * fdtree -d 10 -l 8 -s 0 -f 0 -C                                         
         -> 111,111,111 Directories                                             
         ->           0 Files                                                   
         ->           0 KiB File Data                                           

    2. Creating many empty files:                                               

       * fdtree -d 10 -l 5 -s 0 -f 10000 -C                                     
         ->       111,111 Directories                                           
         -> 1,111,110,000 Files                                                 
         ->             0 KiB File Data                                         

    3. Creating many 4 KiB files:                                               

       * fdtree -d 10 -l 5 -s 1 -f 10000 -C                                     
         ->       111,111 Directories                                           
         -> 1,111,110,000 Files                                                 
         -> 4,444,440,000 KiB File Data                                         

    4. Creating many 4096 KiB files:                                            

       * fdtree -d 10 -l 5 -s 1024 -f 10000 -C                                  
         ->           111,111 Directories                                       
         ->     1,111,110,000 Files                                             
         -> 4,551,106,560,000 KiB File Data                                     
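As a sanity check, the directory, file, and data totals above follow directly from the fdtree geometry. A quick Python sketch (assuming fdtree builds a 10-ary tree of depth -l, counts the root level, places -f files in every directory, and sizes files in 4 KiB blocks via -s):

```python
def fdtree_totals(branching, depth, files_per_dir, blocks_per_file):
    """Estimate what fdtree creates (assumption: every level, including
    the root, holds directories/files, and -s counts 4 KiB blocks)."""
    dirs = sum(branching ** i for i in range(depth + 1))
    files = dirs * files_per_dir
    data_kib = files * blocks_per_file * 4
    return dirs, files, data_kib

# Test 1: fdtree -d 10 -l 8 -s 0 -f 0        -> 111,111,111 directories
# Test 4: fdtree -d 10 -l 5 -s 1024 -f 10000 -> 111,111 dirs,
#         1,111,110,000 files, 4,551,106,560,000 KiB of file data
```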

Results for these first four tests are below:                                   

                  | Time (s) |   inodes |  reads |    writes |                  
                --+----------+----------+--------+-----------+                  
    Test 1 Before |    65069 | 37845363 | 831975 |   3214646 |                  
    Test 1 After  |    65069 | 42703608 |    778 |   3327674 |                  
                --+----------+----------+--------+-----------+                  
    Test 2 Before |    65073 | 54257583 | 208647 |   2413056 |                  
    Test 2 After  |    65069 | 54255782 | 200038 |   2533759 |                  
                --+----------+----------+--------+-----------+                  
    Test 3 Before |    65068 | 49857744 | 487130 |   5533348 |                  
    Test 3 After  |    65071 | 52294311 |  16078 |   5648354 |                  
                --+----------+----------+--------+-----------+                  
    Test 4 Before |    34854 |  2448329 | 385870 | 162116572 |                  
    Test 4 After  |    32419 |  2448329 |   2339 | 162175706 |                  
                --+----------+----------+--------+-----------+                  

    * "Time (s)" - The run time of the test in seconds                          
    * "inodes"   - The number of inodes created by the test                     
    * "reads"    - The number of reads performed by the test                    
    * "writes"   - The number of writes performed by the test

As you can see from the table above, running with this patch stack              
*significantly* reduced the number of reads performed in 3 out of the 4         
tests (due to an improved ARC hit rate).                                        
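The percentages quoted in the summary at the top can be recomputed directly from this table; for example, with a trivial helper (illustrative only, not part of the patch stack):

```python
def pct_fewer(before, after):
    """Percent reduction from 'before' to 'after', one decimal place."""
    return round(100.0 * (before - after) / before, 1)

# Test 1 reads: pct_fewer(831975, 778)   -> 99.9 (% fewer reads)
# Test 3 reads: pct_fewer(487130, 16078) -> 96.7
# Test 4 reads: pct_fewer(385870, 2339)  -> 99.4
```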

In addition to the tests described above, which specifically targeted           
creates only, three other workloads were tested. These additional tests         
were targeting rsync performance against the datasets created in the            
previous tests. A brief description of the workloads and the results for
these tests is below:

    ** NOTE: Aside from (6), these tests didn't run to completion. They         
             ran for a set amount of time and then were terminated.             

    5. Rsync the dataset created in Test 1 to a new dataset:                    

       * rsync -a /tank/test-1 /tank/test-5                                     

    6. Rsync the dataset created in Test 2 to a new dataset:                    

       * rsync -a /tank/test-2 /tank/test-6                                     

    7. Rsync the dataset created in Test 3 to a new dataset:                    

       * rsync -a /tank/test-3 /tank/test-7                                     

Results for Test 5, 6, and 7 are below:                                         

                  | Time (s) |   inodes |    reads |  writes |                  
                --+----------+----------+----------+---------+                  
    Test 5 Before |    93041 | 17921014 | 47632823 | 4094848 |                  
    Test 5 After  |    93029 | 29785847 | 30376206 | 4484459 |                  
                --+----------+----------+----------+---------+                  
    Test 6 Before |    15290 | 54264474 |  6018331 |  733087 |                  
    Test 6 After  |    11573 | 54260826 |  4155661 |  617285 |                  
                --+----------+----------+----------+---------+                  
    Test 7 Before |    93057 | 10093749 | 41561635 | 3659098 |                  
    Test 7 After  |    93045 | 27587043 | 28773151 | 5612234 |                  
                --+----------+----------+----------+---------+                  

    * "Time (s)" - The run time of the test in seconds                          
    * "inodes"   - The number of inodes created by the test                     
    * "reads"    - The number of reads performed by the test                    
    * "writes"   - The number of writes performed by the test                   

Signed-off-by: Prakash Surya <surya1@llnl.gov>

@DeHackEd (Contributor) commented Feb 6, 2014

I've had my L2ARC size go bonkers (used exceeds the size of the SSD, free = 16 EB), as viewed from zpool iostat -v, while using the "Limit L2ARC header footprint" patch from the previous version of this series. I've mostly been omitting that patch from the series I've been running thus far.

I'll see about giving this series a spin later.

@prakashsurya (Member, Author)

And running without 74e2045, things are "normal"? Honestly, I haven't given that patch the testing attention it probably deserves, so I wouldn't be surprised if it's bugged. In that case, it might be best to take it out until I can properly test it.

@behlendorf behlendorf added this to the 0.6.3 milestone Feb 6, 2014
@behlendorf behlendorf added the Bug label Feb 6, 2014
@behlendorf (Contributor)

@DeHackEd The fix for the l2arc issue #2093 was merged today in to master. @prakashsurya if you rebase these changes on the latest master we can avoid other people having that issue while testing.

@prakashsurya (Member, Author)

OK, I just rebased this stack onto master. It should include the fix from #2093 now.

@prakashsurya (Member, Author)

In case these are useful to anybody, I'm going to post graphs of various arcstat parameters vs. time for each of the 14 unique tests I've run so far. They helped me understand the ARC's behavior, so maybe they'll help others as well.

[arcstat graphs attached for each run: t1-after, t1-before, t2-after, t2-before, t3-after, t3-before, t4-after, t4-before, t5-after, t5-before, t6-after, t6-before, t7-after, t7-before]

behlendorf and others added 6 commits February 12, 2014 12:31
Decrease the minimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the minimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel starts calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, an attempt is made to maintain a minimum value of arc_p,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
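In other words, the shrink path now moves arc_c and arc_p in lockstep, each clamped at a floor. A minimal Python sketch of the behavior described above (illustrative names, not the actual arc.c code):

```python
def shrink(arc_c, arc_p, nbytes, arc_c_min, arc_p_min):
    """Decrement both targets by the requested byte count, but never
    below their respective minimums (sketch of the described fix)."""
    arc_c = max(arc_c - nbytes, arc_c_min)
    arc_p = max(arc_p - nbytes, arc_p_min)
    return arc_c, arc_p
```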
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
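The throttling condition described above can be sketched as follows (a hypothetical helper; the real logic lives in arc_get_data_buf):

```python
def recycle_target(arc_p, anon_size, mru_size):
    """While anon + mru stays below the target arc_p, room for new
    buffers is made by evicting from the mfu side -- so an arc_p pinned
    at its maximum starves the mfu list (sketch of the described
    behavior, not the actual arc.c code)."""
    if anon_size + mru_size < arc_p:
        return "mfu"
    return "mru"
```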
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, its removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling it by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
This reverts commit c11a12b.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@prakashsurya (Member, Author)

I've rebased onto master and removed the L2ARC patches (those need some more testing before I trust them).

Prakash Surya added 2 commits February 12, 2014 13:29
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Using "arc_meta_used" to determine if the arc's mru list is over its
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@behlendorf (Contributor)

@prakashsurya Can you run these patches under ztest. I saw an assertion related to list handling tripped by the testing.

@prakashsurya (Member, Author)

@behlendorf Sure, I have a fedora VM that I can run it on.

FWIW, I've been running this code underneath Lustre for almost 2 weeks now with a create workload running. With 1 MDS, 8 OSS, and 14 compute nodes, I've managed to create about 745 million files so far at an average rate of about 2.25K creates per second.

@prakashsurya (Member, Author)

Here's the backtrace for the ztest assertion:

(gdb) bt
#0  0x0000003b0f4359e9 in raise () from /lib64/libc.so.6
#1  0x0000003b0f4370f8 in abort () from /lib64/libc.so.6
#2  0x0000003b0f42e956 in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003b0f42ea02 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff7bdb03e in list_destroy (list=<optimized out>) at ../../lib/libspl/list.c:80
#5  0x00007ffff7890014 in arc_fini () at ../../module/zfs/arc.c:4238
#6  0x00007ffff78a48cd in dmu_fini () at ../../module/zfs/dmu.c:2033
#7  0x00007ffff78fd085 in spa_fini () at ../../module/zfs/spa_misc.c:1677
#8  0x00007ffff788040c in kernel_fini () at ../../lib/libzpool/kernel.c:1144
#9  0x0000000000404ccd in main (argc=<optimized out>, argv=0x7fffffffe570) at ../../cmd/zdb/zdb.c:3425

Looks to be introduced by this commit:

commit 513198ff25e9c3dd1b9573e594caa19b72c091fb
Author: Prakash Surya <surya1@llnl.gov>
Date:   Mon Dec 30 09:30:00 2013 -0800

    Prioritize "metadata" in arc_get_data_buf

I'll try to fix it and rebase this pull request today.

Prakash Surya added 3 commits February 19, 2014 12:12
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
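Conceptually, the eviction-type choice after this change behaves like the following sketch (illustrative Python, not the actual arc_get_data_buf implementation):

```python
def evict_type(adding, meta_used, meta_limit):
    """When inserting a metadata buffer, make room by evicting data
    unless metadata has already hit its own limit -- so the meta/data
    ratio is no longer frozen once the arc fills (sketch)."""
    if adding == "metadata" and meta_used < meta_limit:
        return "data"
    return adding
```

In the 10G/4G example above, a new metadata buffer would evict data (1G metadata is well under the 4G limit) rather than pushing out the scarce metadata.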
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Unfortunately, this change is a cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try to describe the problem.

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space gets filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation won't occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is independent of the size of the files, and depends
only on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a file size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
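The entries in this table can be regenerated directly from the formulas above:

```python
def pinned_ratio(n_dnodes, file_size):
    """Fraction of the total arc footprint that is pinned metadata,
    per the overhead accounting above (all sizes in bytes):
    16 KiB dnode object block, 928 bytes per dnode_t, plus per-buffer
    arc_buf_t/arc_buf_hdr_t/dmu_buf_impl_t overheads."""
    pinned = 16384 + n_dnodes * 928 + (n_dnodes + 1) * (72 + 264 + 280)
    total = pinned + n_dnodes * file_size
    return pinned / total

# e.g. pinned_ratio(1, 512) ~= 0.973132, pinned_ratio(32, 16384) ~= 0.112423
```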

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@prakashsurya (Member, Author)

I think I fixed the ztest failure. I just removed this change from commit 513198f

@@ -2337,27 +2332,8 @@ arc_flush(spa_t *spa)                                    
        if (spa)                                                                   
                guid = spa_load_guid(spa);                                         

-       while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) {                    
-               (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);         
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-       while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) {                
-               (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_METADATA);  
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-       while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) {                    
-               (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);         
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-       while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) {                
-               (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_METADATA);  
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-                                                                                  
+       arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);                        
+       arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);

@behlendorf (Contributor)

Yes. We should stick with the original arc_flush; if there's some cleanup to do here we can always follow up in another patch. Although I don't think we'll need to.

That's a nice result for Lustre. We should expect similar good behavior through the ZPL, which is what we have seen in testing.

@prakashsurya (Member, Author)

Yeah, that portion was largely just cleanup. I think it brought about the failures in the arc_fini call path because I removed the "while" loop. So if any buffers were skipped in arc_evict, list_destroy would assert due to the arc state lists not being empty.

behlendorf added a commit that referenced this pull request Feb 22, 2014
Decrease the minimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
In an attempt to prevent arc_c from collapsing "too fast", arc_shrink() now also decrements arc_p by the "bytes" parameter, in the same way that arc_c is updated.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
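
A standalone sketch of how such a gate might look (the tunable name and
mock state here are assumptions; the real module option may differ):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical tunable: nonzero disables the aggressive arc_p growth. */
static int zfs_arc_p_aggressive_disable = 1;

static uint64_t arc_c = 1000; /* mock target arc size */
static uint64_t arc_p = 400;  /* mock target mru size */

/* Sketch of the guarded arc_p bump in arc_get_data_buf(). */
static void
arc_p_maybe_grow(uint64_t bytes)
{
	if (zfs_arc_p_aggressive_disable)
		return; /* disabled by default; tunable at runtime */

	/* Legacy behavior: grow arc_p toward arc_c for new anon buffers. */
	arc_p = MIN(arc_c, arc_p + bytes);
}
```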

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.
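
The unclamped adjustment can be sketched standalone (mock state; the
function names and "delta" parameter are assumptions, the real logic
lives in arc_adapt()):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static uint64_t arc_c = 1000; /* mock target arc size */
static uint64_t arc_p = 100;  /* mock target mru size */

/*
 * mfu ghost hit: drive arc_p down, now with no lower clamp, so the
 * mfu side of the cache can grow to (nearly) all of arc_c.
 */
static void
mfu_ghost_hit(uint64_t delta)
{
	arc_p = (arc_p > delta) ? arc_p - delta : 0;
}

/* mru ghost hit: drive arc_p up, capped at arc_c. */
static void
mru_ghost_hit(uint64_t delta)
{
	arc_p = MIN(arc_c, arc_p + delta);
}
```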

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
It's unclear why adjustments to arc_p need to be dampened as they are
in arc_adjust. With that said, its removal significantly improves the
arc's ability to "warm up" to a given workload. Thus, I'm disabling the
dampener by default until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
This reverts commit c11a12b.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.
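
The described flow might look like this standalone mock (the list
names and single-pass eviction order are assumptions; the real
arc_adjust_meta operates on the actual ARC lists):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t arc_meta_used = 900;  /* mock, cf. arc_size */
static uint64_t arc_meta_limit = 600; /* mock, cf. arc_c */

/* Mock metadata bytes on the regular and ghost lists. */
static uint64_t mru_meta = 300, mfu_meta = 200;
static uint64_t mru_ghost_meta = 250, mfu_ghost_meta = 150;

static void
evict_meta(uint64_t *list, uint64_t target)
{
	uint64_t n = (*list < target) ? *list : target;
	*list -= n;
	arc_meta_used -= n;
}

/*
 * Sketch of arc_adjust_meta(): walk the regular lists and then the
 * ghost lists, evicting metadata until arc_meta_used is back under
 * arc_meta_limit, mirroring arc_adjust()'s arc_size vs arc_c logic.
 */
static void
arc_adjust_meta(void)
{
	uint64_t *lists[] = { &mru_meta, &mfu_meta,
	    &mru_ghost_meta, &mfu_ghost_meta };
	for (int i = 0; i < 4 && arc_meta_used > arc_meta_limit; i++)
		evict_meta(lists[i], arc_meta_used - arc_meta_limit);
}
```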

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Using "arc_meta_used" to determine if the arc's mru list is over its
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
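
The eviction-type choice described above can be sketched standalone
(the simplified signature and mock counters are assumptions):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ARC_BUFC_DATA, ARC_BUFC_METADATA } arc_buf_contents_t;

static uint64_t arc_meta_used = 1;  /* mock: 1G of metadata */
static uint64_t arc_meta_limit = 4; /* mock: 4G limit */

/*
 * When adding a "metadata" buffer, prefer evicting "data" unless
 * metadata is already at its limit; this keeps the data/metadata
 * ratio from freezing at whatever it was when the size limit was
 * first hit.
 */
static arc_buf_contents_t
arc_evict_type(arc_buf_contents_t new_buf_type)
{
	if (new_buf_type == ARC_BUFC_METADATA &&
	    arc_meta_used < arc_meta_limit)
		return (ARC_BUFC_DATA);
	return (new_buf_type);
}
```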

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.
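
A toy illustration of the accounting split (mock counters, not the
real kstat machinery):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ARC_BUFC_DATA, ARC_BUFC_METADATA } arc_buf_contents_t;

/* Mock kstat counters after the split. */
static uint64_t data_size; /* ARC_BUFC_DATA bytes only */
static uint64_t meta_size; /* ARC_BUFC_METADATA bytes only */

static void
account_buf(arc_buf_contents_t type, uint64_t bytes)
{
	if (type == ARC_BUFC_DATA)
		data_size += bytes;
	else
		meta_size += bytes;
}

/* The pre-split "data_size" is recovered by summing the two. */
static uint64_t
legacy_data_size(void)
{
	return (data_size + meta_size);
}
```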

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Unfortunately, this change is a cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try to describe the problem.

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, and this dnode will in turn keep a hold
on its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to is that "data"
buffers have the potential to pin "metadata" in the arc (as a result of
these dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space gets filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation won't occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.
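
To double-check the arithmetic, a standalone helper using the structure
sizes quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* Structure sizes quoted above, in bytes. */
#define DNODE_BLOCK_BYTES	16384
#define DNODE_T_BYTES		928
#define ARC_BUF_T_BYTES		72
#define ARC_BUF_HDR_T_BYTES	264
#define DMU_BUF_IMPL_T_BYTES	280

/* Pinned "arc_meta_used" bytes for N files sharing one dnode block. */
static uint64_t
pinned_meta_bytes(uint64_t n)
{
	return (DNODE_BLOCK_BYTES + n * DNODE_T_BYTES +
	    (n + 1) * (ARC_BUF_T_BYTES + ARC_BUF_HDR_T_BYTES +
	    DMU_BUF_IMPL_T_BYTES));
}
```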

Now, given a file size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
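
For the curious, the ratios in the table can be reproduced with a small
standalone helper based on the simplified formulas above:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ratio of pinned metadata to total (meta + data) arc consumption,
 * using the simplified formula: meta = 17000 + N * 1544, data = S * N
 * (one data buffer per file).
 */
static double
pinned_meta_ratio(uint64_t n, uint64_t s)
{
	double meta = 17000.0 + (double)n * 1544.0;
	double data = (double)s * (double)n;
	return (meta / (meta + data));
}
```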

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
@behlendorf
Contributor

Merged as:

0ad85ed Merge branch 'arc-changes'
2b13331 Set "arc_meta_limit" to 3/4 arc_c_max by default
cc7f677 Split "data_size" into "meta" and "data"
da8ccd0 Prioritize "metadata" in arc_get_data_buf
77765b5 Remove "arc_meta_used" from arc_adjust calculation
94520ca Prune metadata from ghost lists in arc_adjust_meta
1e3cb67 Revert "Return -1 from arc_shrinker_func()"
6242278 Disable arc_p adapt dampener by default
f521ce1 Allow "arc_p" to drop to zero or grow to "arc_c"
89c8cac Disable aggressive arc_p growth by default
39e055c Adjust arc_p based on "bytes" in arc_shrink
9141582 Set zfs_arc_min to 4MB

ryao pushed a commit to ryao/zfs that referenced this pull request Apr 9, 2014
Decrease the minimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the minimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue openzfs#2110
@prakashsurya prakashsurya deleted the arc-changes-2 branch July 24, 2014 18:03