
Improve ARC hit rate with metadata heavy workloads #2110

Closed
wants to merge 11 commits

Conversation

prakashsurya (Member)

This stack of patches has been empirically shown to drastically improve         
the hit rate of the ARC for certain workloads. As a result, fewer reads         
to disk are required, which is generally a good thing and can                   
drastically improve performance if the workload is disk limited.                

For the impatient, I'll summarize the results of the tests performed:           

    * Test 1 - Creating many empty directories. This test saw 99.9%             
               fewer reads and 12.8% more inodes created when running           
               *with* these changes.                                            

    * Test 2 - Creating many empty files. This test saw 4% fewer reads          
               and 0% more inodes created when running *with* these             
               changes.                                                         

    * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer             
               reads and 4.9% more inodes created when running *with*           
               these changes.                                                   

    * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer          
               reads and 0% more inodes created (but took 6.9% fewer            
               seconds to complete) when running *with* these changes.                                              

    * Test 5 - Rsync'ing a dataset with many empty directories. This            
               test saw 36.2% fewer reads and 66.2% more inodes created         
               when running *with* these changes.                               

    * Test 6 - Rsync'ing a dataset with many empty files. This test saw         
               30.9% fewer reads and 0% more inodes created (but took           
               24.3% fewer seconds to complete) when running *with*             
               these changes.                                                   

    * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw         
               30.8% fewer reads and 173.3% more inodes created when            
               running *with* these changes.                                    

For the patient, what follows is a more detailed
description of the tests performed and the results gathered.

All the tests were run using identical machines, each with a pool               
consisting of 5 mirror pairs with 2TB 7200 RPM disks. Each test was run         
twice, once *without* this set of patches and again *with* this set of          
patches to highlight the performance changes introduced. The first four         
workloads tested were:                                                          

    ** NOTE: None of these tests were run to completion. They ran for a         
             set amount of time and then were terminated or hit ENOSPC.         

    1. Creating many empty directories:                                         

       * fdtree -d 10 -l 8 -s 0 -f 0 -C                                         
         -> 111,111,111 Directories                                             
         ->           0 Files                                                   
         ->           0 KiB File Data                                           

    2. Creating many empty files:                                               

       * fdtree -d 10 -l 5 -s 0 -f 10000 -C                                     
         ->       111,111 Directories                                           
         -> 1,111,110,000 Files                                                 
         ->             0 KiB File Data                                         

    3. Creating many 4 KiB files:                                               

       * fdtree -d 10 -l 5 -s 1 -f 10000 -C                                     
         ->       111,111 Directories                                           
         -> 1,111,110,000 Files                                                 
         -> 4,444,440,000 KiB File Data                                         

    4. Creating many 4096 KiB files:                                            

       * fdtree -d 10 -l 5 -s 1024 -f 10000 -C                                  
         ->           111,111 Directories                                       
         ->     1,111,110,000 Files                                             
         -> 4,551,106,560,000 KiB File Data                                     
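As a sanity check, the directory, file, and data totals above follow directly from the fdtree geometry. A quick Python sketch (assuming fdtree builds a 10-ary tree of depth -l, counts the root level, places -f files in every directory, and sizes files in 4 KiB blocks via -s):

```python
def fdtree_totals(branching, depth, files_per_dir, blocks_per_file):
    """Estimate what fdtree creates (assumption: every level, including
    the root, holds directories/files, and -s counts 4 KiB blocks)."""
    dirs = sum(branching ** i for i in range(depth + 1))
    files = dirs * files_per_dir
    data_kib = files * blocks_per_file * 4
    return dirs, files, data_kib

# Test 1: fdtree -d 10 -l 8 -s 0 -f 0        -> 111,111,111 directories
# Test 4: fdtree -d 10 -l 5 -s 1024 -f 10000 -> 111,111 dirs,
#         1,111,110,000 files, 4,551,106,560,000 KiB of file data
```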

Results for these first four tests are below:                                   

                  | Time (s) |   inodes |  reads |    writes |                  
                --+----------+----------+--------+-----------+                  
    Test 1 Before |    65069 | 37845363 | 831975 |   3214646 |                  
    Test 1 After  |    65069 | 42703608 |    778 |   3327674 |                  
                --+----------+----------+--------+-----------+                  
    Test 2 Before |    65073 | 54257583 | 208647 |   2413056 |                  
    Test 2 After  |    65069 | 54255782 | 200038 |   2533759 |                  
                --+----------+----------+--------+-----------+                  
    Test 3 Before |    65068 | 49857744 | 487130 |   5533348 |                  
    Test 3 After  |    65071 | 52294311 |  16078 |   5648354 |                  
                --+----------+----------+--------+-----------+                  
    Test 4 Before |    34854 |  2448329 | 385870 | 162116572 |                  
    Test 4 After  |    32419 |  2448329 |   2339 | 162175706 |                  
                --+----------+----------+--------+-----------+                  

    * "Time (s)" - The run time of the test in seconds                          
    * "inodes"   - The number of inodes created by the test                     
    * "reads"    - The number of reads performed by the test                    
    * "writes"   - The number of writes performed by the test

As you can see from the table above, running with this patch stack              
*significantly* reduced the number of reads performed in 3 out of the 4         
tests (due to an improved ARC hit rate).                                        
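The percentages quoted in the summary at the top can be recomputed directly from this table; for example, with a trivial helper (illustrative only, not part of the patch stack):

```python
def pct_fewer(before, after):
    """Percent reduction from 'before' to 'after', one decimal place."""
    return round(100.0 * (before - after) / before, 1)

# Test 1 reads: pct_fewer(831975, 778)   -> 99.9 (% fewer reads)
# Test 3 reads: pct_fewer(487130, 16078) -> 96.7
# Test 4 reads: pct_fewer(385870, 2339)  -> 99.4
```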

In addition to the tests described above, which specifically targeted           
creates only, three other workloads were tested. These additional tests         
were targeting rsync performance against the datasets created in the            
previous tests. A brief description of the workloads and the results for
these tests is below:

    ** NOTE: Aside from (6), these tests didn't run to completion. They         
             ran for a set amount of time and then were terminated.             

    5. Rsync the dataset created in Test 1 to a new dataset:                    

       * rsync -a /tank/test-1 /tank/test-5                                     

    6. Rsync the dataset created in Test 2 to a new dataset:                    

       * rsync -a /tank/test-2 /tank/test-6                                     

    7. Rsync the dataset created in Test 3 to a new dataset:                    

       * rsync -a /tank/test-3 /tank/test-7                                     

Results for Test 5, 6, and 7 are below:                                         

                  | Time (s) |   inodes |    reads |  writes |                  
                --+----------+----------+----------+---------+                  
    Test 5 Before |    93041 | 17921014 | 47632823 | 4094848 |                  
    Test 5 After  |    93029 | 29785847 | 30376206 | 4484459 |                  
                --+----------+----------+----------+---------+                  
    Test 6 Before |    15290 | 54264474 |  6018331 |  733087 |                  
    Test 6 After  |    11573 | 54260826 |  4155661 |  617285 |                  
                --+----------+----------+----------+---------+                  
    Test 7 Before |    93057 | 10093749 | 41561635 | 3659098 |                  
    Test 7 After  |    93045 | 27587043 | 28773151 | 5612234 |                  
                --+----------+----------+----------+---------+                  

    * "Time (s)" - The run time of the test in seconds                          
    * "inodes"   - The number of inodes created by the test                     
    * "reads"    - The number of reads performed by the test                    
    * "writes"   - The number of writes performed by the test                   

Signed-off-by: Prakash Surya <surya1@llnl.gov>

@DeHackEd (Contributor) commented Feb 6, 2014

I've had my L2ARC size go bonkers (used exceeds the size of the SSD, free = 16 EB), as viewed from zpool iostat -v, while using the "Limit L2ARC header footprint" patch from the previous version of this series. I've mostly been omitting that patch from the series I've been running thus far.

I'll see about giving this series a spin later.

@prakashsurya (Member, Author)

And running without 74e2045, things are "normal"? Honestly, I haven't given that patch the testing attention it probably deserves, so I wouldn't be surprised if it's bugged. In that case, it might be best to take it out until I can properly test it.

@behlendorf behlendorf added this to the 0.6.3 milestone Feb 6, 2014
@behlendorf behlendorf added the Bug label Feb 6, 2014
@behlendorf (Contributor)

@DeHackEd The fix for the l2arc issue #2093 was merged today in to master. @prakashsurya if you rebase these changes on the latest master we can avoid other people having that issue while testing.

@prakashsurya (Member, Author)

OK, I just rebased this stack onto master. It should include the fix from #2093 now.

@prakashsurya (Member, Author)

In case these are useful to anybody, I'm going to post graphs of various arcstat parameters vs. time for each of the 14 unique tests I've run so far. They helped me understand the ARC's behavior, so maybe they'll help others as well.

[arcstat graphs attached for each run: t1-after, t1-before, t2-after, t2-before, t3-after, t3-before, t4-after, t4-before, t5-after, t5-before, t6-after, t6-before, t7-after, t7-before]

behlendorf and others added 6 commits February 12, 2014 12:31
Decrease the minimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the minimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel starts calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, an attempt is made to maintain a minimum value of arc_p,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
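In other words, the shrink path now moves arc_c and arc_p in lockstep, each clamped at a floor. A minimal Python sketch of the behavior described above (illustrative names, not the actual arc.c code):

```python
def shrink(arc_c, arc_p, nbytes, arc_c_min, arc_p_min):
    """Decrement both targets by the requested byte count, but never
    below their respective minimums (sketch of the described fix)."""
    arc_c = max(arc_c - nbytes, arc_c_min)
    arc_p = max(arc_p - nbytes, arc_p_min)
    return arc_c, arc_p
```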
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
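The throttling condition described above can be sketched as follows (a hypothetical helper; the real logic lives in arc_get_data_buf):

```python
def recycle_target(arc_p, anon_size, mru_size):
    """While anon + mru stays below the target arc_p, room for new
    buffers is made by evicting from the mfu side -- so an arc_p pinned
    at its maximum starves the mfu list (sketch of the described
    behavior, not the actual arc.c code)."""
    if anon_size + mru_size < arc_p:
        return "mfu"
    return "mru"
```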
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, its removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling it by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
This reverts commit c11a12b.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@prakashsurya (Member, Author)

I've rebased onto master and removed the L2ARC patches (those need some more testing before I trust them).

Prakash Surya added 2 commits February 12, 2014 13:29
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Using "arc_meta_used" to determine if the arc's mru list is over its
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@behlendorf (Contributor)

@prakashsurya Can you run these patches under ztest. I saw an assertion related to list handling tripped by the testing.

@prakashsurya (Member, Author)

@behlendorf Sure, I have a fedora VM that I can run it on.

FWIW, I've been running this code underneath Lustre for almost 2 weeks now with a create workload running. With 1 MDS, 8 OSS, and 14 compute nodes, I've managed to create about 745 million files so far at an average rate of about 2.25K creates per second.

@prakashsurya (Member, Author)

Here's the backtrace for the ztest assertion:

(gdb) bt
#0  0x0000003b0f4359e9 in raise () from /lib64/libc.so.6
#1  0x0000003b0f4370f8 in abort () from /lib64/libc.so.6
#2  0x0000003b0f42e956 in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003b0f42ea02 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff7bdb03e in list_destroy (list=<optimized out>) at ../../lib/libspl/list.c:80
#5  0x00007ffff7890014 in arc_fini () at ../../module/zfs/arc.c:4238
#6  0x00007ffff78a48cd in dmu_fini () at ../../module/zfs/dmu.c:2033
#7  0x00007ffff78fd085 in spa_fini () at ../../module/zfs/spa_misc.c:1677
#8  0x00007ffff788040c in kernel_fini () at ../../lib/libzpool/kernel.c:1144
#9  0x0000000000404ccd in main (argc=<optimized out>, argv=0x7fffffffe570) at ../../cmd/zdb/zdb.c:3425

Looks to be introduced by this commit:

commit 513198ff25e9c3dd1b9573e594caa19b72c091fb
Author: Prakash Surya <surya1@llnl.gov>
Date:   Mon Dec 30 09:30:00 2013 -0800

    Prioritize "metadata" in arc_get_data_buf

I'll try to fix it and rebase this pull request today.

Prakash Surya added 3 commits February 19, 2014 12:12
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
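Conceptually, the eviction-type choice after this change behaves like the following sketch (illustrative Python, not the actual arc_get_data_buf implementation):

```python
def evict_type(adding, meta_used, meta_limit):
    """When inserting a metadata buffer, make room by evicting data
    unless metadata has already hit its own limit -- so the meta/data
    ratio is no longer frozen once the arc fills (sketch)."""
    if adding == "metadata" and meta_used < meta_limit:
        return "data"
    return adding
```

In the 10G/4G example above, a new metadata buffer would evict data (1G metadata is well under the 4G limit) rather than pushing out the scarce metadata.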
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Unfortunately, this change is a cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try to describe the problem.

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space gets filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation won't occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is independent of the size of the files, and depends
only on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a file size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
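The entries in this table can be regenerated directly from the formulas above:

```python
def pinned_ratio(n_dnodes, file_size):
    """Fraction of the total arc footprint that is pinned metadata,
    per the overhead accounting above (all sizes in bytes):
    16 KiB dnode object block, 928 bytes per dnode_t, plus per-buffer
    arc_buf_t/arc_buf_hdr_t/dmu_buf_impl_t overheads."""
    pinned = 16384 + n_dnodes * 928 + (n_dnodes + 1) * (72 + 264 + 280)
    total = pinned + n_dnodes * file_size
    return pinned / total

# e.g. pinned_ratio(1, 512) ~= 0.973132, pinned_ratio(32, 16384) ~= 0.112423
```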

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@prakashsurya (Member, Author)

I think I fixed the ztest failure. I just removed this change from commit 513198f

@@ -2337,27 +2332,8 @@ arc_flush(spa_t *spa)                                    
        if (spa)                                                                   
                guid = spa_load_guid(spa);                                         

-       while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) {                    
-               (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);         
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-       while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) {                
-               (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_METADATA);  
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-       while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) {                    
-               (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);         
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-       while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) {                
-               (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_METADATA);  
-               if (spa)                                                           
-                       break;                                                     
-       }                                                                          
-                                                                                  
+       arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);                        
+       arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);

@behlendorf (Contributor)

Yes. We should stick with the original arc_flush; if there's some cleanup to do here we can always follow up in another patch. Although I don't think we'll need to.

That's a nice result for Lustre. We should expect similar good behavior through the ZPL, which is what we have seen in testing.

@prakashsurya (Member, Author)

Yeah, that portion was largely just cleanup. I think it brought about the failures in the arc_fini call path because I removed the "while" loop. So if any buffers were skipped in arc_evict, list_destroy would assert due to the arc state lists not being empty.

behlendorf added a commit that referenced this pull request Feb 22, 2014
Decrease the minimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
In an attempt to prevent arc_c from collapsing "too fast", arc_shrink() now also decrements arc_p by the "bytes" parameter, in the same way that arc_c is updated.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
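
A standalone sketch of how such a gate might look (the tunable name and
mock state here are assumptions; the real module option may differ):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical tunable: nonzero disables the aggressive arc_p growth. */
static int zfs_arc_p_aggressive_disable = 1;

static uint64_t arc_c = 1000; /* mock target arc size */
static uint64_t arc_p = 400;  /* mock target mru size */

/* Sketch of the guarded arc_p bump in arc_get_data_buf(). */
static void
arc_p_maybe_grow(uint64_t bytes)
{
	if (zfs_arc_p_aggressive_disable)
		return; /* disabled by default; tunable at runtime */

	/* Legacy behavior: grow arc_p toward arc_c for new anon buffers. */
	arc_p = MIN(arc_c, arc_p + bytes);
}
```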

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.
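
The unclamped adjustment can be sketched standalone (mock state; the
function names and "delta" parameter are assumptions, the real logic
lives in arc_adapt()):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static uint64_t arc_c = 1000; /* mock target arc size */
static uint64_t arc_p = 100;  /* mock target mru size */

/*
 * mfu ghost hit: drive arc_p down, now with no lower clamp, so the
 * mfu side of the cache can grow to (nearly) all of arc_c.
 */
static void
mfu_ghost_hit(uint64_t delta)
{
	arc_p = (arc_p > delta) ? arc_p - delta : 0;
}

/* mru ghost hit: drive arc_p up, capped at arc_c. */
static void
mru_ghost_hit(uint64_t delta)
{
	arc_p = MIN(arc_c, arc_p + delta);
}
```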

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
It's unclear why adjustments to arc_p need to be dampened as they are
in arc_adjust. With that said, its removal significantly improves the
arc's ability to "warm up" to a given workload. Thus, I'm disabling the
dampener by default until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
This reverts commit c11a12b.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.
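
The described flow might look like this standalone mock (the list
names and single-pass eviction order are assumptions; the real
arc_adjust_meta operates on the actual ARC lists):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t arc_meta_used = 900;  /* mock, cf. arc_size */
static uint64_t arc_meta_limit = 600; /* mock, cf. arc_c */

/* Mock metadata bytes on the regular and ghost lists. */
static uint64_t mru_meta = 300, mfu_meta = 200;
static uint64_t mru_ghost_meta = 250, mfu_ghost_meta = 150;

static void
evict_meta(uint64_t *list, uint64_t target)
{
	uint64_t n = (*list < target) ? *list : target;
	*list -= n;
	arc_meta_used -= n;
}

/*
 * Sketch of arc_adjust_meta(): walk the regular lists and then the
 * ghost lists, evicting metadata until arc_meta_used is back under
 * arc_meta_limit, mirroring arc_adjust()'s arc_size vs arc_c logic.
 */
static void
arc_adjust_meta(void)
{
	uint64_t *lists[] = { &mru_meta, &mfu_meta,
	    &mru_ghost_meta, &mfu_ghost_meta };
	for (int i = 0; i < 4 && arc_meta_used > arc_meta_limit; i++)
		evict_meta(lists[i], arc_meta_used - arc_meta_limit);
}
```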

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Using "arc_meta_used" to determine if the arc's mru list is over its
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
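
The eviction-type choice described above can be sketched standalone
(the simplified signature and mock counters are assumptions):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ARC_BUFC_DATA, ARC_BUFC_METADATA } arc_buf_contents_t;

static uint64_t arc_meta_used = 1;  /* mock: 1G of metadata */
static uint64_t arc_meta_limit = 4; /* mock: 4G limit */

/*
 * When adding a "metadata" buffer, prefer evicting "data" unless
 * metadata is already at its limit; this keeps the data/metadata
 * ratio from freezing at whatever it was when the size limit was
 * first hit.
 */
static arc_buf_contents_t
arc_evict_type(arc_buf_contents_t new_buf_type)
{
	if (new_buf_type == ARC_BUFC_METADATA &&
	    arc_meta_used < arc_meta_limit)
		return (ARC_BUFC_DATA);
	return (new_buf_type);
}
```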

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.
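
A toy illustration of the accounting split (mock counters, not the
real kstat machinery):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ARC_BUFC_DATA, ARC_BUFC_METADATA } arc_buf_contents_t;

/* Mock kstat counters after the split. */
static uint64_t data_size; /* ARC_BUFC_DATA bytes only */
static uint64_t meta_size; /* ARC_BUFC_METADATA bytes only */

static void
account_buf(arc_buf_contents_t type, uint64_t bytes)
{
	if (type == ARC_BUFC_DATA)
		data_size += bytes;
	else
		meta_size += bytes;
}

/* The pre-split "data_size" is recovered by summing the two. */
static uint64_t
legacy_data_size(void)
{
	return (data_size + meta_size);
}
```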

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
behlendorf pushed a commit that referenced this pull request Feb 22, 2014
Unfortunately, this change is a cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try to describe the problem.

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, and this dnode will in turn keep a hold
on its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to is that "data"
buffers have the potential to pin "metadata" in the arc (as a result of
these dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space gets filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation won't occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.
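
To double-check the arithmetic, a standalone helper using the structure
sizes quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* Structure sizes quoted above, in bytes. */
#define DNODE_BLOCK_BYTES	16384
#define DNODE_T_BYTES		928
#define ARC_BUF_T_BYTES		72
#define ARC_BUF_HDR_T_BYTES	264
#define DMU_BUF_IMPL_T_BYTES	280

/* Pinned "arc_meta_used" bytes for N files sharing one dnode block. */
static uint64_t
pinned_meta_bytes(uint64_t n)
{
	return (DNODE_BLOCK_BYTES + n * DNODE_T_BYTES +
	    (n + 1) * (ARC_BUF_T_BYTES + ARC_BUF_HDR_T_BYTES +
	    DMU_BUF_IMPL_T_BYTES));
}
```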

Now, given a file size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
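
For the curious, the ratios in the table can be reproduced with a small
standalone helper based on the simplified formulas above:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ratio of pinned metadata to total (meta + data) arc consumption,
 * using the simplified formula: meta = 17000 + N * 1544, data = S * N
 * (one data buffer per file).
 */
static double
pinned_meta_ratio(uint64_t n, uint64_t s)
{
	double meta = 17000.0 + (double)n * 1544.0;
	double data = (double)s * (double)n;
	return (meta / (meta + data));
}
```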

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
@behlendorf
Contributor

Merged as:

0ad85ed Merge branch 'arc-changes'
2b13331 Set "arc_meta_limit" to 3/4 arc_c_max by default
cc7f677 Split "data_size" into "meta" and "data"
da8ccd0 Prioritize "metadata" in arc_get_data_buf
77765b5 Remove "arc_meta_used" from arc_adjust calculation
94520ca Prune metadata from ghost lists in arc_adjust_meta
1e3cb67 Revert "Return -1 from arc_shrinker_func()"
6242278 Disable arc_p adapt dampener by default
f521ce1 Allow "arc_p" to drop to zero or grow to "arc_c"
89c8cac Disable aggressive arc_p growth by default
39e055c Adjust arc_p based on "bytes" in arc_shrink
9141582 Set zfs_arc_min to 4MB

ryao pushed a commit to ryao/zfs that referenced this pull request Apr 9, 2014
Decrease the minimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the minimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue openzfs#2110
@prakashsurya prakashsurya deleted the arc-changes-2 branch July 24, 2014 18:03