
AOF rewrite should call fdatasync per second to avoid blocking the main thread's write(2) #1019

Closed
springside opened this issue Mar 21, 2013 · 21 comments

Comments

@springside

On my test machine (64 GB RAM, 100 MB/s disk), when a 10 GB AOF file is rewritten, the main thread's write(2) is blocked for 7 to 14 seconds, and the whole of Redis is blocked.

The log repeats "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis." time and time again.

I changed the kernel parameter to make the system flush to disk more often, so it avoids flushing too much data at once:
sysctl vm.dirty_bytes=33554432 (32 MB)

I think it would be good to modify the AOF rewrite code to call fdatasync directly.
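The idea above can be sketched as follows. This is a minimal illustration, not the actual Redis patch: the function name `write_with_sync` and the threshold value are hypothetical. The rewrite output is written in chunks and fdatasync() is called every time a byte threshold is crossed, so dirty pages are flushed in small increments instead of one huge burst at the end:

```c
#include <stdio.h>
#include <unistd.h>

/* Write len bytes to fp and call fdatasync() once at least
 * sync_interval bytes have accumulated since the last sync.
 * *unsynced carries the running count across calls.
 * Returns 1 on success, 0 on a write error. */
static int write_with_sync(FILE *fp, const void *buf, size_t len,
                           size_t sync_interval, size_t *unsynced)
{
    if (len && fwrite(buf, 1, len, fp) != len) return 0;
    *unsynced += len;
    if (sync_interval && *unsynced >= sync_interval) {
        fflush(fp);               /* stdio buffer -> kernel page cache */
        fdatasync(fileno(fp));    /* page cache -> disk                */
        *unsynced = 0;
    }
    return 1;
}
```

With a 32 MB interval, the kernel never accumulates more than roughly 32 MB of dirty AOF data, so no single flush takes long enough to stall concurrent writes from the main process.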

@antirez
Contributor

antirez commented Mar 21, 2013

Thanks for your report. Do you mean calling fdatasync in the child process that writes the AOF, so that it avoids accumulating too-big buffers to flush to disk? We did something like that in replication during the RDB transfer with good results, so this probably makes sense for AOF as well.

@springside
Author

Yes, and when I changed the parameter to 32 MB, there were no more busy logs and no more blocking :)

100.00% <= 317 milliseconds.

If fdatasync is only called once per second, that is about 100 MB on my test machine, which still causes 2-4 seconds of blocking.

@antirez
Contributor

antirez commented Mar 21, 2013

That's cool. However, did you check whether the time to produce the AOF file changed in a measurable way? Thanks.

@springside
Author

I was running "redis-benchmark -r 100000008 -n 2500000 -d 256 -t SET" during the AOF rewrite.

It took 1 minute to write the 8.6 GB AOF file with vm.dirty_bytes=33554432 (32 MB) and vm.dirty_background_bytes=33554432 (32 MB):

[3106] 21 Mar 17:38:01.059 * Background append only file rewriting started by pid 3292
[3106] 21 Mar 17:38:08.584 * Background AOF buffer size: 80 MB
[3106] 21 Mar 17:38:15.594 * Background AOF buffer size: 180 MB
[3106] 21 Mar 17:38:23.596 * Background AOF buffer size: 280 MB
[3106] 21 Mar 17:38:30.790 * Background AOF buffer size: 380 MB
[3106] 21 Mar 17:38:37.022 * Background AOF buffer size: 480 MB
[3106] 21 Mar 17:38:43.235 * Background AOF buffer size: 580 MB
[3292] 21 Mar 17:39:02.891 * SYNC append only file rewrite performed

After going back to the Red Hat Enterprise Linux 6 defaults, vm.dirty_ratio = 20 and vm.dirty_background_ratio = 10 (6.4 GB on my 64 GB machine), it took 1.5 minutes to write the same file:

[3106] 21 Mar 17:42:02.811 * Background append only file rewriting started by pid 3323
[3106] 21 Mar 17:42:10.569 * Background AOF buffer size: 80 MB
[3106] 21 Mar 17:42:15.616 * Background AOF buffer size: 180 MB
[3106] 21 Mar 17:42:20.640 * Background AOF buffer size: 280 MB
[3106] 21 Mar 17:42:25.695 * Background AOF buffer size: 380 MB
[3106] 21 Mar 17:42:29.110 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[3106] 21 Mar 17:43:08.007 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[3323] 21 Mar 17:43:26.497 * SYNC append only file rewrite performed

and redis-benchmark reports:

100.00% <= 37420 milliseconds

@antirez
Contributor

antirez commented Mar 21, 2013

Very interesting, thanks! I'll provide a branch with an explicit fdatasync() call to see if the results are comparable to what you get by modifying kernel parameters. Thanks for your help, it is appreciated.

@springside
Author

Looking forward to it; I will test it for you. Thanks for your extremely quick reply.

antirez added a commit that referenced this issue Apr 3, 2013
This prevents the kernel from putting too much stuff in the output
buffers, doing too heavy I/O all at once. So the goal of this commit is
to split the disk pressure due to the AOF rewrite process into smaller
spikes.

Please see issue #1019 for more information.
@antirez
Contributor

antirez commented Apr 3, 2013

Dear @springside, I just implemented what we discussed here in the issue-1090 branch. Could you please verify that the good effects are still there, now that we have moved the idea inside Redis instead of modifying the kernel parameters? :-)

The branch is a fork of unstable, but you'll be able to easily cherry pick the two commits implementing this into 2.6 if you need to, or I can make a 2.6 branch with the two commits for you if you want.

Thank you in advance!

@antirez
Contributor

antirez commented Apr 3, 2013

Sorry, I was wrong: cherry-picking is not possible, so I'm creating a 2.6-issue-1090 branch right now.

antirez added a commit that referenced this issue Apr 3, 2013
@antirez
Contributor

antirez commented Apr 3, 2013

done :-)

@springside
Author

Thank you, but I am enjoying my vacation, and the test machine can only be reached from the office network.
I will test it on April 7; thank you again for your quick action.

@antirez
Contributor

antirez commented Apr 4, 2013

No rush! 7 April is perfect :-) Have a great vacation!

@springside
Author

Hi @antirez, the problem is gone.

@springside
Author

Hi @antirez, when will the code be merged into the 2.6/2.8 branches?

@antirez
Contributor

antirez commented Apr 24, 2013

Right now :-)

antirez added a commit that referenced this issue Apr 24, 2013
antirez added a commit that referenced this issue Apr 24, 2013
antirez added a commit that referenced this issue Apr 24, 2013
@antirez
Contributor

antirez commented Apr 24, 2013

Merged everywhere; now adding support to enable/disable it in redis.conf. It will be enabled by default. Thank you for your help!

@antirez antirez closed this as completed Apr 24, 2013
@antirez
Contributor

antirez commented Apr 24, 2013

Feature made configurable and committed, so now we are set for a release... but I've another pending thing in Redis Sentinel that I need to address before committing.

@yoav-steinberg
Contributor

Looking at the fix, it seems that when writing more than 32 MB at once (for example, a 100 MB key) we will only get a single fsync. Am I missing something?
I've actually dealt with this in the past and came up with this patch: https://github.com/GarantiaData/redis/commit/4fe3d6bd40decabf1ddd3bdb5b69b87521e6233d

@charsyam
Contributor

charsyam commented May 2, 2013

@yoav-steinberg Hi,
I have a question: when do you update "r->processed_bytes"? I think that because of this, fsync will only be called if (len > r->io.file.fsync_interval).

So you should add bytes_written to r->processed_bytes before returning from rioWrite.

@yoav-steinberg
Contributor

@charsyam, processed_bytes is updated in the rio layer (it is not specific to files). This is another change in the GarantiaData fork; you can find it here: https://github.com/GarantiaData/redis/commit/3aab06bcc7460b46c96fbb8d3b704ab1227c6be4

@charsyam
Contributor

charsyam commented May 2, 2013

@yoav-steinberg Hi, I think I found another bug in your code (though it's trivial).

If bytes_written becomes larger than io.file.fsync_interval while len still remains, fsync will be called on every remaining iteration until the loop finishes:

static size_t rioFileWrite(rio *r, const void *buf, size_t len) {
    size_t bytes_written = 0;
    while (len) {
        size_t bytes_to_write = (r->io.file.fsync_interval && r->io.file.fsync_interval < len) ? r->io.file.fsync_interval : len;
        if (fwrite((char*)buf + bytes_written,bytes_to_write,1,r->io.file.fp) != 1)
            return 0;
        bytes_written += bytes_to_write;
        len -= bytes_to_write;

        // There is a possibility of calling fsync many times (not intended)
        // if bytes_written is larger than io.file.fsync_interval and there is more data left in the loop.
        if (r->io.file.fsync_interval && r->processed_bytes/r->io.file.fsync_interval < (r->processed_bytes + bytes_written)/r->io.file.fsync_interval)
            fsync(fileno(r->io.file.fp));
    }
    return 1;
}

@yoav-steinberg
Contributor

@charsyam, keen eye, and sorry for keeping posting these links :) Anyway, I fixed that bug shortly afterwards; see: https://github.com/GarantiaData/redis/commit/785cf9297733e32110daed8ca8b16169ea4288b5
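For readers following along: the repeated-fsync problem comes from comparing against r->processed_bytes, which only advances after the whole write returns, so once one chunk crosses an fsync_interval boundary the condition stays true for every remaining chunk. One way to fix it is to advance the processed counter inside the loop so the boundary check fires once per interval crossed. This is a sketch of the idea only (the linked commit may differ; `file_write_synced` is a hypothetical stand-in for the rio-based code):

```c
#include <stdio.h>
#include <unistd.h>

/* Chunked write that fsyncs once per fsync_interval boundary crossed.
 * *processed is the running total of bytes written so far; it is
 * advanced inside the loop, so each chunk is checked against an
 * up-to-date count. Returns 1 on success, 0 on a write error. */
static size_t file_write_synced(FILE *fp, const void *buf, size_t len,
                                size_t fsync_interval, size_t *processed)
{
    size_t bytes_written = 0;
    while (len) {
        size_t chunk = (fsync_interval && fsync_interval < len)
                           ? fsync_interval : len;
        if (fwrite((const char*)buf + bytes_written, chunk, 1, fp) != 1)
            return 0;
        bytes_written += chunk;
        len -= chunk;
        if (fsync_interval &&
            *processed / fsync_interval <
                (*processed + chunk) / fsync_interval) {
            fflush(fp);
            fsync(fileno(fp));    /* fires at most once per boundary */
        }
        *processed += chunk;      /* advance before checking the next chunk */
    }
    return 1;
}
```

Because *processed moves forward with every chunk, a 100 MB write against a 32 MB interval triggers exactly three syncs (at the 32, 64, and 96 MB boundaries) rather than one sync per chunk after the first crossing.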

JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Aug 29, 2016

3 participants