Make stage_size & stage computation thread safe #2734

hlakshmi · 2019-12-12T06:57:59Z

Which issue(s) this PR fixes:
#2712
Fixes #

What this PR does / why we need it:
When chunks are written to the buffer, the stage_size computation is not thread-safe, as described in the linked issue. This can result in BufferOverflowErrors even though the buffer is not actually full; we hit this on a live, long-running fluentd. The method buffer.write is supposed to write to a chunk at most once, so stage_size should be added for a chunk only once. With write_step_by_step, however, the block that updates staged_bytesize may be called multiple times. A chunk can be rolled back without the value of staged_bytesize being reverted, which leaves stage_size larger than the actual value.

This PR addresses this issue.
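
To make the over-counting concrete, here is a minimal sketch of the two accounting strategies. It is not fluentd's code: `simulate_write`, the attempt sizes, and the `:chunk1` symbol are hypothetical stand-ins, and only the name `staged_bytesizes_by_chunk` is taken from the patch. It assumes a write whose block runs once per attempt, with every attempt except the last being rolled back.

```ruby
# Hypothetical helper: yields once per attempt. Only the final attempt
# is treated as committed; the earlier ones stand for rolled-back writes.
def simulate_write(chunk, attempt_sizes)
  attempt_sizes.each { |bytes| yield chunk, bytes }
  attempt_sizes.last # bytes that actually ended up in the chunk
end

attempts = [100, 60, 40] # made-up sizes for three attempts on one chunk

# Accumulating strategy: every attempt adds to the total, and a rollback
# never subtracts, so the total drifts above the real size.
accumulated_total = 0
committed = simulate_write(:chunk1, attempts) do |_chunk, bytes|
  accumulated_total += bytes
end

# Per-chunk strategy: each attempt overwrites the entry for its chunk,
# so only the last (successful) write is counted.
staged_bytesizes_by_chunk = {}
simulate_write(:chunk1, attempts) do |chunk, bytes|
  staged_bytesizes_by_chunk[chunk] = bytes
end
per_chunk_total = staged_bytesizes_by_chunk.values.sum

puts "committed=#{committed} accumulated=#{accumulated_total} per_chunk=#{per_chunk_total}"
```

With these made-up numbers the accumulating counter reports 200 bytes for a chunk that holds 40, which is how a buffer can look full when it is not.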

Docs Changes:

Release Note:

Signed-off-by: Harish Nelakurthi <harish.nelakurthi@illumio.com>

repeatedly · 2019-12-16T13:17:27Z

Thanks! We will check the issue and the patch later.

hlakshmi · 2019-12-23T03:52:55Z

@repeatedly If it looks good, can we get this merged into the upcoming release? This bug is causing a critical data-loss issue. We have tested the fix in a large-scale environment and it seems to work. Thanks!

ganmacs

Sorry for the delay. I added a comment.

ganmacs · 2019-12-24T07:33:09Z

lib/fluent/plugin/buffer.rb

+                # but this block **may** run multiple times from write_step_by_step and previous write may be rollbacked
+                # So we should be counting the stage_size only for the last successful write
+                #
+                staged_bytesizes_by_chunk[chunk] = adding_bytesize


I think the following situation can happen. What do you think?

1. staged_bytesizes_by_chunk stores chunk1.

2. A rollback happens in write_step_by_step (ShouldRetry is raised).

3. Another thread enqueues chunk1 before this thread enters
   fluentd/lib/fluent/plugin/buffer.rb, line 673 in 065751f:
       chunk = get_next_chunk.call

4. A new chunk (chunk2) is created in
   fluentd/lib/fluent/plugin/buffer.rb, line 663 in 065751f:
       synchronize{ @stage[metadata] ||= generate_chunk(metadata).staged! }

5. staged_bytesizes_by_chunk stores chunk2 (chunk1 still remains in staged_bytesizes_by_chunk).

@ganmacs Step (3) can't happen, because enqueue_chunk can't get the lock and waits until it is released by the thread that is writing to the chunk.

The lock is acquired here: https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/buffer.rb#L279
And released here:
https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/buffer.rb#L319

So enqueue_chunk can happen only after the commit is done.
Hope this helps.
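
The ordering argument can be sketched in isolation. The following is an illustration only, assuming the chunk exposes a re-entrant `synchronize` lock in the style of Ruby's `MonitorMixin`; `FakeChunk`, the event names, and the thread setup are hypothetical and do not reproduce fluentd's buffer code.

```ruby
require 'monitor'

# Hypothetical stand-in for a buffer chunk that carries its own lock.
class FakeChunk
  include MonitorMixin
  attr_reader :events

  def initialize
    super() # sets up the monitor
    @events = []
  end
end

chunk = FakeChunk.new
writer_holds_lock = Queue.new

# The writer takes the chunk lock and keeps it until the "commit" is done.
writer = Thread.new do
  chunk.synchronize do
    writer_holds_lock << true
    chunk.events << :write_started
    sleep 0.05 # stands in for the write/commit work
    chunk.events << :committed
  end
end

# Start the enqueuer only once the writer is known to hold the lock.
writer_holds_lock.pop
enqueuer = Thread.new do
  # Blocks here until the writer leaves its synchronize block.
  chunk.synchronize { chunk.events << :enqueued }
end

[writer, enqueuer].each(&:join)
p chunk.events
```

In this sketch `:enqueued` can only be recorded after `:committed`, which mirrors the claim that the enqueue waits for the writing thread to release the lock.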

ganmacs

👏

ganmacs · 2019-12-25T01:52:23Z

lib/fluent/plugin/buffer.rb

+                # but this block **may** run multiple times from write_step_by_step and previous write may be rollbacked
+                # So we should be counting the stage_size only for the last successful write
+                #
+                staged_bytesizes_by_chunk[chunk] = adding_bytesize


Makes sense. You're right. Another thread can get the chunk but can't enter the chunk.synchronize do block (fluentd/lib/fluent/plugin/buffer.rb, line 674 in 065751f).

repeatedly · 2020-01-06T02:01:02Z

Thanks!

hlakshmi force-pushed the fix-stage-size-computation branch from bf1c94c to 4383919 on December 12, 2019 06:59

Make stage_size & stage computation thread safe

a1731f1

Signed-off-by: Harish Nelakurthi <harish.nelakurthi@illumio.com>

hlakshmi force-pushed the fix-stage-size-computation branch from 4383919 to a1731f1 on December 12, 2019 07:11

add stage_size only for last successful write to the chunk

065751f

Signed-off-by: Harish Nelakurthi <harish.nelakurthi@illumio.com>

hlakshmi force-pushed the fix-stage-size-computation branch from 64a72d5 to 065751f on December 17, 2019 18:43

ganmacs requested review from ganmacs and repeatedly December 23, 2019 04:49

ganmacs reviewed Dec 24, 2019

ganmacs approved these changes Dec 25, 2019

repeatedly merged commit 24ddf6e into fluent:master Jan 6, 2020

cosmo0920 mentioned this pull request Jan 24, 2020

stage size computation in buffer plugin is not thread safe #2712

Closed

rehevkor5 mentioned this pull request May 22, 2020

fluentd_output_status_buffer_total_bytes metric returns negative (minus) results fluent/fluent-plugin-prometheus#55

Open
