Add close_timeout option #1926

Merged
merged 1 commit into from
Aug 31, 2016

Conversation


@ruflin ruflin commented Jun 29, 2016

close_timeout will end the harvester after the predefined time. In case the output is blocked, close_timeout will only apply on the next event sent. This is identical to the other close_* options.

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from 76e5c96 to a16b773 Compare July 4, 2016 09:01
@@ -105,6 +106,11 @@ func (h *Harvester) Harvest() {
	if !h.sendEvent(event) {
		return
	}

	if h.Config.CloseTTL > 0 && time.Since(h.startTime) > h.Config.CloseTTL {

urso:

config validation mandates that CloseTTL cannot be 0 (due to min=0,nonzero). Is this on purpose?

ruflin (Member Author):

yes, because I don't see any use cases for close_ttl = 0, as this would close the harvester directly again.

@ruflin ruflin added the in progress Pull request is currently in progress. label Jul 4, 2016
@ruflin ruflin changed the title Add close_ttl option Add close_timeout option Jul 6, 2016
@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from 2e5b57f to ff5df38 Compare July 12, 2016 08:05
@ruflin ruflin mentioned this pull request Jul 12, 2016
8 tasks
@ruflin ruflin removed the in progress Pull request is currently in progress. label Jul 12, 2016
if h.config.CloseTimeout > 0 && time.Since(startTime) > h.config.CloseTimeout {
	logp.Info("Closing harvester because ttl was reached: %s", h.path)
	return
}

urso:

Checking for CloseTimeout at this place requires an event to be published for closeTimeout to function.

If the reader is waiting long for a new line, or sendEvent blocks due to the spooler being blocked on the publisher, close_timeout might be very, very late.

urso:

file reader has some 'inactive timeout'? The 'inactive timeout' should be <= close_timeout.

What about multiline timeout? hm...

ruflin (Member Author):

An alternative implementation could be to implement the timeout in a separate goroutine that closes h.done?
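
For illustration, a rough sketch of that idea (hypothetical, trimmed-down types; not the actual filebeat code):

package harvester

import (
	"sync"
	"time"
)

// Harvester is a hypothetical stand-in used only to illustrate the
// close_timeout goroutine idea from the comment above.
type Harvester struct {
	closeTimeout time.Duration // the close_timeout option
	done         chan struct{} // closed to signal shutdown
	stopOnce     sync.Once     // guards against closing done twice
}

// startCloseTimeout spawns a goroutine that closes h.done once the
// timeout elapses, independent of whether an event could be sent.
func (h *Harvester) startCloseTimeout() {
	if h.closeTimeout <= 0 {
		return // option disabled
	}
	go func() {
		select {
		case <-time.After(h.closeTimeout):
			h.stopOnce.Do(func() { close(h.done) })
		case <-h.done:
			// Harvester is already stopping for another reason.
		}
	}()
}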


ruflin commented Jul 18, 2016

@urso I pushed an alternative implementation. Can you have a look?


ruflin commented Jul 18, 2016

jenkins, retest it

@@ -45,6 +46,7 @@ type harvesterConfig struct {
	CloseRemoved bool          `config:"close_removed"`
	CloseRenamed bool          `config:"close_renamed"`
	CloseEOF     bool          `config:"close_eof"`
	CloseTimeout time.Duration `config:"close_timeout" validate:"min=0,nonzero"`

urso:

validate enforces CloseTimeout >= 1ns, which means CloseTimeout is always enabled. Why disallow 0 or -1?

ruflin (Member Author):

I agree that nonzero should be removed. But -1 doesn't make sense from my point of view?

urso:

yeah, -1 is kinda similar to 0. CloseTimeout should only be enabled if >0

ruflin (Member Author):

I think that is currently the case?

urso:

It's not. Using nonzero in validate requires close_timeout > 0 in config files.
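
A sketch of the relaxed tag that would allow 0 to mean "disabled" (illustration only, not the final diff):

// 0 disables close_timeout; negative values are still rejected by min=0
CloseTimeout time.Duration `config:"close_timeout" validate:"min=0"`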

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from b2f8c91 to 913c3f2 Compare July 20, 2016 08:35

ruflin commented Jul 20, 2016

@urso PR updated: added a note to the docs about the potential multiline issue, and added some notes about the channels to the code. We definitely should revise how processors are started / stopped at a later stage.

@ruflin ruflin added in progress Pull request is currently in progress. and removed review labels Jul 20, 2016

ruflin commented Jul 20, 2016

In discussion with @urso we realised there is a potential issue in the reader / processor which could lead to an endless loop. Before merging this PR, the reader / processor part should be cleaned up to guarantee proper stopping.

ruflin added a commit to ruflin/beats that referenced this pull request Aug 22, 2016
Even though the implementation for close_timeout is not finished, it already showed up in the docs and config file. As close_timeout still needs more work, these were removed for now to prevent any confusion. All changes for close_timeout go into elastic#1926
tsg pushed a commit that referenced this pull request Aug 22, 2016
Even though the implementation for close_timeout is not finished, it already showed up in the docs and config file. As close_timeout still needs more work, these were removed for now to prevent any confusion. All changes for close_timeout go into #1926
@ruflin ruflin force-pushed the close_ttl branch 3 times, most recently from e58fef6 to 9574e3d Compare August 29, 2016 14:21

ruflin commented Aug 29, 2016

I redid the implementation and added a Close function to the log_file reader. This implementation has the following issues:

  • In case the output applies back pressure and not even one more event can be sent, the harvester will not be closed, meaning the file handler will stay open if the output service is down
  • An alternative implementation was done where the harvester was killed directly, without waiting for the output to complete. This had the side effect that the final state of the harvester (Finished) was not updated properly. The issue here is that state updates and data updates currently go through the same channel. Even though state updates are not sent by the spooler, they still go through the spooler, and if the queue is full no further states can be sent either. One potential option here could be to decouple the state updates from the event updates. But this would bring lots of new issues with race conditions between state and event updates. In addition, state updates for events should only be applied if the event was completely sent.

The above implementation is quite simple and works in cases where the output is behaving normally. This is in line with the other close_* options. They all only apply in case the output works as expected. Otherwise file handlers stay open.

To have some kind of circuit breaker which really stops the harvester independently of the output, I think we also need to make changes to the output to potentially have some feedback loop. Or we could not even start new harvesters if there is congestion (limit the number of harvesters ...). That would still not solve the problem, but it would avoid opening more and more files when the output is stuck.

func (r *LogFile) Close() {
	// Make sure reader is only closed once
	r.singleClose.Do(func() {
		close(r.done)
	})
}

urso:

As the underlying reader might block on a syscall, the file must be closed right after closing the channel.

ruflin (Member Author):

done
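
A sketch of what the adjusted Close could look like after this change (r.fs as the name of the field holding the underlying file source is an assumption for illustration):

func (r *LogFile) Close() {
	// Make sure the reader is only closed once.
	r.singleClose.Do(func() {
		// Signal shutdown first ...
		close(r.done)
		// ... then close the underlying file right away, so a reader
		// blocked in a syscall is unblocked as well. r.fs is an assumed
		// field name, not necessarily the real one.
		r.fs.Close()
	})
}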


urso commented Aug 30, 2016

Which error is passed up the readers if the file is closed by timeout (or the done channel)?

What happens to buffered (but incomplete) multiline events if the file is closed early?


ruflin commented Aug 30, 2016

If the channel is closed, the error will be ErrClosed. What the error will be after closing the file handler probably depends on which state the file handler was in at that moment.

Incomplete multiline events will be sent on ErrClosed (see tests). This is handled the same as if the multiline timeout was reached.

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from 290ea69 to 632cd65 Compare August 30, 2016 12:05
	h.fileReader.Close()
} else {
	h.file.Close()
}

urso:

Who is calling (*Harvester).close()? Like, from which goroutine? Is there a chance of a race on h.fileReader being set?

ruflin (Member Author):

The same goroutine that calls close is also setting the fileReader, so there should not be any race condition.

urso:

hm... why not call defer fileReader.Close() right after creating it in said goroutine? fileReader.Close() already ensures it's doing the magic only once + is thread-safe.
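
Roughly the following in the goroutine that creates the reader (a sketch; newLogFileReader stands in for the actual constructor):

fileReader, err := newLogFileReader(h.file, h.config) // hypothetical constructor
if err != nil {
	return err
}
// Close is guarded by sync.Once and is thread-safe, so deferring it
// here is safe even if the close_timeout path closes it first.
defer fileReader.Close()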

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from a6f9980 to ba6934c Compare August 30, 2016 13:30
@ruflin ruflin added review and removed blocked in progress Pull request is currently in progress. labels Aug 30, 2016

In case close_timeout is used in combination with multiline events, it can happen that the harvester is stopped in the middle of a multiline event, meaning only part of the event will be sent. In case the harvester is continued again at a later stage and the file still exists, only the second part of the event will be sent.

Close timeout will not apply, in case your output is stuck and no further events can be sent. At least one event further event must be sent to make close_timeout apply.

urso:

At least one event further event must be sent to make close_timeout apply.

Can you clarify?

ruflin (Member Author):

fixed


urso commented Aug 30, 2016

@ruflin needs rebase

close_timeout will end the harvester after the predefined time. In case the output is blocked, close_timeout will only apply on the next event sent. This is identical to the other close_* options.

ruflin commented Aug 31, 2016

@urso Rebase done


ruflin commented Aug 31, 2016

jenkins, retest it

@urso urso merged commit 001b29b into elastic:master Aug 31, 2016
@monicasarbu monicasarbu deleted the close_ttl branch September 5, 2016 10:45
2 participants