Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving resiliency through disk buffering capability #2285

Closed
pmm-sumo opened this issue Dec 14, 2020 · 14 comments
Closed

Improving resiliency through disk buffering capability #2285

pmm-sumo opened this issue Dec 14, 2020 · 14 comments
Assignees
Labels

Comments

@pmm-sumo
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Currently, collector buffers all data in memory, which can lead to data loss in case buffer gets overfilled (e.g. during temporary exporter connection issues) or due to a sudden crash.

Describe the solution you'd like

There should be a possibility to enable disk buffering capability. For example, exporter could buffer the incoming data on disk (up to the defined limit) and clear the buffer as data gets sent out. In case of restart, recovery process could happen, which would send-out the buffered data.

Perhaps this capability could be a good fit to queued_retry helper, since it already manages send queues?

Additional context
Gitter discussion

@tigrannajaryan
Copy link
Member

Note that queued_retry is deprecated. We need to think where the persistent queue should fit. Perhaps queued_retry instead of being deprecated needs to be re-worked and persistence added.

@pmm-sumo
Copy link
Contributor Author

Note that queued_retry is deprecated. We need to think where the persistent queue should fit. Perhaps queued_retry instead of being deprecated needs to be re-worked and persistence added.

The processor is deprecated but is it also true for the helper?

@dashpole
Copy link
Contributor

@pmm-sumo I would love to collaborate on this. There are some folks at Google who are interested in solving this problem as well. cc @erain

@tigrannajaryan is this something that can be allowed for GA?

@tigrannajaryan
Copy link
Member

Note that queued_retry is deprecated. We need to think where the persistent queue should fit. Perhaps queued_retry instead of being deprecated needs to be re-worked and persistence added.

The processor is deprecated but is it also true for the helper?

The helper is here to stay. We don't plan to deprecate it. However, if implemented in the helper each exporter will have its own persistent queue, which may or may not be desirable (it will likely mean persistent data is duplicated if more than one exporter is used).

@tigrannajaryan
Copy link
Member

@pmm-sumo I would love to collaborate on this. There are some folks at Google who are interested in solving this problem as well. cc @erain

@tigrannajaryan is this something that can be allowed for GA?

It depends on the implementation. If it requires major changes to existing code and may destabilize the Collector right before the GA then it may not be desirable. But it should not be an issue if it is released immediately after GA, e.g. in 1.1.

If the implementation is a separate component that doesn't touch existing code then I see no problem with allowing it for GA.

Either way, let's see a design document first and we can discuss.

@pmm-sumo
Copy link
Contributor Author

pmm-sumo commented Jan 4, 2021

Thank you for the clarification @tigrannajaryan I agree that we need a design doc first

@dashpole and @erain, did you perhaps start working/researching this? Perhaps we could mention this during the next SIG meeting (if anyone else would like to participate too) and plan a separate meeting for discussing it? I think we might need more than a few minutes to share our ideas, discuss options and decide how to split the work (even if we would just start with creating a design doc). What do you think?

@dashpole
Copy link
Contributor

dashpole commented Jan 5, 2021

Thus far, we have mostly been collecting use-cases, but have also explored a few different possible solutions that we'd be happy to share.

I added it to the next collector sig meeting agenda.

@djaglowski
Copy link
Member

The recent stanza contribution may include some useful code to bootstrap this effort. I'm not deeply familiar with this part of the codebase, but it implements buffering and flushing as independent capabilities.

This code is basically dead at this point, since stanza previously made use of it as part of its output operators, which have been removed due to redundancy with the collector. If this code appears to be useful at all, I believe it would be fine to pull it out of the new repository and adapt as necessary without worrying about duplication.

@pmm-sumo
Copy link
Contributor Author

Perhaps a related issue (when format used for buffering is considered): open-telemetry/opentelemetry-specification#1443

@tigrannajaryan
Copy link
Member

@pmm-sumo please review open-telemetry/opentelemetry-collector-contrib#3087 from the perspective of using in disk buffer.

@pmm-sumo
Copy link
Contributor Author

I have prepared a design doc with some ideas on how this could be achieved. cc @dashpole

@SkrPaoWang
Copy link

how much time the buffer data can be stored in memory?

@pmm-sumo
Copy link
Contributor Author

how much time the buffer data can be stored in memory?

As mentioned in the other issue, this controlled by number of batches using queued_retry configuration. I am working on a PoC of a generic buffering solution, should have an update soon

tigrannajaryan pushed a commit that referenced this issue Aug 4, 2021
As discussed during the SIG, we want to move storage extension to core, starting with the interface (this PR) so persistent buffer implementation (#2285) could use it

**Link to tracking Issue:** #3424 

**Testing:** Just the interface, no tests

**Documentation:** README.md with API

cc @djaglowski @tigrannajaryan
tigrannajaryan pushed a commit that referenced this issue Sep 10, 2021
…3274)

Persistent queue implementation within queued_retry, aimed at being compatible with Jager's [BoundedQueue](https://github.com/jaegertracing/jaeger/blob/master/pkg/queue/bounded_queue.go) interface (providing a simple replacement) and backed by [file storage extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage) for storing WAL.

Currently, to run the persistent queue, OpenTelemetry Collector Contrib with `enable_unstable` build tag is required.

**Link to tracking Issue:** #2285 

[Design doc](https://docs.google.com/document/d/1Y4vNthCGdYI61ezeAzL5dXWgiZ73y9eSjIDitk3zXsU/edit#)

**Testing:** Unit Tests and manual testing, more to come

**Documentation:** README.md updated, including an example
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
…-telemetry#2285)

* Fix IPv6 handling errors in semconv.NetAttributesFromHTTPRequest

fixes open-telemetry#2283

* Enter PR number in CHANGELOG

* Remove unnecessary creation and then assignment

Standardize order of checks for IP, Name, Port

* Assume happy path when parsing host and port

i.e. assume net.SplitHostPort(input) will succeed

* Get rid of uint64 for port

* Fix git merge of main by adding back strings import

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Co-authored-by: Tyler Yahn <codingalias@gmail.com>
@pmm-sumo
Copy link
Contributor Author

pmm-sumo commented Feb 7, 2022

Closing this since the capability is now available through experimental queued_retry implementation

@pmm-sumo pmm-sumo closed this as completed Feb 7, 2022
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this issue Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants