Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dedicated thread for Metrics PeriodicReader #2142

Open
wants to merge 48 commits into
base: main
Choose a base branch
from

Conversation

cijothomas
Copy link
Member

@cijothomas cijothomas commented Sep 24, 2024

  1. This eliminates the need of async runtime, and avoids all the bugs associated with (in)correct usage of async runtimes and other bugs like [Bug]: PeriodicReader uses sync::Mutex. #1507 and [Bug]: PeriodicReader::shutdown() deadlocks on current thread runtime. #2056
    Given the usage of dedicated thread, PeriodicReader is no longer at the mercy of user threads not blocking threads and causing issues like this.
  2. Callbacks for Observable instruments can now make async calls with help of futures executor/others without deadlocking, as the callbacks are now executed from a normal thread, and avoids issues like Observable Gauge Callback #2063
  3. This also uses named thread, addressing [Feature]: Name spawned threads #1949
  4. User thread is never blocked in this. Blocking calls, if any, are only from OTel's own background thread.
  5. Flush - is designed to be blocking the calling thread - this should not be a concern as Flush() is rarely required.
  6. Shutdown - This PR keeps in sync, but will offer an async version of it, so users can call shutdown and await without blocking thread. This can be done without dependency on async-runtimes in the OTel SDK.
  7. This is a working prototype, and I expect to iterate over based on feedback. Have used internal logs liberally for now. This will be revisited to keep it to essentials only before stabilizing.
  8. A notable limitation as-is is the lack of time out enforcement at PeriodicReader, as it is now delegated to exporters. A future PR will modify the exporters to actually enforce this, to keep the scope in control for this PR. Or explore how to enforce timeouts from PeriodicReader itself.

(The ability for users to bring-own-async-runtime - this can be offered as an opt-in feature. How exactly - TBD. Most likely we need to do it post 1.0)

Note: The key change is in periodicreader.rs only. Rest are cascading effects on removing runtime requirement, passing timeout to exporters. The non-relevant changes to examples are to be removed before PR merge - it is added to help anyone run locally and observe the logs themselves.

@cijothomas cijothomas requested a review from a team as a code owner September 24, 2024 19:27
@cijothomas cijothomas changed the title Use dedicated thread for Metrics PeriodicReader [WIP] Use dedicated thread for Metrics PeriodicReader Sep 24, 2024
@demurgos
Copy link
Contributor

demurgos commented Sep 24, 2024

I'm worried about support for situations where threads may be problematic, either because their are not supported or because or due to the use of functions unsafe to call in the presence of threads (non-reentrant function calls). For example, what about WebAssembly?

Would it be possible to have thread not required and instead be another runtime to chose from? Or exposing a lower level interface based on channels and let the use chose its concurrency solution?

@cijothomas
Copy link
Member Author

I'm worried about support for situations where threads may be problematic, either because their are not supported or because or due to the use of functions unsafe to call in the presence of threads (non-reentrant function calls). For example, what about WebAssembly?

Would it be possible to have thread not required and instead be another runtime to chose from? Or exposing a lower level interface based on channels and let the use chose its concurrency solution?

Yes that is very valid point, which is already mentioned in the desc:
The ability for users to bring-own-async-runtime - this can be offered as an opt-in feature. How exactly - TBD. Most likely we need to do it post 1.0)

Yes we expect there'd be scenarios where OTel spawning background threads is not feasible, and OTel will offer a way to "bring your own runtime".

@cijothomas
Copy link
Member Author

@sandersaares Could you take a look?

Copy link

@sandersaares sandersaares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic premise seems workable. I suspect my lack of OTel familiarity prevents me from making more useful comments here.

opentelemetry-sdk/src/metrics/periodic_reader.rs Outdated Show resolved Hide resolved
.github/workflows/ci.yml Outdated Show resolved Hide resolved
@cijothomas cijothomas added this to the 0.27 milestone Oct 7, 2024
@cijothomas cijothomas changed the title [WIP] Use dedicated thread for Metrics PeriodicReader Use dedicated thread for Metrics PeriodicReader Oct 8, 2024
@@ -61,7 +61,7 @@ fn init_tracer_provider() -> Result<sdktrace::TracerProvider, TraceError> {

fn init_metrics() -> Result<opentelemetry_sdk::metrics::SdkMeterProvider, MetricsError> {
opentelemetry_otlp::new_pipeline()
.metrics(opentelemetry_sdk::runtime::Tokio)
.metrics()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified that this still works? I believe reqwest relies on Tokio. Now that the outgoing Http call through reqwest is being made on a background thread without Tokio (or any runtime), it might be problematic. Looking to ensure that we don't hit this issue in particular:

there is no reactor running, must be called from the context of Tokio runtime

Related to #248, #2137

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point!
Based on offline discussion, yes this appear to cause issues when using both reqwest and hyper. These libraries seem to have a strong requirement that it cannot work unless they are inside tokio runtime.
tonic seem to work fine without issues.

Will check if there are ways to work around the http library limitations.

Copy link
Member

@lalitb lalitb Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will check if there are ways to work around the http library limitations.

I did a quick test with Simple Log Processor + OTLP exporter with request::blocking::Client - this works without need of tokio runtime. This should also work with background thread for batch in that case. Enforcing this for background thread could be one option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will check if there are ways to work around the http library limitations.

I did a quick test with Simple Log Processor + OTLP exporter with request::blocking::Client - this works without need of tokio runtime. This should also work with background thread for batch in that case. Enforcing this for background thread could be one option.

Yes. But there is no blocking version for libraries like tonic, effectively limiting exporting to only support http with reqwest::blockingClient.
Will need to redesign after doing a comparison of all feasible options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tonic seem to work fine without issues.

This maybe incorrect, as it may have worked during shutdown only. Apologies for the confusion!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tonic is tricky, there is no alternative either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lalitb I have another correction to make about tonic, based on more testing:
The changes in this PR works fine with tonic, as long as the main function of the app is a tokio one. Yes the tonic::export call is made from our background thread, still this works.
If the main function is not a tokio main, then the app panics at meterprovider build itself. It looks like we attempt to create a grpc channel at build(), and that fails due to lack of tokio runtime.

Copy link
Member

@lalitb lalitb Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this PR works fine with tonic, as long as the main function of the app is a tokio one. Yes the tonic::export call is made from our background thread, still this works.

Interesting, I was thinking the spawned thread (std::thread::spawn) doesn't run in the tokio runtime and so will not have access to tokio runtime. thanks for confirming.

Trying to summarize the scenarios:

These work:

 tokio::main -> background-thread -> gRPC (tonic), HTTP (hyper, reqwest, reqwest-blocking) 
 tokio::main -> simple-exporter -> gRPC (tonic), HTTP (hyper, reqwest, reqwest-blocking)  (assuming we do filtering to avoid infinite loop).
 tokio::main -> simple-exporter -> HTTP (reqwest-blocking) 
 main -> simple-exporter -> HTTP (reqwest-blocking)

And these doesn't work:

 main -> background-thread -> gRPC (tonic), HTTP (hyper, reqwest, reqwest-blocking) 
 tokio::main(current_thread) -> simple-exporter -> gRPC (tonic), HTTP (hyper, reqwest, reqwest-blocking) # hangs

how about this ?

tokio::main(current_thread) -> background-thread -> gRPC (tonic), HTTP (hyper, reqwest, reqwest-blocking) 

self.is_shutdown
.store(true, std::sync::atomic::Ordering::Relaxed);
if response {
Ok(())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Join the background thread before considering it a successful shutdown?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Join the background thread could block the user thread if the background thread is some how blocking (like blocking in the export call), so the timeout parameter could be ignored.

Could we try to join it if it can be done within the specified timeout, and give up the joining operation once timeout is reached?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomsonTan Such a situation will not arise in current implementation. If response_rx got the message back, the background thread will be exiting itself, and not doing anything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants