Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Add a test covering exception thrown in custom sampler #4072

Merged
merged 4 commits into from
Jan 23, 2023

Conversation

CodeBlanch
Copy link
Member

Changes

  • Adds a test covering what happens when an exception is thrown in a custom sampler. Intended to spark some discussion because the current behavior might be flawed?

/cc @alanwest @utpilla @cijothomas

@CodeBlanch CodeBlanch requested a review from a team January 10, 2023 20:53
@codecov
Copy link

codecov bot commented Jan 10, 2023

Codecov Report

Merging #4072 (c9679dc) into main (9e00b8f) will decrease coverage by 0.02%.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4072      +/-   ##
==========================================
- Coverage   85.57%   85.56%   -0.02%     
==========================================
  Files         289      289              
  Lines       11261    11261              
==========================================
- Hits         9637     9635       -2     
- Misses       1624     1626       +2     
Impacted Files Coverage Δ
...emetry.Api/Internal/OpenTelemetryApiEventSource.cs 73.52% <0.00%> (-8.83%) ⬇️
...p/Implementation/HttpInstrumentationEventSource.cs 72.00% <0.00%> (-4.00%) ⬇️
...tpListener/Internal/PrometheusCollectionManager.cs 73.62% <0.00%> (-2.20%) ⬇️
...Telemetry/Internal/SelfDiagnosticsEventListener.cs 97.65% <0.00%> (+0.78%) ⬆️
src/OpenTelemetry/BatchExportProcessor.cs 84.11% <0.00%> (+1.86%) ⬆️
...ter.ZPages/Implementation/ZPagesActivityTracker.cs 100.00% <0.00%> (+2.85%) ⬆️

@reyang
Copy link
Member

reyang commented Jan 10, 2023

A related question - what is the expected behavior if the sampler throws?

Here are some possible options:

  • Exception from a sampler should be considered as SamplingDecision.Drop. In additional, an internal error log might be generated.
  • A sampler should not throw exceptions, any exception from a sampler is considered as a bug of the sampler and the SDK will not handle it (which means the result is undefined and might lead to application crash).

@reyang
Copy link
Member

reyang commented Jan 10, 2023

I personally would vote for leaving it as undefined behavior, and not handle it in the SDK. This also applies to the following situations:

  1. What if the sampler itself ended up creating more activities (which might cause recursive calls with stack overflow)?
  2. What if the sampler is taking indefinite amount of time?
  3. What if the sampler has contention that might interfere with the normal execution of the application?

{
public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
{
throw new InvalidOperationException("ThrowingSampler");
Copy link
Contributor

@utpilla utpilla Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the direction about who is responsible for catching/ not throwing the exception should come from the spec. Similar to how they have specified it for Processor OnStart and OnEnd methods here.

If we have a particular preference, we should try to get the spec updated for ShouldSample first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the direction about who is responsible for catching the exception should come from the spec. Similar to how they have specified it for Processor OnStart and OnEnd methods here

The spec doesn't seem to talk about who should be responsible if exception was thrown from a Processor's OnStart/OnEnd?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OnEnd is called after a span is ended (i.e., the end timestamp is already set). This method MUST be called synchronously within the Span.End() API, therefore it should not block or throw an exception.

I interpret this statement as "Processor.OnEnd should not throw an exception" which makes it the user's responsibility to provide a spec-compliant processor to the SDK. If the processor logic can throw an exception, it's the user's responsibility to ensure that it is not thrown (they could simply wrap a try-catch around their processor logic.

@utpilla
Copy link
Contributor

utpilla commented Jan 10, 2023

Another related question: Why do we not have a try-catch around Exporter.Export in BatchExportProcessor?

if (this.circularBuffer.Count > 0)
{
using (var batch = new Batch<T>(this.circularBuffer, this.MaxExportBatchSize))
{
this.exporter.Export(batch);
}
try
{
this.dataExportedNotification.Set();
this.dataExportedNotification.Reset();
}
catch (ObjectDisposedException)
{
// the exporter is somehow disposed before the worker thread could finish its job
return;
}
}

We have try-catch around Exporter.Export for SimpleExporterProcessor and BaseExportingMetricReader:

lock (this.syncObject)
{
try
{
this.exporter.Export(new Batch<T>(data));
}
catch (Exception ex)
{
OpenTelemetrySdkEventSource.Log.SpanProcessorException(nameof(this.OnExport), ex);
}
}

try
{
OpenTelemetrySdkEventSource.Log.MetricReaderEvent(this.exportCalledMessage);
var result = this.exporter.Export(metrics);
if (result == ExportResult.Success)
{
OpenTelemetrySdkEventSource.Log.MetricReaderEvent(this.exportSucceededMessage);
return true;
}
else
{
OpenTelemetrySdkEventSource.Log.MetricReaderEvent(this.exportFailedMessage);
return false;
}
}
catch (Exception ex)
{
OpenTelemetrySdkEventSource.Log.MetricReaderException(nameof(this.ProcessMetrics), ex);
return false;
}
}

@cijothomas
Copy link
Member

See #1640

@cijothomas cijothomas closed this Jan 11, 2023
@cijothomas cijothomas reopened this Jan 11, 2023
@CodeBlanch
Copy link
Member Author

CodeBlanch commented Jan 11, 2023

I was just poking around the spec. The docs for the different extension points have varying degrees of clarity, but the error handling doc seems crystal clear 😄

Basic error handling principles

OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.

The spirit of this doc says to me the primary goal is to not interrupt the function of the code being instrumented a la we should try/catch user code.

@reyang
Copy link
Member

reyang commented Jan 11, 2023

The spirit of this doc says to me the primary goal is to not interrupt the function of the code being instrumented a la we should try/catch user code.

What would you consider as "user code"? E.g. if someone is writing a processor/exporter, do we consider them as SDK developers or SDK plugin developers?

@CodeBlanch
Copy link
Member Author

What would you consider as "user code"?

Let me amend my statement:

The spirit of this doc says to me the primary goal is to not interrupt the function of the code being instrumented a la we should try/catch user all code.

To say that another way, what I feel like the spec is trying to say is: Once started, the SDK should never throw. Regardless of if it is itself throwing the exception or it is bubbling up an exception from some plugin it executed.

@reyang
Copy link
Member

reyang commented Jan 11, 2023

To say that another way, what I feel like the spec is trying to say is: Once started, the SDK should never throw. Regardless of if it is itself throwing the exception or it is bubbling up an exception from some plugin it executed.

Putting these together, how would you phrase the implementation principles/guidelines? I think either way is fine as long as we have clear principles, and the worst situation is that we don't have principles + we never thought about it + we just do things randomly without clarity.

Take an example:

Here are more examples:

  • "If an exporter has thrown an exception to the caller, the exporter will be treated as faulty. As a result, the exception MUST be swallowed by the caller, followed by an internal error log emitted by the caller, and the caller should stop making any further call to the exporter. In case the exporter is running on a batch exporting background thread, the thread should be terminated."
  • "If a sampler has thrown an exception to the caller, the sampler will be treated as faulty. As a result, the SamplingResult should be treated as DROP + no extra attributes + bypass tracestate"

@CodeBlanch
Copy link
Member Author

@reyang

My personal opinion, agree with @utpilla it would be best for the spec to prescribe the behavior. Ambiguity in specifications is bad. What should the spec say? IMO, keep it simple:

OnStart ... should not block or throw exceptions. However, if OnStart throws exceptions, the SDK should swallow the exception and write the details to whatever diagnostic mechanism is in place for the SDK.

My reasoning is that the spec has been ambiguous and our code has been unprotected thus far, and the world hasn't ended, so we shouldn't over engineer something 😄

@reyang
Copy link
Member

reyang commented Jan 13, 2023

@reyang

My personal opinion, agree with @utpilla it would be best for the spec to prescribe the behavior. Ambiguity in specifications is bad. What should the spec say? IMO, keep it simple:

OnStart ... should not block or throw exceptions. However, if OnStart throws exceptions, the SDK should swallow the exception and write the details to whatever diagnostic mechanism is in place for the SDK.

I feel this simple answer is starting to scratch the surface of the design, which leads to couple more questions:

  • Do we keep calling the sampler later (when new telemetry data is flowing through), assuming that it might stop throwing exceptions? - that might result in massive amount of exceptions and how do we handle that?
  • Or we should somehow put the sampler in the abandoned list and never attempt to call it again?
  • If we swallow the exception, what do we do with the telemetry - treat it as Drop or something else?

@CodeBlanch
Copy link
Member Author

This is just my opinion but for the sampler I would say a throw is a drop. I would not prescribe back-off from the spec or anything more perverse. It could leave the door open for that should an SDK want it. My reasoning is a sampler can implement its own retry/backoff/fallback/whatever if it is doing complex things. The goal is just to fulfil the spec's mission to be transparent and not bring down the process/interfere with useful code the app is trying to execute.

@reyang
Copy link
Member

reyang commented Jan 13, 2023

I don't feel this would solve any problems or make things better, and it is more likely to make things worse. I'll take one example - a very bad sampler which triggers a blocking I/O operation:

  1. There isn't much to do from the SDK, the user will for sure be impacted.
  2. The sampler should not do such blocking call - the only reasonable fix is going to be provided by the developer who owns the sampler code, or a mitigation on the application developer to avoid using the sampler.

I consider taking blocking calls, making recursive calls, leaking memory, don't properly handle exceptions - these are equally bad, and from the consistency perspective it makes more sense to treat them in the same way.

More thinking regarding the ecosystem, completely agree with:

My reasoning is that the spec has been ambiguous and our code has been unprotected thus far, and the world hasn't ended, so we shouldn't over engineer something

I think a simple "solution" is to do nothing (I guess LoggingProvider is doing the same - not over protecting), not over protecting things knowing that we won't be able to protect it and on the other side we might introduce bad behavior (e.g. a plugin developer might decide to throw exception from a deep callstack for "simplicity" instead of return SamplingDecision.Drop).

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jan 21, 2023
@alanwest
Copy link
Member

Reminds me of this old conversation #1081 (comment).

At that point in time, my understanding of the spec was that the SDK should suppress exceptions even from custom components.

the error handling doc seems crystal clear

I'm not as convinced anymore 😆.

From the error handling spec

The SDK MUST NOT throw unhandled exceptions for errors in their own operations. For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.

The MUST NOT in this statement does not imply to me that the SDK must seek to swallow unexpected exceptions raised from all components - custom or otherwise. The second statement suggests that it is the exporter's responsibility to not throw exceptions. I agree that the exporters hosted in this repo must not throw exceptions. If they did, we would certainly treat this as a bug and fix it.

Just as it is our responsibility to ensure our components do not throw exceptions, I think authors of custom exporters, samplers, processors, etc share this responsibility.

If the intent of the spec is that SDKs must suppress all exceptions that may crash an application then I think it should clearly state this.

Specifically with regards to samplers, I asked Java. Turns out an exception thrown from a custom sampler in Java will not be swallowed by the SDK.

Just another data point... another example within the .NET ecosystem: if you write a custom ILogger logging provider that throws an exception, this will not be caught and may cause an app to crash.

@cijothomas
Copy link
Member

@alanwest good points.
Lets be good citizens by reporting bad actors to the cop, i.e if a bad actor throws, lets not eat it, and encourage bad behavior. instead, let the app crash. We are not violating Otel principles by doing so.

Comment on lines 259 to 260
// TODO: Discuss: An exception thrown in custom sampler probably
// should not blow up like this?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove TODO assuming we agree this is the behavior we want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or leave a comment saying this is by design, and link to this PR/issues, so a future reader can easily find all the prior discussions on this topic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@cijothomas
Copy link
Member

It'd be good to modify the docs (separately) meant for custom plugin writers, to strongly warn them of the implications of throwing from their components and be explicit like "if you throw, it might lead to app crash"/similar wordings.
https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/trace/extending-the-sdk

@cijothomas cijothomas merged commit 84b3778 into open-telemetry:main Jan 23, 2023
@CodeBlanch CodeBlanch deleted the sampler-exception branch January 25, 2023 21:20
aunikitin pushed a commit to aunikitin/opentelemetry-dotnet that referenced this pull request Feb 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale Issues and pull requests which have been flagged for closing due to inactivity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants