-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding new inferred sampler #12927
Adding new inferred sampler #12927
Conversation
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❌ Gradle check result for e02fc95: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Compatibility status:Checks if related components are compatible with change b944c0b Incompatible componentsSkipped componentsCompatible componentsCompatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/sql.git] |
❌ Gradle check result for 78dcc6b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <114057571+nishchay21@users.noreply.github.com>
❌ Gradle check result for 432d294: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❌ Gradle check result for ad8d7f3: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❌ Gradle check result for 5c4128b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❌ Gradle check result for 47bef14: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #12927 +/- ##
============================================
- Coverage 71.42% 71.37% -0.05%
- Complexity 59978 60376 +398
============================================
Files 4985 5025 +40
Lines 282275 284175 +1900
Branches 40946 41165 +219
============================================
+ Hits 201603 202835 +1232
- Misses 63999 64520 +521
- Partials 16673 16820 +147 ☔ View full report in Codecov by Sentry. |
libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/DefaultSpanScope.java
Outdated
Show resolved
Hide resolved
libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java
Outdated
Show resolved
Hide resolved
libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/attributes/AttributeType.java
Outdated
Show resolved
Hide resolved
libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/noop/NoopSpan.java
Outdated
Show resolved
Hide resolved
libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/noop/NoopSpan.java
Outdated
Show resolved
Hide resolved
plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java
Outdated
Show resolved
Hide resolved
plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java
Outdated
Show resolved
Hide resolved
...metry-otel/src/main/java/org/opensearch/telemetry/tracing/sampler/InferredActionSampler.java
Show resolved
Hide resolved
plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelTracingTelemetry.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Nishchay Malhotra <114057571+nishchay21@users.noreply.github.com>
❕ Gradle check result for 178e48b: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java
Outdated
Show resolved
Hide resolved
libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/DefaultTracer.java
Outdated
Show resolved
Hide resolved
libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/TracerContextStorage.java
Outdated
Show resolved
Hide resolved
plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java
Show resolved
Hide resolved
plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/opensearch/telemetry/tracing/handler/TraceableTransportResponseHandler.java
Outdated
Show resolved
Hide resolved
libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com> Adding new sampled attributes Adding new inferred sampler attributes Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❌ Gradle check result for 7504c6a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❌ Gradle check result for a1b3df1: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Nishchay Malhotra <114057571+nishchay21@users.noreply.github.com>
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
❕ Gradle check result for b944c0b: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
@reta Can you help review the same |
I will try to find the time this week, thanks @nishchay21 |
.putAll(attributes) | ||
.build(); | ||
SamplingResult result = SamplingResult.recordAndSample(); | ||
return new OTelSamplingResult(result.getDecision(), customSampleAttributes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand what this sampler is doing: it is 100% (aka always) sampler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is 100% sampler. We are using batch span processor itself for the final export so that will only export the span if the span is marked to be sampled. This takes care of marking all span as sampled so that the spans can be exported. With this it also adds an attribute to the span, for the processor layer to know that the span has been sampled by this span processor.
Adding further, we have created a custom delegated processor layer which act as an intermediate layer between batch span processor and Otel framework and forwards the normal span i.e. spans which are not marked by this sampler directly to batch span processor and if the span is marked by this sampler then filter out spans that are outlier or their child are outlier.
Also adding further we need to mark the spans to be sampled beforehand so that the span can record information as well. If the span is not marked for sampling then it will not record any information as well and will act as a Noop span.
SamplingAttributes.INFERRED_SAMPLER.getValue() | ||
)) { | ||
if (span.getAttribute(AttributeKey.booleanKey(SamplingAttributes.SAMPLED.getValue())) != null) { | ||
this.delegateProcessor.onEnd(span); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come we are not calling delegateProcessor.onEnd(span)
in all branches? We don't know what delegateProcessor
is actually doing, it may track spans with onStart / onEnd calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegated Processor here is the BatchSpanProcessor itself. This custom processor is acting as an intermediate layer between BatchSpanProcessor and Otel framework and forwards the normal span i.e. spans which are not marked by inferred sampler directly to BatchSpanProcessor and if the span is marked by inferred sampler then filter out spans that are outlier or their children are outlier. This is done by adding a sampled attribute to the span itself so that the processor can distinguish between the outlier and non-outlier spans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot do that: the onStart / onEnd pair has to be called (by OTEL spec actually). The fact that we rely on BatchSpanProcessor
is risky implementation detail we should not rely upon: using LeakDetectingSpanProcessor or even our own StrictCheckSpanProcessor
will fail (the fact we have no failures means we have insufficient test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct @reta . However we cannot use BatchSpanProcessor as it-is because we want the processor to be attribute aware. What we can do is create a WrappedBatchSpanProcessor that will always initialize BatchSpanProcessor as the delegated processor and use the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are not following the specs (red flag here), the only way implement this feature is to provide our own SpanProcessor
that does not rely on any existing OTEL provided classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as I understand the concern here is that the onStart for certain spans will be called however the onEnd will not be called which could be an issue . Is that correct? So it would be better to have our own implementation for the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it violates specs and leaks spans (with delegation model)
@@ -32,9 +38,33 @@ public OTelSpan(String spanName, Span span, org.opensearch.telemetry.tracing.Spa | |||
|
|||
@Override | |||
public void endSpan() { | |||
if (isSpanOutlier()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How that is suppose to work in case of async calls when parent is finished before the child? (example below)
try (ScopedSpan startScopedSpan = tracer.startScopedSpan(...)) {
threadPool.executor(...).execute( ... )
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the parent span object is still maintained in the scope of child even if the parent has ended. What we plan to do here is mark the current child's parent and the chain above as one of the parent within the chain would still be alive. As suggested here as well if the parent is still in the recording phase we will go ahead and add the attribute to the parent and mark the parent to be sampled as well. If the parent is not recording then the information about this parent will be stored in event of its parent so that we don't loose on the info of not-recording spans.
Adding the closed parent event to its parent span is still pending and will add in the next revision itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the terminology here (what does it mean for span chain to be alive), sorry about that , let me try to illustrate that better:
try (ScopedSpan startScopedSpan = tracer.startScopedSpan(...)) {
threadPool.executor(...).execute( ... )
}
- since we sample 100%, all spans are sampled
- the parent span is created (
tracer.startScopedSpan
) , onSpanStart is called - the
threadPool
captures context (current span as parent) and schedules the work - the parent span is ended, onSpanEnd is called
After onSpanEnd
, the parent span is immutable, no matter if we still have reference to in from Java code or not - any modification to it have no effect but produce warning.
My question is: since we not be exporting the parent span (due to absence of the expected attributes), what would happen in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here is how this would work:
- If the span is an outlier span we will add the sampled attribute to the span.
- Once the span is marked to be sampled we will mark its parent to be sampled as well. This could now have two possibilities
- One, the parent is still recording. If that is the case we will go ahead and add the attribute to the parent as well and the parent would be sampled as well.
- Second, the parent is not recording/has ended in that case we mark its grand-parent if that is alive to be sampled and record this closed span information in the alive grand-parent itself. This way we will not get the span which was closed however the information about the span would still persist in its parent which will be sampled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I have suspected - we are messing up the trace in unpredictable ways (another red flag), there should be only one possibility: the trace is recorded as it was sampled or not recorded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are correct we will miss the span. However to maintain the lineage we will still maintain the information of closed span in its parent itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here is the thing folks, I my opinion, which I shared on RFC as well:
- if we cannot preserve span hierarchy, this is no go (the mess would be on user to deal with)
- if we cannot follow the rules of the library we are using, this is no go
OTEL cannot solve tail sampling on library level, this is documented and acknowledged:
- https://opentelemetry.io/blog/2022/tail-sampling/
- https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
There are other tracing instrumentations that try to implement sampling differently, including adaptive sampling that could tune the sampling rates based on the different factors:
- https://github.com/openzipkin-attic/zipkin-zookeeper
- https://medium.com/jaegertracing/adaptive-sampling-in-jaeger-50f336f4334
Please consider those as a robust and reliable alternatives, that do not exhibit flaws of this solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see where you're coming from regarding the potential gaps in the hierarchy, which users may need to address based on their choice of collector and visualization solutions. It's been recognized as a requirement to address such gaps in the Otel specification, and the community appears to be quite receptive to it. If you take a look at the Otel issues provided, you'll notice that even with these gaps, there are still ways to infer missing parent information through parent span events, aiding in understanding the code path and other relevant details.
open-telemetry/opentelemetry-specification#3205
open-telemetry/opentelemetry-specification#3867
In general, there are numerous benefits to this approach, particularly for customers dealing with larger clusters processing tens of thousands of requests per second and managing multiple clusters. It offers insights into problematic areas of code, and the availability of missing span information as part of the event in the parent span can be very helpful in connecting the dots.
Moreover, this won't be the default behavior and can be enabled as needed (though it's crucial to clearly document the shortcomings and benefits of such samplers). There have been requests from Otel users for deferred sampling support and reverse propagation of context, and it's hoped that these features will be available soon, at which point a switch can be made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gaganjuneja I shared my opinion on the subject many times, please let's get back to this when these features are available, if other maintainers see this is a way to go , I am fine with that, but strong -1 from me moving this further as it stands today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thing, I think we can solve this issue by adding a link between the span and it's grand parent to persist the relationship. Since this is a follows_from
span.
@nishchay21 @Gaganjuneja @reta From the discussion that I've read on this issue I believe this pull request should be closed unmerged. I'd recommend regrouping with a fresh RFC(s) or considerable investment in a POC with reliability tests (that directly address the questions raised here). |
This PR is stalled because it has been open for 30 days with no activity. |
I'm closing this PR out because it has stalled and there doesn't seem to be a good course of action to move it forward. I'd recommend regrouping with a fresh RFC(s) or considerable investment in a POC with reliability tests that directly address the questions raised here. |
Description
[Describe what this change achieves]
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List