Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding new inferred sampler #12927

Closed
wants to merge 14 commits into from

Conversation

nishchay21
Copy link
Contributor

@nishchay21 nishchay21 commented Mar 26, 2024

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

❌ Gradle check result for e02fc95: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

github-actions bot commented Mar 26, 2024

Compatibility status:

Checks if related components are compatible with change b944c0b

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/sql.git]

Copy link
Contributor

❌ Gradle check result for 78dcc6b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <114057571+nishchay21@users.noreply.github.com>
Copy link
Contributor

❌ Gradle check result for 432d294: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

❌ Gradle check result for ad8d7f3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

❌ Gradle check result for 5c4128b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

❌ Gradle check result for 47bef14: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

✅ Gradle check result for 4f0586a: SUCCESS

Copy link

codecov bot commented Mar 27, 2024

Codecov Report

Attention: Patch coverage is 78.04878% with 27 lines in your changes are missing coverage. Please review.

Project coverage is 71.37%. Comparing base (b15cb0c) to head (b944c0b).
Report is 287 commits behind head on main.

Files Patch % Lines
...racing/ThreadContextBasedTracerContextStorage.java 0.00% 10 Missing ⚠️
...ava/org/opensearch/telemetry/tracing/OTelSpan.java 70.83% 0 Missing and 7 partials ⚠️
...rg/opensearch/telemetry/tracing/DefaultTracer.java 75.00% 2 Missing and 1 partial ⚠️
...ing/handler/TraceableTransportResponseHandler.java 50.00% 2 Missing and 1 partial ⚠️
...telemetry/tracing/processor/OTelSpanProcessor.java 88.88% 0 Missing and 2 partials ⚠️
...tracing/channels/TraceableTcpTransportChannel.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12927      +/-   ##
============================================
- Coverage     71.42%   71.37%   -0.05%     
- Complexity    59978    60376     +398     
============================================
  Files          4985     5025      +40     
  Lines        282275   284175    +1900     
  Branches      40946    41165     +219     
============================================
+ Hits         201603   202835    +1232     
- Misses        63999    64520     +521     
- Partials      16673    16820     +147     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Nishchay Malhotra <114057571+nishchay21@users.noreply.github.com>
Copy link
Contributor

❕ Gradle check result for 178e48b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringQueryPhase
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

✅ Gradle check result for 6bd52e9: SUCCESS

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>

Adding new sampled attributes

Adding new inferred sampler attributes

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

❌ Gradle check result for 7504c6a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
@nishchay21 nishchay21 marked this pull request as ready for review April 1, 2024 04:41
Copy link
Contributor

github-actions bot commented Apr 1, 2024

❌ Gradle check result for a1b3df1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <114057571+nishchay21@users.noreply.github.com>
Copy link
Contributor

github-actions bot commented Apr 1, 2024

✅ Gradle check result for 1e1d624: SUCCESS

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

github-actions bot commented Apr 1, 2024

✅ Gradle check result for 8bce540: SUCCESS

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Copy link
Contributor

github-actions bot commented Apr 1, 2024

❕ Gradle check result for b944c0b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@nishchay21 nishchay21 requested a review from Gaganjuneja April 2, 2024 04:42
@nishchay21
Copy link
Contributor Author

@reta Can you help review the same

@reta
Copy link
Collaborator

reta commented Apr 3, 2024

@reta Can you help review the same

I will try to find the time this week, thanks @nishchay21

.putAll(attributes)
.build();
SamplingResult result = SamplingResult.recordAndSample();
return new OTelSamplingResult(result.getDecision(), customSampleAttributes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand what this sampler is doing: it is 100% (aka always) sampler?

Copy link
Contributor Author

@nishchay21 nishchay21 Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is 100% sampler. We are using batch span processor itself for the final export so that will only export the span if the span is marked to be sampled. This takes care of marking all span as sampled so that the spans can be exported. With this it also adds an attribute to the span, for the processor layer to know that the span has been sampled by this span processor.

Adding further, we have created a custom delegated processor layer which act as an intermediate layer between batch span processor and Otel framework and forwards the normal span i.e. spans which are not marked by this sampler directly to batch span processor and if the span is marked by this sampler then filter out spans that are outlier or their child are outlier.

Also adding further we need to mark the spans to be sampled beforehand so that the span can record information as well. If the span is not marked for sampling then it will not record any information as well and will act as a Noop span.

SamplingAttributes.INFERRED_SAMPLER.getValue()
)) {
if (span.getAttribute(AttributeKey.booleanKey(SamplingAttributes.SAMPLED.getValue())) != null) {
this.delegateProcessor.onEnd(span);
Copy link
Collaborator

@reta reta Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we are not calling delegateProcessor.onEnd(span) in all branches? We don't know what delegateProcessor is actually doing, it may track spans with onStart / onEnd calls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delegated Processor here is the BatchSpanProcessor itself. This custom processor is acting as an intermediate layer between BatchSpanProcessor and Otel framework and forwards the normal span i.e. spans which are not marked by inferred sampler directly to BatchSpanProcessor and if the span is marked by inferred sampler then filter out spans that are outlier or their children are outlier. This is done by adding a sampled attribute to the span itself so that the processor can distinguish between the outlier and non-outlier spans

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do that: the onStart / onEnd pair has to be called (by OTEL spec actually). The fact that we rely on BatchSpanProcessor is risky implementation detail we should not rely upon: using LeakDetectingSpanProcessor or even our own StrictCheckSpanProcessor will fail (the fact we have no failures means we have insufficient test coverage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct @reta . However we cannot use BatchSpanProcessor as it-is because we want the processor to be attribute aware. What we can do is create a WrappedBatchSpanProcessor that will always initialize BatchSpanProcessor as the delegated processor and use the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are not following the specs (red flag here), the only way implement this feature is to provide our own SpanProcessor that does not rely on any existing OTEL provided classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I understand the concern here is that the onStart for certain spans will be called however the onEnd will not be called which could be an issue . Is that correct? So it would be better to have our own implementation for the same?

Copy link
Collaborator

@reta reta Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it violates specs and leaks spans (with delegation model)

@@ -32,9 +38,33 @@ public OTelSpan(String spanName, Span span, org.opensearch.telemetry.tracing.Spa

@Override
public void endSpan() {
if (isSpanOutlier()) {
Copy link
Collaborator

@reta reta Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How that is suppose to work in case of async calls when parent is finished before the child? (example below)

       try (ScopedSpan startScopedSpan = tracer.startScopedSpan(...)) {
             threadPool.executor(...).execute( ... )
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the parent span object is still maintained in the scope of child even if the parent has ended. What we plan to do here is mark the current child's parent and the chain above as one of the parent within the chain would still be alive. As suggested here as well if the parent is still in the recording phase we will go ahead and add the attribute to the parent and mark the parent to be sampled as well. If the parent is not recording then the information about this parent will be stored in event of its parent so that we don't loose on the info of not-recording spans.

Adding the closed parent event to its parent span is still pending and will add in the next revision itself.

Copy link
Collaborator

@reta reta Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the terminology here (what does it mean for span chain to be alive), sorry about that , let me try to illustrate that better:

 try (ScopedSpan startScopedSpan = tracer.startScopedSpan(...)) {
           threadPool.executor(...).execute( ... )
  }
  • since we sample 100%, all spans are sampled
  • the parent span is created (tracer.startScopedSpan) , onSpanStart is called
  • the threadPool captures context (current span as parent) and schedules the work
  • the parent span is ended, onSpanEnd is called

After onSpanEnd, the parent span is immutable, no matter if we still have reference to in from Java code or not - any modification to it have no effect but produce warning.

My question is: since we not be exporting the parent span (due to absence of the expected attributes), what would happen in this case?

Copy link
Contributor Author

@nishchay21 nishchay21 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here is how this would work:

  • If the span is an outlier span we will add the sampled attribute to the span.
  • Once the span is marked to be sampled we will mark its parent to be sampled as well. This could now have two possibilities
    • One, the parent is still recording. If that is the case we will go ahead and add the attribute to the parent as well and the parent would be sampled as well.
    • Second, the parent is not recording/has ended in that case we mark its grand-parent if that is alive to be sampled and record this closed span information in the alive grand-parent itself. This way we will not get the span which was closed however the information about the span would still persist in its parent which will be sampled

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I have suspected - we are messing up the trace in unpredictable ways (another red flag), there should be only one possibility: the trace is recorded as it was sampled or not recorded.

Copy link
Contributor Author

@nishchay21 nishchay21 Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are correct we will miss the span. However to maintain the lineage we will still maintain the information of closed span in its parent itself.

Copy link
Collaborator

@reta reta Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here is the thing folks, I my opinion, which I shared on RFC as well:

  • if we cannot preserve span hierarchy, this is no go (the mess would be on user to deal with)
  • if we cannot follow the rules of the library we are using, this is no go

OTEL cannot solve tail sampling on library level, this is documented and acknowledged:

There are other tracing instrumentations that try to implement sampling differently, including adaptive sampling that could tune the sampling rates based on the different factors:

Please consider those as a robust and reliable alternatives, that do not exhibit flaws of this solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you're coming from regarding the potential gaps in the hierarchy, which users may need to address based on their choice of collector and visualization solutions. It's been recognized as a requirement to address such gaps in the Otel specification, and the community appears to be quite receptive to it. If you take a look at the Otel issues provided, you'll notice that even with these gaps, there are still ways to infer missing parent information through parent span events, aiding in understanding the code path and other relevant details.

open-telemetry/opentelemetry-specification#3205
open-telemetry/opentelemetry-specification#3867

In general, there are numerous benefits to this approach, particularly for customers dealing with larger clusters processing tens of thousands of requests per second and managing multiple clusters. It offers insights into problematic areas of code, and the availability of missing span information as part of the event in the parent span can be very helpful in connecting the dots.

Moreover, this won't be the default behavior and can be enabled as needed (though it's crucial to clearly document the shortcomings and benefits of such samplers). There have been requests from Otel users for deferred sampling support and reverse propagation of context, and it's hoped that these features will be available soon, at which point a switch can be made.

Copy link
Collaborator

@reta reta Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gaganjuneja I shared my opinion on the subject many times, please let's get back to this when these features are available, if other maintainers see this is a way to go , I am fine with that, but strong -1 from me moving this further as it stands today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing, I think we can solve this issue by adding a link between the span and it's grand parent to persist the relationship. Since this is a follows_from span.

@peternied
Copy link
Member

@nishchay21 @Gaganjuneja @reta From the discussion that I've read on this issue I believe this pull request should be closed unmerged.

I'd recommend regrouping with a fresh RFC(s) or considerable investment in a POC with reliability tests (that directly address the questions raised here).

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label May 16, 2024
@peternied
Copy link
Member

I'm closing this PR out because it has stalled and there doesn't seem to be a good course of action to move it forward. I'd recommend regrouping with a fresh RFC(s) or considerable investment in a POC with reliability tests that directly address the questions raised here.

@peternied peternied closed this May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stalled Issues that have stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants