
feat(llmobs): llmobs context provider #10138

Closed
Yun-Kim wants to merge 5 commits into main from yunkim/llmobs-context-management

Conversation

@Yun-Kim (Contributor) commented Aug 8, 2024

tl;dr: This PR adds an additional context provider to the LLMObs service to keep track of the active LLMObs-type span.

This PR also adds a convenience LLMObs.current_span() method to get the current LLMObs-type span, and uses it in our existing public methods that fall back to the current span when one is not provided by the user as an argument (e.g. LLMObs.export_span(span=None, ...), LLMObs.annotate(span=None, ...)).
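A minimal usage sketch of the convenience method; the surrounding function is hypothetical, and the input_data keyword is assumed from the existing annotate API rather than taken from this PR:

from ddtrace.llmobs import LLMObs

def summarize(prompt):
    span = LLMObs.current_span()                    # nearest active LLMObs-type span, or None
    LLMObs.annotate(span=span, input_data=prompt)   # span=None would fall back to current_span()
    return LLMObs.export_span(span=span)            # likewise defaults to the current LLMObs span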

Workflow

When an LLMObs span is started (i.e. LLMObs._start_span(...), BaseLLMIntegration.trace(submit_to_llmobs=True, ...), BedrockIntegration.set_llmobs_tags(...)), we look up the currently active LLMObs span, use its span ID to set the parent_id field of the new span, and then activate the new span as the current active LLMObs span.
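A simplified sketch of that flow; the helper name, the llmobs_context_provider wiring, and the use of set_tag_str are illustrative rather than the PR's actual implementation:

from ddtrace import tracer

PARENT_ID_KEY = "_ml_obs.llmobs_parent_id"  # LLMObs parent-ID tag key (exact value changes in this PR)

def start_llmobs_span(llmobs_context_provider, name):
    span = tracer.trace(name, span_type="llm")
    active = llmobs_context_provider.active()        # current LLMObs span (or None)
    if active is not None:
        # Parent the new LLMObs span to the nearest LLMObs ancestor,
        # skipping over any APM-only spans in between.
        span.set_tag_str(PARENT_ID_KEY, str(active.span_id))
    llmobs_context_provider.activate(span)            # new span becomes the active LLMObs span
    return span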

The LLMObsContextProvider determines the current active LLMObs span by walking the span's ancestor chain and stopping at the closest LLMObs-type span that has not yet finished.
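A sketch of that ancestor walk, assuming direct access to the internal Span._parent link:

def nearest_llmobs_ancestor(span):
    ancestor = span._parent                           # Span._parent is internal; used here for illustration
    while ancestor is not None:
        if ancestor.span_type == "llm" and not ancestor.finished:
            return ancestor                           # closest unfinished LLMObs-type ancestor
        ancestor = ancestor._parent
    return None                                       # no LLMObs ancestor: this span starts a new LLMObs trace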

For distributed tracing scenarios, we inject an LLMObs parent ID into the request headers. On the receiving side, we create a Context object from the extracted LLMObs parent ID header and activate it in the new process.
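A hedged sketch of that propagation path; the header name and helper functions are illustrative, not the exact ones added by this PR:

LLMOBS_PARENT_ID_HEADER = "x-datadog-llmobs-parent-id"  # hypothetical header key

def inject_llmobs_parent_id(headers, active_llmobs_span):
    # Injection side: carry the LLMObs parent across the process boundary.
    if active_llmobs_span is not None:
        headers[LLMOBS_PARENT_ID_HEADER] = str(active_llmobs_span.span_id)

def activate_llmobs_headers(headers, llmobs_context_provider, make_context):
    # Receiving side: rebuild a context carrying the propagated LLMObs parent ID
    # and activate it so locally started LLMObs spans parent correctly.
    llmobs_parent_id = headers.get(LLMOBS_PARENT_ID_HEADER)
    if llmobs_parent_id is not None:
        llmobs_context_provider.activate(make_context(llmobs_parent_id))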

Context

Spans submitted to LLM Observability are technically APM spans (with enhanced metadata/metrics and a different span structure) marked with the llm span_type. This means APM span, parent, and trace IDs are all reused in LLMObs spans.

Problem: if APM-specific spans (from integrations or manual instrumentation) are intermixed with LLMObs spans in a trace, the trace submitted to LLM Observability can contain spans whose parent_id references a span that does not exist in LLM Observability storage, because APM-only spans are not ingested into LLM Observability (example trace structure below):

Span 1 (LLMObs) --> Span 2 (APM) --> Span 3 (LLMObs)
# Span 3's parent_id is equal to span 2's span_id, but span 2 does not exist in LLM Obs ==> broken LLM Obs trace

Initial Solution: Use Span Context

In an effort to reuse as many tracer internals as possible, internal LLMObs trace processing traversed a span's parent tree to find the nearest LLMObs-type ancestor span and used that span's span_id as the parent_id. For distributed cases, the LLMObs parent ID was injected into the request headers and then extracted into the underlying span Context object's context._meta dictionary field.
This worked for the most part, but broke down in multithreaded applications: the Context._meta field is shallow-copied and therefore shared between threads, making it susceptible to race conditions where the propagated LLMObs parent ID gets overwritten by spans started in other threads.
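A toy reproduction of that failure mode, using a plain dict to stand in for the shared Context._meta:

import threading

shared_meta = {}  # stands in for a Context._meta dict shared across threads

def start_span_in_thread(parent_id):
    # Each thread "propagates" its own LLMObs parent ID, but all threads write
    # to the same dict, so the last writer wins for everyone.
    shared_meta["_dd.p.llmobs_parent_id"] = parent_id

threads = [threading.Thread(target=start_span_in_thread, args=(str(i),)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_meta)  # only one parent ID survives, clobbering the others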

Current Approach: Manage our own LLMObs contexts

This approach is similar to the first, but instead of pressing the Context._meta field into a use case it was not designed for, we create our own context provider to track the active LLMObs span/context. Because it relies on ContextVar, which is isolated per thread (and per async task), it avoids the multithreading problems described above.
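A minimal sketch of such a provider; the class name matches the PR, but the body is illustrative only:

import contextvars

class LLMObsContextProvider:
    def __init__(self):
        # ContextVar values are isolated per thread and per async task, so concurrent
        # requests cannot clobber each other's active LLMObs span.
        self._current = contextvars.ContextVar("llmobs_active", default=None)

    def activate(self, span_or_context):
        self._current.set(span_or_context)

    def active(self):
        return self._current.get()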

Since we are relying on the new LLMObsContextProvider to manage and update the active LLMObs span (which reuses the same logic of traversing the parent tree for the nearest LLMObs span), we can get rid of all of the special casing required for dealing with Context-propagated tags.

The only changes needed in shared library code are adding distributed injection code to HTTPPropagator.inject(), activating the propagated request headers in trace_utils.activate_distributed_headers(), and adding context propagation/activation to the threading patch code (see the sketch below).
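An illustrative sketch of the threading-side propagation: capture the active LLMObs context when work is submitted and re-activate it inside the worker. All names here are placeholders for the patched futures/threading code:

def wrap_for_thread(llmobs_context_provider, func):
    captured = llmobs_context_provider.active()       # snapshot taken on the submitting thread

    def wrapped(*args, **kwargs):
        llmobs_context_provider.activate(captured)    # re-activate in the worker thread
        return func(*args, **kwargs)

    return wrapped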

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy


github-actions bot commented Aug 8, 2024

CODEOWNERS have been resolved as:

ddtrace/llmobs/_context.py                                              @DataDog/ml-observability
ddtrace/contrib/internal/futures/threading.py                           @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/trace_utils.py                                          @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/llmobs/_constants.py                                            @DataDog/ml-observability
ddtrace/llmobs/_integrations/base.py                                    @DataDog/ml-observability
ddtrace/llmobs/_integrations/bedrock.py                                 @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
ddtrace/llmobs/_utils.py                                                @DataDog/ml-observability
ddtrace/propagation/http.py                                             @DataDog/apm-sdk-api-python


pr-commenter bot commented Aug 8, 2024

Benchmarks

Benchmark execution time: 2024-08-09 20:55:39

Comparing candidate commit 79303da in PR branch yunkim/llmobs-context-management with baseline commit 6f46cc7 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 214 metrics, 2 unstable metrics.

@Yun-Kim force-pushed the yunkim/llmobs-context-management branch from 3af4255 to 86dc15e on August 8, 2024 18:27
@@ -3,8 +3,7 @@
 METADATA = "_ml_obs.meta.metadata"
 METRICS = "_ml_obs.metrics"
 ML_APP = "_ml_obs.meta.ml_app"
-PROPAGATED_PARENT_ID_KEY = "_dd.p.llmobs_parent_id"
-PARENT_ID_KEY = "_ml_obs.llmobs_parent_id"
+PARENT_ID_KEY = "_dd.p.llmobs_parent_id"
@Yun-Kim (Contributor, Author) commented Aug 8, 2024

Unsure if we need to rename this to a different key name. The _dd.p.* format was required because we were injecting the key-value pair into context._meta, but if we are injecting/extracting directly from the request headers, it might not be needed.
Also: are there any requirements on key naming when injecting into request headers?

@Yun-Kim (Contributor, Author) commented Sep 23, 2024

Closing this due to annoying merge conflicts from upstream changes. Replacing this with #10767.

@Yun-Kim closed this Sep 23, 2024