
[MLOB-1804] feat(langchain): add langchain instrumentation #4860

Merged 22 commits into master from sabrenner/langchain on Nov 20, 2024

Conversation

@sabrenner sabrenner (Collaborator) commented Nov 5, 2024

What does this PR do?

Adds initial instrumentation support for LangChain >=0.1, specifically for:

  1. chain.invoke and chain.batch (from @langchain/core/runnables/base)
  2. chat_model.generate (and inherently chat_model.invoke, from @langchain/core/language_models/chat_models)
  3. llm.generate (and inherently llm.invoke, from @langchain/core/language_models/llms)
  4. openaiEmbeddings.embedQuery (from @langchain/openai/embeddings)
  5. openaiEmbeddings.embedDocuments (from @langchain/openai/embeddings)
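For illustration, here is a minimal LCEL-style script that would exercise several of these patched entry points (the chain, prompt, and inputs are placeholders; credentials are read from OPENAI_API_KEY):

```js
// Minimal usage sketch, not part of this PR: each call below corresponds to one
// of the instrumented entry points listed above.
const { ChatOpenAI, OpenAIEmbeddings } = require('@langchain/openai')
const { ChatPromptTemplate } = require('@langchain/core/prompts')
const { StringOutputParser } = require('@langchain/core/output_parsers')

const chain = ChatPromptTemplate.fromTemplate('Tell me a joke about {topic}')
  .pipe(new ChatOpenAI()) // reads OPENAI_API_KEY from the environment
  .pipe(new StringOutputParser())

async function main () {
  await chain.invoke({ topic: 'tracing' })                  // chain.invoke
  await chain.batch([{ topic: 'cats' }, { topic: 'dogs' }]) // chain.batch
  await new OpenAIEmbeddings().embedQuery('hello world')    // embedQuery
}

main()
```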

We're restricting support to >=0.1, as earlier versions are slated for deprecation in an upcoming LangChain release (support for them can be added on request). Additionally, >=0.1 has stable support for LCEL invocations, which this PR instruments.

This instrumentation will happen on any langchain, @langchain/core, or @langchain/openai imports.

Additionally, we're adding DD_LANGCHAIN_SPAN_CHAR_LIMIT and DD_LANGCHAIN_SPAN_PROMPT_COMPLETION_SAMPLE_RATE to control the truncation length of tagged text (I/O) and the sampling rate for prompts and completions, respectively.
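As a rough sketch of how these two environment variables might feed the plugin configuration (the property names and default values below are assumptions for illustration, not the plugin's actual defaults):

```js
// Hedged sketch only: the real config property names and defaults may differ.
const spanCharLimit =
  Number(process.env.DD_LANGCHAIN_SPAN_CHAR_LIMIT) || 128

const rawRate = process.env.DD_LANGCHAIN_SPAN_PROMPT_COMPLETION_SAMPLE_RATE
const spanPromptCompletionSampleRate =
  rawRate !== undefined ? Number(rawRate) : 1.0

module.exports = { spanCharLimit, spanPromptCompletionSampleRate }
```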

Motivation

We want to ship an initial APM integration. MLOB will later build our LLMObs integration on top of it, and will continue developing this integration and supporting new features.

Plugin Checklist

Additional Notes

Will be putting up docs PRs separately.

There is also an ESM issue with LangChain, specifically with LangSmith, which is not required to use LangChain but is bundled with it nonetheless. I will open a separate issue for this.

I will also investigate whether a workaround is possible for this scenario. This PR does not include fixes for ESM usage of langchain, but it does include the patching itself (by patching both the js and cjs extensions).

github-actions bot commented Nov 5, 2024

Overall package size

Self size: 8.09 MB
Deduped: 94.59 MB
No deduping: 94.93 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| @datadog/libdatadog | 0.2.2 | 29.27 MB | 29.27 MB |
| @datadog/native-appsec | 8.3.0 | 19.37 MB | 19.38 MB |
| @datadog/native-iast-taint-tracking | 3.2.0 | 13.9 MB | 13.91 MB |
| @datadog/pprof | 5.4.1 | 9.76 MB | 10.13 MB |
| protobufjs | 7.2.5 | 2.77 MB | 5.16 MB |
| @datadog/native-iast-rewriter | 2.5.0 | 2.51 MB | 2.65 MB |
| @opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB |
| @datadog/native-metrics | 3.0.1 | 1.06 MB | 1.46 MB |
| @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB |
| import-in-the-middle | 1.11.2 | 112.74 kB | 826.22 kB |
| msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| lru-cache | 7.18.3 | 133.92 kB | 133.92 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| @datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB |
| semver | 7.6.3 | 95.82 kB | 95.82 kB |
| lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB |
| ignore | 5.3.1 | 51.46 kB | 51.46 kB |
| int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB |
| shell-quote | 1.8.1 | 44.96 kB | 44.96 kB |
| istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB |
| rfdc | 1.3.1 | 25.21 kB | 25.21 kB |
| @isaacs/ttlcache | 1.4.1 | 25.2 kB | 25.2 kB |
| tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB |
| limiter | 1.1.5 | 23.17 kB | 23.17 kB |
| dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB |
| retry | 0.13.1 | 18.85 kB | 18.85 kB |
| jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB |
| crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB |
| koalas | 1.0.2 | 6.47 kB | 6.47 kB |
| path-to-regexp | 0.1.10 | 6.38 kB | 6.38 kB |
| module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

codecov bot commented Nov 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.16%. Comparing base (564795f) to head (3c7dbd4).
Report is 23 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4860       +/-   ##
===========================================
+ Coverage   79.17%   91.16%   +11.99%     
===========================================
  Files         273      129      -144     
  Lines       12427     4461     -7966     
===========================================
- Hits         9839     4067     -5772     
+ Misses       2588      394     -2194     


pr-commenter bot commented Nov 5, 2024

Benchmarks

Benchmark execution time: 2024-11-20 13:17:15

Comparing candidate commit 34c86a4 in PR branch sabrenner/langchain with baseline commit 1ee8000 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 261 metrics, 5 unstable metrics.

@Yun-Kim Yun-Kim (Contributor) left a comment

Great work so far! Left a couple questions but mostly looks good 👍


// OpenAI (and Embeddings in general) do not define an lc_namespace
const namespace = ['langchain', 'embeddings', 'openai']
shimmer.wrap(OpenAIEmbeddings.prototype, 'embedDocuments', embedDocuments =>
Contributor

In LangChain's Python implementation, embedDocuments/embedQuery are abstract methods, which made patching them awkward (patching the base class method does not wrap the individual embedding classes). Is that not the case for LangChain's Node.js library?

Collaborator Author

It's actually the same for the Node.js library, as these TypeScript abstract methods are not compiled into properties we can patch in plain JS. So I'm only patching the OpenAI embeddings here, as I tried to patch the base Embeddings class and couldn't 😞
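To illustrate the point, a sketch under the assumption that the abstract declaration leaves nothing on the base prototype at runtime, so only concrete subclasses can be wrapped:

```js
// Sketch only: not the PR's actual patch code.
const shimmer = require('shimmer')
const { OpenAIEmbeddings } = require('@langchain/openai')

// Embeddings.prototype.embedQuery does not exist in the compiled JS, because the
// method is only declared abstract in TypeScript, so wrapping the base class is a
// no-op. The concrete subclass does define it, so that is what gets patched:
shimmer.wrap(OpenAIEmbeddings.prototype, 'embedQuery', embedQuery => function (...args) {
  // a real integration would start/finish a span around this call
  return embedQuery.apply(this, args)
})
```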

Collaborator Author

ref (an 'outdated' comment that really isn't); I think it still holds for this point 😅

Contributor

Understood, we should actually chat more about supporting edge cases like this (let's bring this up in our core-obs sync next week)

Member

typically in patterns like this there is a common place where the objects are instantiated or interacted with early on that we can use as a hook point for patching/wrapping. But yeah it's annoying that it's not generalizable.

packages/datadog-plugin-langchain/src/handlers/chain.js (outdated; resolved)

const tags = {}

// TODO: do we need token tagging
Contributor

We should have token metrics tagged; I know LangChain supports token metrics for OpenAI at least.

Collaborator Author

To clarify this (as my TODO isn't super helpful beyond face value 😅): referring to the cost tagging in our existing LangChain integration, I don't believe we have an equivalent of get_openai_token_cost_for_model available to us through LangChain here. Should we just tag langchain.tokens.{prompt, completion, total}_tokens on the span instead and forgo the cost metric?

Contributor

Does LangChain expose token metrics on the returned OpenAI response object? If so, it should be as simple as extracting those token metrics and setting them as span tags. Let me know otherwise 👍

Collaborator Author

yep we can grab the token metrics themselves. i'll add them as tags!

Collaborator Author

added this in for openai response objects: 0ecb096. when I do the llmobs integration, i'll try to make it provider-agnostic as we did for the Python LLMObs integration.
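For reference, a hedged sketch of what that tagging might look like for OpenAI responses. The field names follow LangChain's llmOutput.tokenUsage shape for OpenAI as I understand it and may differ by provider or version; the tag names mirror the ones discussed above, and tagTokenUsage is a hypothetical helper, not the code from that commit:

```js
// Sketch only, not the code from 0ecb096.
function tagTokenUsage (tags, result) {
  const tokenUsage = (result.llmOutput && result.llmOutput.tokenUsage) || {}
  if (tokenUsage.totalTokens === undefined) return // provider did not report usage

  tags['langchain.tokens.prompt_tokens'] = tokenUsage.promptTokens
  tags['langchain.tokens.completion_tokens'] = tokenUsage.completionTokens
  tags['langchain.tokens.total_tokens'] = tokenUsage.totalTokens
}
```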

Comment on lines 40 to 44
tags[`langchain.response.outputs.${idx}.embedding_length`] = output.length
}
} else {
tags['langchain.response.outputs.embedding_length'] = result.length
}
Contributor

This is fine as is, but I wonder if we should just tag one embedding length tag instead of one per entry since every model will output the same dimension embeddings for every input (since dimensions are a model-specific param AFAIK).

Collaborator Author

Ah yes, good point. I think I lifted this right from the Python integration, so as long as we're OK deviating here, it probably makes more sense to have just one tag.

Contributor

Haha yup I noticed we do the same in the Python integration, but wanted to make sure we don't repeat the same mistakes when possible (it'll be harder to remove this from the Python integration but I'd rather we not have this unnecessary tagging to begin with)

Collaborator Author

yep makes sense. i added it in this commit
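A sketch of the single-tag approach (illustrative only; tagEmbeddingLength is a hypothetical helper name, not necessarily the exact code in that commit):

```js
// One embedding_length tag regardless of how many outputs there are, since all
// embeddings from a given model share the same dimensionality.
function tagEmbeddingLength (tags, result) {
  if (Array.isArray(result[0])) {
    // embedDocuments: result is an array of embeddings
    tags['langchain.response.outputs.embedding_length'] = result[0].length
  } else {
    // embedQuery: result is a single embedding
    tags['langchain.response.outputs.embedding_length'] = result.length
  }
}
```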

@sabrenner sabrenner marked this pull request as ready for review November 15, 2024 21:47
@sabrenner sabrenner requested a review from a team as a code owner November 15, 2024 21:47
@Kyle-Verhoog Kyle-Verhoog (Member) left a comment

did a first pass. Really like the handler design for splitting out the span logic between components.

Just a few smaller things but don't see anything majorly wrong.

Nice work 👏 👏



for (const [key, value] of Object.entries(input)) {
// these are mappings to the python client names, ie lc_kwargs
// only present on BaseMessage types
if (key.includes('lc')) continue
Member

Maybe I'm missing something but I'm not following why we're skipping lc-containing keys here. Are they internal to LangChain?

Collaborator Author

they are internal fields, and often don't hold relevant information (there's a field called lc_serializable, which is just a boolean not relevant to the chain), or have duplicate information (ie lc_namespace, which we use for the resource name, or lc_kwargs, which are duplicates of some of the input values, which we already tag).

While some of them won't pass the truncate/normalize check anyway, since they are not strings, I added this to be safe and avoid useless tags. We can add them down the road if there's a use case for them!

Member

ah ok I see! should the string check be startsWith('lc_') then?

Collaborator Author

i think that's fair!
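A sketch of the suggested prefix check (the helper shape and tagging details here are hypothetical; only the startsWith('lc_') filter is the point):

```js
function collectTaggableInputKeys (input) {
  const keys = []
  for (const key of Object.keys(input)) {
    // skip LangChain-internal fields such as lc_serializable, lc_namespace, lc_kwargs
    if (key.startsWith('lc_')) continue
    keys.push(key)
  }
  return keys
}
```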

packages/datadog-plugin-langchain/src/handlers/default.js (outdated; resolved)

for (const messageIndex in messageSet) {
const message = messageSet[messageIndex]
if (this.isPromptCompletionSampled()) {
Member

Is the sampling call cached or really cheap? Might be worth memoizing here.

Collaborator Author

Good question: the sampler just uses Math.random to decide whether to sample. The call itself should be fairly cheap.
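For context, a minimal sketch of a rate-based sampler along these lines (the class and method names are illustrative, not the tracer's internals):

```js
class PromptCompletionSampler {
  constructor (rate = 1.0) {
    this.rate = rate
  }

  // One Math.random() call per check, so memoization would save very little.
  isSampled () {
    return this.rate >= 1 || Math.random() < this.rate
  }
}

const sampler = new PromptCompletionSampler(0.5)
console.log(sampler.isSampled()) // true for roughly half of the calls
```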

Kyle-Verhoog previously approved these changes Nov 19, 2024
@Kyle-Verhoog Kyle-Verhoog (Member) left a comment

nothing blocking, good stuff!

const max = this.config.spanCharLimit

text = text
.replace(RE_NEWLINE, '\\n')
Member

Do you ever have to deal with Windows newlines? e.g. \r\n?

Member

Probably not if this is a server response. But if it's data provided from the app or user then it could be a concern.

Collaborator Author

it could be data from the user, as we'll use this function to truncate input text (and then also output text from the server response)
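A hedged sketch of handling both newline styles before truncation (RE_NEWLINE is assumed to be /\r?\n/g here; the plugin's actual regex, limit, and helper name may differ):

```js
const RE_NEWLINE = /\r?\n/g // covers both Unix (\n) and Windows (\r\n) newlines

function normalizeAndTruncate (text, max) {
  if (typeof text !== 'string' || text.length === 0) return
  const normalized = text.replace(RE_NEWLINE, '\\n')
  return normalized.length > max ? `${normalized.slice(0, max)}...` : normalized
}
```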

tlhunter previously approved these changes Nov 19, 2024
@sabrenner sabrenner dismissed stale reviews from tlhunter and Kyle-Verhoog via 34c86a4 November 20, 2024 13:08
@sabrenner sabrenner merged commit c8ab3e4 into master Nov 20, 2024
236 checks passed
@sabrenner sabrenner deleted the sabrenner/langchain branch November 20, 2024 16:12
if (!text) return
if (typeof text !== 'string' || !text || (typeof text === 'string' && text.length === 0)) return

const max = this.config.spanCharLimit

[q] How is this value initialized? It seems like we have a few different *SpanCharLimit variables, but I'm not clear on how this config got set

Collaborator Author

Yeah, good question. We initialize these handlers here, passing in the tracer config's langchain property, which we populate with defaults and from the env vars.

I'll be refactoring this integration along with OpenAI to have shared logic for this truncation and prompt/completion sampling, so when I do I'll probably rename this variable to be langchainConfig or something 😅
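For anyone configuring this programmatically, a hypothetical example of supplying the same settings through the tracer config. The option names mirror spanCharLimit and the sample-rate setting discussed above and are assumptions, not documented API:

```js
// Sketch only: option names are assumed, not taken from the dd-trace docs.
const tracer = require('dd-trace').init()

tracer.use('langchain', {
  spanCharLimit: 128,
  spanPromptCompletionSampleRate: 0.5
})
```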

rochdev pushed a commit that referenced this pull request Nov 21, 2024
* wip

* wip

* first pass at chain invoke and chat,llm generate

* add langchain openai embeddings

* add batch call

* change api key logic

* testing

* ts def changes

* codeowners changes

* add clarifying issue as reason for skipping esm tests

* fix langchain patching for possible esm files vs commonjs files, namespace

* configurable truncation and prompt completion sampling

* remove unneeded util file

* remove some unneeded code

* fix patching esm vs cjs issues

* json stringify non-string chain outputs

* apikey, model, provider should no-op by default

* add some token handling logic

* review comments

* check lc_ for ignored properties
@rochdev rochdev mentioned this pull request Nov 21, 2024
@ianwoodfill

This is great, very excited for this. Do you have any insight into ESM support timelines or workarounds? Thank you!
