Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[async hooks] proposal for standard CLS API - request for feedback #345

Closed
puzpuzpuz opened this issue Dec 18, 2019 · 30 comments
Closed

[async hooks] proposal for standard CLS API - request for feedback #345

puzpuzpuz opened this issue Dec 18, 2019 · 30 comments
Labels

Comments

@puzpuzpuz
Copy link
Member

puzpuzpuz commented Dec 18, 2019

Hi guys,

I believe that upcoming executionAsyncResource() function (nodejs/node#30959) will allow building a simple and robust CLS API as a part of async_hooks. I've created an experimental PR that shows how this API could look like: nodejs/node#31016

Early reviews and feedback from you would be very valuable.

@Qard
Copy link
Member

Qard commented Dec 19, 2019

There's also nodejs/node#26540, which is currently under active development too. It's worth comparing the two to consider the flexibility and performance of each. It's a fairly performance-sensitive area, and is prone to memory issues. We should consider carefully what our needs are from this API.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 19, 2019

@Qard
I'm aware of nodejs/node#26540 (and of nodejs/node#27172). I have described design ideas in that PR, but as nobody was interested I decided to implement them myself.

The description section in my PR has an complete overview, but let me compare my implementation with nodejs/node#26540 here.

Pros of proposed API:

  • Safer memory-wise:
    • Does not depend on destroy hook, thus doesn't lead to mem leak in misbehaving AsyncResource scenarios.
    • Allows complete disposal of the underlying hook callback and all values by calling .remove() method. Thus, you don't have to restart your node process to remove CLS or replace it with another one.
  • Should show at least same performance.
    • Deals with less data structures internally and does less operations in overall.
    • If you're interested in benchmark comparison, I'm willing to write a benchmark and compare both implementations. Just point me at a good example of microbenchmark (or benchmark) that is considered good enough by node core team.
    • Update: see [async hooks] proposal for standard CLS API - request for feedback #345 (comment) for benchmark results.
  • Has simpler API, yet provides same features.
    • Does not have .enter() method. API is only 2 mandatory (.get()/.set(value)) and one optional (.remove()) methods.
  • Has copy-on-write semantics.
    • I believe that this is a very important property. As a real world example, without this property implementing a request id tracing library based on CLS API (like this one: https://github.com/puzpuzpuz/cls-rtracer) is very tricky.

Cons of proposed API:

  • The only disadvantage I know is that .set(value) method has to be called outside of application bootstrap (that's inherited from executionAsyncResource()). But it's not critical at all for a CLS API.

Of course, I'm prepared for my PR to be totally ignored, as it was with nodejs/node#27172.

@Flarna
Copy link
Member

Flarna commented Dec 19, 2019

As already mentioned in nodejs/node#26540 the async enter() is a blocker for using it in in my case. Also the missing propagation for async functions is a problem.
As nodejs/node#26540 tends to use propagation using the trigger path it would require a triggeringAsyncResource() api to avoid the complex implementation using weakrefs,...

What I miss in general in both CLS APIs is the ability to stop propagation or avoid it for some case. e.g. it's formal correct that a context is propagated forever in case of e.g. a setInterval but practically this is a leak and not what most users want.

Besides that I miss functionality to allow to trigger other side effects on context change. See for example how domains use async hooks - in principle it's a CLS but they have the need to do something on context change. So both proposed CLS APIs are able to replace the use of async hooks in domains.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 19, 2019

@Flarna

What I miss in general in both CLS APIs is the ability to stop propagation or avoid it for some case. e.g. it's formal correct that a context is propagated forever in case of e.g. a setInterval but practically this is a leak and not what most users want.

In case of setInterval, the .remove() method will help as it stops propagation and removes all stored values in the AsyncLocal. It's not the same as stopping propagation of a particular async call chain branch, but nothing prevents for such API to be added. Say, a separate WeakSet could be introduce. It would store all resources for which the propagation must stop.

Besides that I miss functionality to allow to trigger other side effects on context change. See for example how domains use async hooks - in principle it's a CLS but they have the need to do something on context change. So both proposed CLS APIs are able to replace the use of async hooks in domains.

That sounds like an advanced API to me, which is not required in my use cases. But if there will be multiple votes to include it into the API, I could implement it in a similar manner as in your implementation.

@Flarna
Copy link
Member

Flarna commented Dec 19, 2019

In case of setInterval, the .remove() method will help as it stops propagation and removes all stored values in the AsyncLocal.

This requires an extra, manual call at all these locations which is exactly the opposite of a "just works" CLS as you have to clutter your application code with hints towards CLS instead of one time configuration.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 19, 2019

This requires an extra, manual call at all these locations which is exactly the opposite of a "just works" CLS as you have to clutter your application code with hints towards CLS instead of one time configuration.

I can hardly believe in existence of "just works" stop-context-propagation configuration API. Exact logic heavily depends on the application needs.

But I believe that having a way to completely stop the propagation and free the memory is a must for a CLS API.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 19, 2019

I have modified async-resource-vs-destroy.js benchmark and compared proposed implementation with #26540. Here is the result which I got on my machine:

$ ./node benchmark/async_hooks/async-resource-vs-destroy.js benchmarker=autocannon
async_hooks/async-resource-vs-destroy.js n=1000000 method="callbacks" type="async-local" benchmarker="autocannon": 20,277.6
async_hooks/async-resource-vs-destroy.js n=1000000 method="async" type="async-local" benchmarker="autocannon": 15,877.6
async_hooks/async-resource-vs-destroy.js n=1000000 method="callbacks" type="async-context" benchmarker="autocannon": 16,922.41
async_hooks/async-resource-vs-destroy.js n=1000000 method="async" type="async-context" benchmarker="autocannon": 11,841.2
async_hooks/async-resource-vs-destroy.js n=1000000 method="callbacks" type="async-resource" benchmarker="autocannon": 23,582.4
async_hooks/async-resource-vs-destroy.js n=1000000 method="async" type="async-resource" benchmarker="autocannon": 18,407.2

As expected, AsyncLocal is significantly faster than AsyncContext, but slower than the implementation that stores values as resource object properties.

@vdeturckheim
Copy link
Member

vdeturckheim commented Dec 20, 2019

@puzpuzpuz is the call to destroy the main point of slowlyness in AsyncContext? As mentionned in the PR, it aims at being updated when @Qard 's PR is merged to not use it anymore.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 20, 2019

@puzpuzpuz is the call to destroyz the main point of slowlyness in AsyncContext`? As mentionned in the PR, it aims at being updated when @Qard 's PR is merged to not use it anymore.

Probably, but it's hard to tell for sure. Their public APIs are quite different and AsyncContext seems to be using more data structures and doing more stuff under the hood.

@vdeturckheim
Copy link
Member

vdeturckheim commented Dec 20, 2019

@puzpuzpuz actually this whole part would go away eventuallyhttps://github.com/nodejs/node/pull/26540/files#diff-0bb01a51b135a5f68d93540808bac801R201-R214
https://github.com/nodejs/node/pull/26540/files#diff-0bb01a51b135a5f68d93540808bac801R227-R229

In the end, the only data structure that would remain would be the map used for holding context. I would like to keep it as it is closer to the original CLS implementation that most APM vendors have copied over the years.

I am pretty opinionated on the instanciation of only one Async Hook for the whole process as I find it faster than having multiple ones.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 20, 2019

@vdeturckheim

@puzpuzpuz actually this whole part would go away eventuallyhttps://github.com/nodejs/node/pull/26540/files#diff-0bb01a51b135a5f68d93540808bac801R201-R214
https://github.com/nodejs/node/pull/26540/files#diff-0bb01a51b135a5f68d93540808bac801R227-R229

In the end, the only data structure that would remain would be the map used for holding context.

Yes, it will remove one Map internally and only one WeakMap + one Map + one Set will remain.

However, this change won't add copy-on-write semantics for AsyncContext. Do you happen to know how to implement a request id tracing (which is one of the most popular use cases for CLS APIs) once AsyncContext gets that change?

With AsyncLocal it's as simple as the following example: https://github.com/nodejs/node/pull/31016/files#diff-9a4649f1c3f167b0da2c95fc38953a1fR601

I would like to keep it as it is closer to the original CLS implementation that most APM vendors have copied over the years.

Nothing prevents APM vendors or 3rd-party library authors to build a thin-wrapper that provide a Map-like on top of an AsyncLocal. Such implementations will be simple and straightforward.

On the other hand, users who want to store a single value in CLS API won't have to pay for an additional Map.

I am pretty opinionated on the instanciation of only one Async Hook for the whole process as I find it fater than having multiple ones.

AsyncContext's init hook has a for loop, so it's still O(N), just as with N different AsyncLocals. AsyncLocal's init hook can be made global without any changes in the public API. But most of real world applications have a single AsyncLocal per the whole node process created by an APM agent or a request tracing library. So, keeping AsyncLocals isolated from each other seem to be more valuable to me.

@vdeturckheim
Copy link
Member

Yes, it will remove one Map internally and only one WeakMap + one Map + one Set will remain.

The WeakMap is already gone ;)

Do you happen to know how to implement a request id tracing (which is one of the most popular use cases for CLS APIs) once AsyncContext gets that change?

I am not sure I get the quesiton, I would simply do a run and add the id in the map under a named key

Nothing prevents APM vendors or 3rd-party library authors to build a thin-wrapper that provide a Map-like on top of an AsyncLocal. Such implementations will be simple and straightforward.

Fair point, but a few comments on the PR actually asked how close it is to current existing implementation. IMO this is a design goal to rely on something that is already used.

On the other hand, users who want to store a single value in CLS API won't have to pay for an additional Map.

I don't think this Map is expensive here.

AsyncContext's init hook has a for loop, so it's still O(N), just as with N different AsyncLocals. AsyncLocal's init hook can be made global without any changes in the public API. But most of real world applications have a single AsyncLocal per the whole node process created by an APM agent or a request tracing library. So, keeping AsyncLocals isolated from each other seem to be more valuable to me.

TBH, this choice is mostly based on nodejs/node#16222 (comment).

but this is probably more discussions to have in the respective PRs.

@puzpuzpuz
Copy link
Member Author

@vdeturckheim

The WeakMap is already gone ;)

Oh, I've missed that change.

So, it puts the Map into resources directly now. In that case, it won't be possible to get rid of AsyncContext completely (a la AsyncLocal.remove), except for restarting the node process.

I am not sure I get the quesiton, I would simply do a run and add the id in the map under a named key

A AsyncContext.run per each HTTP request then? Thus, plus one Map and one process.nextTick per each request. Doesn't sound like something that is good for performance.

Fair point, but a few comments on the PR actually asked how close it is to current existing implementation. IMO this is a design goal to rely on something that is already used.

As far as I understand, main goal for core APIs is to provide flexible (and sometimes low-level) APIs that can be used directly or used to build user-land libraries on top of them. But maybe this time it's an exception to the rule.

I don't think this Map is expensive here.

I believe that for core APIs, every data structure counts.

TBH, this choice is mostly based on nodejs/node#16222 (comment).

but this is probably more discussions to have in the respective PRs.

Not sure if that comment is applicable to CLS API. Anyway, it can be easily changed in both directions without any changes in the external behavior.

@Flarna
Copy link
Member

Flarna commented Dec 20, 2019

I would like to keep it as it is closer to the original CLS implementation that most APM vendors have copied over the years.

@vdeturckheim May I ask which APM vendor implemenations you refer to?

My search shows different implemenations:
New Relic: In before hook they set their context and in the after hook they restore the old one.

DataDog: In before hook they set their context by a sync enter call and in after hook the clear it.

Elastic: This is different, they do a lookup based on executionAsyncId() in their map.

OpenTelemetry: There a with() function is used to execut a function synchron after setting the context taken from a map using executionAsyncId() and restore the previous value

I'm working at dynatrace (unfortunatelly our agent is not open source so no link 😢) and we use mostly the same mechanism as NewRelic/DataDog.

There is cls-hooked which has an API which reminds me to AsyncStorage. I think this lib is usually used by application programmers but not APMs. APMs just share the idea and the mapping to async_hooks.


As there was some link above towards the ticket to use async_hooks in domains. Would it be possible to port Domains to use the proposed CLS system to have a proove of concept?

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 21, 2019

@Flarna

As there was some link above towards the ticket to use async_hooks in domains. Would it be possible to port Domains to use the proposed CLS system to have a proove of concept?

As far as I know, porting domains to AsyncLocal will require value change callback feature from CLS API. We already discussed this topic here: #345 (comment)

Nothing prevents from adding such advanced API into the proposed implementation, say, as a port of similar functionality from your PR (nodejs/node#27172). That could be easily done in future or in initial implementation - depends on feedback from async_hooks/diagnostics groups.

@Qard
Copy link
Member

Qard commented Dec 22, 2019

I think it should be a goal to have an API that can support multiple consumers and just work. So many times I've seen companies try to run multiple APM vendors in production and it crashes their app because they conflict with each other. It's very difficult coding defensively enough to not break when another APM vendor does something unexpected to the environment--a common thing I've seen is regular properties redefined with Object.defineProperty(...) and no longer being configurable.

@puzpuzpuz
Copy link
Member Author

@Qard

I think it should be a goal to have an API that can support multiple consumers and just work. So many times I've seen companies try to run multiple APM vendors in production and it crashes their app because they conflict with each other.

Thanks for bringing this important concern up. In the proposed implementation AsyncLocals are isolated from each other: they don't have any shared data structures and use their own hooks. But in general, it's very hard to protect from monkey patching of public built-in APIs.

@Qard
Copy link
Member

Qard commented Dec 23, 2019

I think each having their own hooks is actually a minor point against them because of the additional overhead. The less repetitive code paths we can introduce in shared logic the better, while still retaining functional isolation.

@puzpuzpuz
Copy link
Member Author

@Qard

I think each having their own hooks is actually a minor point against them because of the additional overhead.

Could you tell me what is the reason behind that overhead? Extra context switches between native code and JS or it's something different?

It's a trivial change to do, but before doing that I'd like to understand how critical it is.

@Qard
Copy link
Member

Qard commented Dec 23, 2019

Well, multiple hooks should be merged together on JS-side already, in async_hooks internals, but that's an implementation detail. This does however mean that there will be separate hook objects with separate associated event handlers, each with their own closing scope. Given that this is a very hot path, it would be best to minimize that as much as possible.

APM vendors typically want a unique object per-request, not shared globally, which means it should be able to register the hooks once but generate context objects out of those hooks many times. I've also sometimes seen the layered object approach where each async branch gets a new object with its prototype set to that of the parent tick--while that is a powerful design, it's also much more expensive and can lead to undesired scenarios like orphaned branches when activity is not directly in the async call path which leads to the request end. In my experience, the best approach overall in terms of performance and reliability is to be able to simply have an always-running background hook and the ability to at any arbitrary time start a context object which is passed down the async tree as-is.

@puzpuzpuz
Copy link
Member Author

puzpuzpuz commented Dec 23, 2019

@Qard
Thanks for the clarification. I'll introduce a global hook for AsyncLocals later today.

Update. Done that.

@vdeturckheim
Copy link
Member

Thanks @Qard for stepping into this!

@Flarna , I was not talking about implementation but about external API. the original CLS implementation exposes set(key, value) and get(key) that can be called on the result of getNamespace. That is what I had in mind.

Regarding the big picture, I have discussed with multiple TSC members and collaborators during the collaborator summit in Montréal. Everyone seemed to agree with the meed of a high level API to provide such features in core. It also seems that it has been agreed upon during a diagnostic summit (nodejs/node#26540 (comment) and nodejs/node#26540 (comment)).

At this point, I am afraid that we (and I am very responsible for it) end up bikeshedding too much in this current discussion. PR reviews should happen in PRs for better tracability.
Also, at the end of the day, I don't believe it really matters which PR will be merged: this will be an experimental feature, it can be updated or boken-changed a couple time over the next iterations based on feedback from real life.

@Flarna
Copy link
Member

Flarna commented Jan 4, 2020

@vdeturckheim Yes, this was discussed on diag summit last march - I participated there 😁. This was the main reason why I tried and reviewed your AsyncStorage proposal and additionally created a different approach in the hope to get a bigger community to further review and discuss.

Main issue with async hooks is that they are still experimental. On diag summit we assumed that this avoids that user space modules start to use AsyncResource which would be needed that CLS works as one would expect in case of e.g. pooling db connections, queueing,... Usually no one likes to depend on experimental features which may change at any time...

Getting AsyncHooks API out of experimental is hard for several reasons (they expose internals, easy to use wrong,....). The idea was to add a CLS API in core which doesn't expose internals and therefore is easier to get out of experimental and fits most use cases covered currently by async hooks. Async hooks could be used internal and stay experimental to cover the remaining use cases.

So yes, adding something is hopefully enough to reach the target that AsyncResource spreads - as soon as this is out of experimental. It was never my intention to somehow block your PR to add AsyncStorage.

Meanwhile my motivation to further work on CLS in core is not that high anymore. Partly because most of my input was ignored anyway 😢 but besides that I have the feeling there is no unique understanding on what a CLS should actually do.
The current focus seems to be on application programmers viewpoint but my domain is APM which is different so I can't help that much. As you can see in the discussion in the various PRs regarding this it looks like the requirements are even conflicting.

Besides adding CLS to core there were some more points regarding async hooks discussed on diag summit like use of resource instead id. Idea was to get rid of destroy hook - at least for the "main" use case CLS. The destroy hook is quite bad regarding performance in special if await is used a lot. Besides that it helps to avoid leaks. In case the PR from @Qard gets merged I have some follow-up tasks in mind to do there.

@naugtur
Copy link
Contributor

naugtur commented Feb 28, 2020

Now with all the mentioned PRs closed, where does this continue?

@mhdawson
Copy link
Member

mhdawson commented Mar 2, 2020

I think once we are comfortable we are not going to change the API in the short term we should open a new issue to collect input on the API and then do a blitz to promote/ask for feedback on the API pointing to that issue. Make sense?

@wentout
Copy link

wentout commented Mar 3, 2020

I think once we are comfortable we are not going to change the API in the short term we should open a new issue to collect input on the API and then do a blitz to promote/ask for feedback on the API pointing to that issue. Make sense?

Might be then, when CLS will be merged, there must be some relatied description, easily viewable from the Async Hooks part of Documentation... Then, when developer will find an issue, they would be informed where to ask and how to ask properly? (@vdeturckheim) (@puzpuzpuz) what do you think?

@puzpuzpuz
Copy link
Member Author

@wentout

Might be then, when CLS will be merged, there must be some relatied description, easily viewable from the Async Hooks part of Documentation...

Documentation already includes a section for AsyncLocalStorage: https://github.com/nodejs/node/blob/master/doc/api/async_hooks.md#class-asynclocalstorage

I think, this should be enough to point users in the right direction. Do you have any ideas on possible enhancements on your mind?

@wentout
Copy link

wentout commented Mar 3, 2020

@puzpuzpuz

Do you have any ideas on possible enhancements on your mind?

Sorry for misexplanation. I suppose to add note about place where exactly users can post an issue. For example, there could be quote for users:

If you will find an issue in this api, please open an issue for diagnostics group. You can find more related information here: #345

@mhdawson
Copy link
Member

mhdawson commented Mar 4, 2020

@wentout I think it might be that we would use:

"If you will find an issue in this api, please comment on issue #XYZ in the diagnostic repo. You can find more related information here: #345"

This is basically your suggestion, just that instead of new issues we ask for people to comment on the issue that we've opened to capture the input in advance.

@github-actions
Copy link

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants