
Improve debug logging performance #5441

Open
CharlieC3 opened this issue Nov 8, 2024 · 7 comments

@CharlieC3 (Member)

This flame graph taken by @kantai shows that debug logging has a large impact on node performance.

33% of the samples are in make_tenure_bitvector, and that time is spent almost entirely in the logging calls.

I've also observed that RPC endpoints respond much more quickly when the node does not have debug logging enabled; this can have a significant impact if a Stacks node with debug logs enabled is broadcasting to a signer or other downstream observer.

One idea from @hstove that could alleviate this is switching to async logging via slog_async. However, this would need to be researched and weighed against any other suggestions, as it may result in not getting critical logs in the event of a panic under the right conditions.
(Flame graph: dev-seed-nov1-1200C)
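To make the slog_async idea concrete, here is a minimal, generic sketch of what the crate does (this is not the stacks-core logger configuration; the setup and values shown are illustrative only): slog_async wraps an existing synchronous drain and moves formatting and I/O onto a background thread, so a log call only enqueues a record.

```rust
use slog::{debug, o, Drain, Logger};

fn main() {
    // Synchronous terminal drain, as in a typical slog setup.
    let decorator = slog_term::TermDecorator::new().build();
    let drain = slog_term::FullFormat::new(decorator).build().fuse();

    // slog_async enqueues records and writes them on a background thread,
    // so the caller's hot path no longer blocks on formatting or I/O.
    let drain = slog_async::Async::new(drain)
        .chan_size(4096) // bounded queue between callers and the writer thread
        .build()
        .fuse();

    let log = Logger::root(drain, o!());
    debug!(log, "computed tenure bitvector"; "len" => 100);

    // Caveat raised above: if the process panics before the background thread
    // drains its queue, the most recent records may never reach the output.
}
```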

@kantai (Member) commented Nov 8, 2024

One idea from @hstove that could alleviate this is switching to async logging via slog_async. However, this would need to be researched and weighed against any other suggestions, as it may result in not getting critical logs in the event of a panic under the right conditions.

Yep -- I think if we want to avoid getting aggressive about reducing the debug log volume, the logger will need to support an async mode.

I think if we're opening the "async logging" can of worms, though, it would probably be worthwhile to move to the tracing framework (https://github.com/tokio-rs/tracing). I know that tokio is controversial for some folks (@jcnelson for one), but tracing is a very well-supported logging system, and it seems to be one that many Rust projects are converging on (in fact, slog's README now asks users to strongly consider tracing instead: https://github.com/slog-rs/slog?tab=readme-ov-file#you-might-consider-using-tracing-instead).
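For reference, a minimal tracing setup looks roughly like this; it follows the crate's standard fmt-subscriber pattern, and the function name below is a made-up stand-in rather than code from the node:

```rust
use tracing::{debug, info};

// Hypothetical stand-in for a hot function that logs per-item details.
fn make_tenure_bitvector_example(num_tenures: usize) {
    // Structured key/value fields, similar in spirit to slog's `key => value`.
    debug!(num_tenures, "computing tenure bitvector");
}

fn main() {
    // Filtering is driven by RUST_LOG-style directives, e.g.
    // RUST_LOG=info,stacks_node::net=debug
    // (requires tracing-subscriber's "env-filter" feature).
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    info!("node starting");
    make_tenure_bitvector_example(100);
}
```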

@CharlieC3 (Member, Author)

As an aside, I am very interested in getting tracing support added to the Stacks node (and potentially the signer node) as well. If tokio's tracing lib were used to generate spans (as part of a separate issue), it may make sense to look into it for logging too, since that would likely be cleaner and easier than mixing libraries.

@jcnelson (Member)

It seems that 33% of the flame graph is just log output from InvGenerator::make_tenure_bitvector().

Perhaps for now, we can move the logging in that function's inner loop to the trace! level.
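Purely to illustrate the kind of demotion being suggested (the real InvGenerator::make_tenure_bitvector lives in stacks-core and uses the node's own logging macros; the function below is a made-up stand-in using the log crate's macros):

```rust
use log::{debug, trace};

// Hypothetical stand-in for the per-tenure loop: the point is just that the
// per-iteration line is demoted to trace!, leaving one debug! summary line.
fn make_tenure_bitvector_example(tenures: &[bool]) -> Vec<bool> {
    let mut bits = Vec::with_capacity(tenures.len());
    for (i, present) in tenures.iter().enumerate() {
        // Previously debug!: one line per tenure dominates the log volume.
        trace!("tenure {i} present = {present}");
        bits.push(*present);
    }
    debug!("computed tenure bitvector of length {}", bits.len());
    bits
}
```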

@kantai Yep -- I think if we want to avoid getting aggressive about reducing the debug log volume,

I'm 100% down to substantially reduce the debug log volume. Some of the things we log in the networking stack really need to be trace!, if nothing else.

Down the road, I'm in favor of a combination of @CharlieC3's request to have per-module or per-aspect logging, and my request to have an RPC interface for toggling logging, so we can selectively activate subsystem logging when diagnosing a problem.
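A hedged sketch of how per-module filtering plus runtime toggling might look if the tracing route were taken; the reload handle is tracing-subscriber API, but driving it from an RPC handler is hypothetical and not something the node has today:

```rust
use tracing_subscriber::{filter::EnvFilter, prelude::*, reload};

fn main() {
    // Start with a default per-module filter and keep the reload handle around.
    let (filter, handle) = reload::Layer::new(EnvFilter::new("info"));
    tracing_subscriber::registry()
        .with(filter)
        .with(tracing_subscriber::fmt::layer())
        .init();

    // A hypothetical RPC handler could later swap in a more verbose filter
    // for one subsystem while diagnosing a problem, e.g.:
    handle
        .reload(EnvFilter::new("info,stacks_node::net::inv=trace"))
        .expect("failed to swap log filter");
}
```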

@CharlieC3 as it may result in not getting critical logs in the event of a panic under the right conditions.

This right here is the reason I'm not in favor of asynchronous logging, regardless of the concurrency paradigm used. We want to see crashes or other debug-worthy behaviors reported as they happen, so we shouldn't introduce an arbitrary delay between when the offending logic executes and when the log data is emitted. That is, after all, why we log things.

saralab added this to the Nakamoto-3.0.x milestone Nov 11, 2024
saralab assigned jferrant and unassigned jcnelson Nov 11, 2024
@jcnelson (Member)

On the sprint call today, it came up that we should consider the possibility of using tracing instead of slog.

It is my understanding that the value proposition of tracing over slog is that it has multiple backends. My feedback is that this is irrelevant: ingesting log output into one or more backend services is not only well outside the node's responsibility, but has also been a solved problem since well before any of us were born, and it requires negligible work on our end. All the user needs to do is pipe the log output from the Stacks node into whatever service needs to ingest it. All we need to do is inform users of this, and perhaps provide an example or two if anyone asks.

Given the heterogeneity of node users, there isn't a one-size-fits-all solution to log ingestion. Home users have different needs from developers, who in turn have different needs from exchanges (and each exchange likely has its own internal tooling). It's not up to us to provide solutions for log ingestion, and trying to do so only makes our lives harder in the form of more object code to compile, more threads to trace, and more service tickets for incompatibility issues that may arise. Instead, it's the node's job to give users the log data in an ingestible format (e.g. lines of text, JSON objects over stderr, etc.), so it can be sent to the log-processing back-end.
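For illustration only, here is one way (among many) a process can emit machine-ingestible JSON log lines on stderr for an external collector to pick up; it uses tracing-subscriber's JSON formatter as an example and is not a statement about the node's actual output format:

```rust
use tracing::info;

fn main() {
    // Requires the "json" feature of tracing-subscriber.
    tracing_subscriber::fmt()
        .json()                       // one JSON object per log line
        .with_writer(std::io::stderr) // keep stdout clean for other output
        .init();

    info!(block_height = 12345u64, "processed block");
    // An external collector (journald, fluentd, vector, etc.) can then
    // consume the stderr stream without the node knowing or caring.
}
```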

@jcnelson (Member) commented Nov 11, 2024

If folks feel strongly that we have to help users with log ingestion, then that's all well and good. Let's create one or more log ingestion binaries into which the node's output can be piped. And, let's do so in a separate repository, and let's treat this as a "green field" out-of-scope project that can be taken on by 3rd parties.

@CharlieC3 (Member, Author) commented Nov 11, 2024

@jcnelson For posterity, do you recall which backends from the tracing lib were being discussed? If it was about backends that ship logs, like tracing-loki, then I have to agree. It's often safer, easier, and more performant to configure an external log collector to ship these logs rather than doing it in code.

@jcnelson (Member)

I don't think any backends in particular were being discussed -- I don't think it got that far. The conversation stopped at the conceptual level of why there are distinct frontends and backends in the tracing crate.
