Integrate Structured Logging #3

CMCDragonkai · 2021-12-06T04:14:30Z

Specification

Structured logging is to use JSON on the STDERR logs. This exposes more of the data, and allows us to create a standard for the set of fields to be used.

There's no real official standard for structured logging. But journalctl and syslog defines some common fields. The fields can be similar to our formatting that we already use, but this time in a JSON format. Additional data fields can be encoded too.

We could put a json formatting in somewhere that provides the JSON serialisation.

This can be especially useful since we are also looking into adding error chaining, which we have done in a few places. But by creating our own @matrixai/error we can implement error chaining. MatrixAI/Polykey#304

Additional context

Integrate Error Chaining in a js-errors or @matrixai/errors package Polykey#304
https://www.structlog.org/en/stable/
https://github.com/uber-go/zap
Implementing BIP39 recovery, Unattended Bootstrap and Agent Start and Status Polykey#283 (comment) - where an error came up and no context was there, but all in the data which required manual serialisation

Tasks

Research different structured logging libraries
Compare with opentracing or tracing based data formats
Implement simple structured logging capabilities into js-logger

The text was updated successfully, but these errors were encountered:

CMCDragonkai · 2021-12-06T04:18:14Z

Need to consider a couple of things:

Common fields that should align with other structured logging mechanisms
The --format field used for output formatting can double as for logger formatting too
How does error chaining come into play here, and how should these be shown?
How does stack traces come into play here? Stack traces don't have sensitive information, they are just traces of what functions called what functions.

CMCDragonkai · 2021-12-15T02:23:07Z

So basically the problem is this:

Does --format apply to just STDOUT or also to STDERR? I believe it should apply to both. Especially as STDERR will likely be relied on upon quite a bit by pk agent start. So doing something like pk agent start --format=json --verbose will then ensure that INFO messages get logged out to STDERR by with JSON structure rather than human readable messages. This will be important for any kind of long-term debugging of the PK agent.
If --format applies to STDERR as well, this then affects the PK CLI calls which would mean any exception plus INFO messages would also become JSON. This is OK as we can expect that once --format=json is used, we would actually want this to be worked by a computer.

Therefore any usage by a computer should always use --format=json. It would affect exception output on STDERR, logging messages to STDERR, and output to STDOUT.

CMCDragonkai · 2021-12-15T02:31:34Z

One additional issue is the "level of verbosity". This is basically controlled by --verbose. But when we output an exception to STDERR, we don't currently show the stack trace atm. The root exception handler by the PK agent and in PK CLI calls don't show the stack trace. The stack trace is useful for tracing where the error comes from.

Right now verbosity only controls INFO, WARNING... etc. We could do something where one chooses to use logger.debug to show the stacktrace as well if the logger level is appropriate. But this requires a conditional check on the logger's effective level.

Something like:

if (this.logger.getEffectiveLevel() > ...) {
  this.logger.debug(e, ...); // include stack trace
} else {
  this.logger.info(e, ...); // don't include stack trace
}

A similar situation would apply to errors being thrown over the GRPC boundary as discussed here MatrixAI/Polykey#249.

I reckon something can be automated if we integrate the design of js-logger with js-errors library here MatrixAI/Polykey#304, such that when an error is being logged out, we gain additional context depending on what the log level is. Of course what errors is passed over GRPC also depends, some errors we only want to write to the agent's STDERR, other errors we want to propagate to the GRPC client, and in other cases we would want to "map" agent-local errors to GRPC API errors.

This would mean something like catch a local agent error, write that out to STDERR on the agent on the RPC handler, then create a new error to be propagated to the GRPC client. We have to separate client errors from server errors like in HTTP where there is 4xx errors vs 5xx errors. 5xx errors would be caught by the exit handler, and put to STDERR. 4xx errors are not, they would be emitted by the RPC handler. We would only log them out to STDERR if the RPC handler handled the error, and also that this is a problem for the agent itself.

CMCDragonkai · 2022-04-28T03:31:41Z

Node v16 should be reporting the error chain nicely now.

CMCDragonkai · 2022-04-28T03:32:31Z

@emmacasolin FYI this issue addresses the idea of how the exceptions (including the error chaining) will eventually be displayed on the CLI or STDERR logging.

CMCDragonkai · 2022-05-09T03:14:32Z

Note that the cause property is actually an ES2022 feature, but nodev16 does support it, but because we haven't fully updated our standards to it, we will forget that it actually exists on the standard errors.

CMCDragonkai · 2022-05-09T03:15:23Z

Any kind of structured logging will be intended for machine consumption and filtering. That's because any logging of server errors may contain a long stack string, and that information is likely to be quite noisy on basic log readers.

CMCDragonkai · 2022-05-23T06:19:36Z

@emmacasolin this would still be needed to ensure that --format json affects STDERR logging.

CMCDragonkai · 2022-06-17T08:25:27Z

Some thoughts...

We've been hitting various performance and other network buggy issues which are hard to debug because of concurrency and that asynchronous work are happening out of order.

I think we need to improve our structured logging to provide a better observability into the runtime of our program.

To do this, I'm thinking that we need to incorporate some new ideas:

use perf_hooks to get proper timing data: https://blog.logrocket.com/experimental-node-js-testing-the-new-performance-hooks-31fcdd2a747e/
use async_hooks to hook into Node's asynchronous contexts to understand when something asynchronous is running, and the architecture of them together
We should include some new formatting symbols like trace from Integrate optional trace in order to easily identify where logs are being outputted #4 and also time stamp symbols too
We need a log viewer/trace viewer that we can use locally (possibly in the browser, possibly in the terminal) that can understand the structured logs and give us context:
Get some ideas of prior art with respect to opentelemetry and related structured logging systems:
- https://github.com/tokio-rs/tracing
- https://github.com/tersesystems/blindsight
JSON vs Structure logging:
- https://github.com/hynek/structlog (I believe human readable is structured logging)
- https://github.com/sirupsen/logrus
Right now we have a different way of reporting exceptions/errors vs our logging, that is when exceptions are reported we are printing them a certain way in our CLI. But we should also be able to make use of loggers to log out the same exceptions. Perhaps we should be unifying something between js-logger and js-errors
Log viewers - it seems all terminal log viewers only view linear logs https://github.com/tstack/lnav, but you do have browser based viewers that can provide more useful trace understanding: https://github.com/openzipkin/zipkin and https://github.com/jaegertracing/jaeger. However, there are terminal and related tools that visualise via a flamegraph, which usually uses the profiler, instead of distributed trace formats. However this is basically the samething, but nobody has built it.

Point is flamegraph was only invented by Brendan Gregg back in 2013/2014. But the same idea can be applied to view an entire trace of the entire distributed system. But also works for a local system too. But the tooling isn't really solid though and seems heavyweight for local debugging. So something inbetween might be useful.

CMCDragonkai · 2022-06-20T04:36:31Z

Some clarification on terminology.

There's a "standard" for structured logging as exemplified by https://www.structlog.org/.

In that structure, it is not using JSON, but using a human readable structure.

This issue is more about JSON as a structured logging format.

I believe the human readable structure as demonstrated by structlog is a good idea, and we can move our logging format to fit that format eventually. But JSON logging is simply more powerful for any machine analysis of logs.

However atm, I'm not sure what the deal is with information like trace #4, and how structlog would handle information like that. It's not human readable to have information like that on one single line (which is the default for most loggers), it's simply too much information. And if we want to preserve human readability, newlines/tabs/columns on the terminal must be used. The question is, is it really possible to enable easy machine parseability while still using whitespace as a formatting operator for human readability on the terminal? Maybe the goals are in conflict.

Some more info here: hynek/structlog#407

I think we should go as far as we can to maintain structured logging for both human readability AND machine parseability.

In #4 I suggested that if we ever do need to use literal newlines or whitespace operator for formatting, it must always be done with some indentation. That way, one can say that any newline that has ahead of it, is still a continuation of the same log message.

But it's a bit more complicated than that. Because you need a "lookbehind" to say that whatever is before the must not also be a \n.

So this should be fine as that terminates the message:

ABC \n

But this should not be considered an end to the message

ABC \n
  ABC\n

But this would get confused with:

ABC \n
  ABC \n

Where it is 2 messages.

So if there are spaces after \n, then that must be joined up with the first message.

Then machines should be able to parse individual messages.

CMCDragonkai · 2022-06-20T04:39:20Z

However do note that such usage of newlines are unlikely to be parseable easily, since it depends on logging analysers that know that this can occur. I reckon that it should be possible to code a custom parser for lnav that understands such a formatter though.

However it looks like it still isn't entirely a solved problem: tstack/lnav#410 (multi-line matching).

This further shows that any usage of multiple lines for a log can only really be usable on the terminal and custom log parsers.

But at that point the JSON structure should be used instead.

CMCDragonkai · 2022-06-20T04:39:41Z

@tegefaulkes @emmacasolin follow this for more structured logging research.

CMCDragonkai · 2022-06-20T04:42:09Z

Also I noticed that structlog talks about "context", that almost sounds like opentelemetry's idea of spans.

My opinion is that "context" can be done by hooking into async_hooks, however that won't work for distributed spans, which goes across process contexts, so at some point it has to be done through decorators.

It does seem that decorators would be the most expressive way of doing this, as this is basically a sort of AOP (aspect oriented programming) cross-cutting concern. We would want to avoid having to add lots of line-noise, but still be able to drop down to imperative logging for flexible control in edge cases where a decorator just doesn't work.

CMCDragonkai · 2022-06-23T02:40:25Z

Because PK is a decentralised system, we do have calls that propagate from 1 agent to another agent. Debugging this is crucial, but not necessarily possible in production unless we control all the parties to the interaction. Right now there's a bug in the NAT testing PR where signalling involves agent 1 asking the seed node to ask agent 2 to reverse hole punch. If there's a delay anywhere here, we end up with timeout errors.

CMCDragonkai · 2022-06-24T13:28:26Z

Structured graph visualisation with contexts/spans/traces. I think we take a page from the git graph, and use things like: https://github.com/nicoespeon/gitgraph.js.

Along with the ink library: https://github.com/vadimdemedes/ink we may even extend the react components to eventually the GUI too show a trace on what is occurring in the system.

CMCDragonkai · 2022-07-18T11:48:47Z

To solve this issue, I'm just going to incorporate json as formatting style. In terms of the actual data being logged, it'd be similar to what is already available under our formatting. I might incorporate similar keywords to what is already available to align. But the contexts and spans would have to be done in a separate issue.

A JSON format would be one step towards producing traces JSON data that can then be analysed and visualised. There is a relationship between this and "state" simulacra, but somewhat independent. State simulacra is generally specific to the state being modelled, while tracing is very generic and applies to any contextful based systems.

CMCDragonkai · 2022-07-21T03:47:32Z

@emmacasolin please change up PK to start using the formatting.jsonFormatter when --format=json is set on the CLI as per #3 (comment).

CMCDragonkai added the development Standard development label Dec 6, 2021

CMCDragonkai mentioned this issue Dec 15, 2021

Adding QA jobs and expanding test jobs for CI/CD. MatrixAI/Polykey#292

Merged

14 tasks

CMCDragonkai mentioned this issue Dec 15, 2021

Integrate Error Chaining in a js-errors or @matrixai/errors package MatrixAI/Polykey#304

Closed

12 tasks

CMCDragonkai mentioned this issue May 18, 2022

Integrate optional trace in order to easily identify where logs are being outputted #4

Closed

CMCDragonkai mentioned this issue Aug 11, 2023

Control the error reporting verbosity with --verbosity flags MatrixAI/Polykey-CLI#17

Open

5 tasks

CMCDragonkai mentioned this issue Jun 22, 2022

Drilldown noisy logs #7

Closed

CMCDragonkai mentioned this issue Jun 24, 2022

ci: merge staging to master MatrixAI/Polykey#379

Merged

10 tasks

CMCDragonkai self-assigned this Jul 18, 2022

CMCDragonkai mentioned this issue Jul 18, 2022

Structured JSON Logging #14

Merged

15 tasks

CMCDragonkai closed this as completed in #14 Jul 20, 2022

CMCDragonkai mentioned this issue Jul 21, 2022

Integrate Tracing (derived from OpenTelemetry) #15

Open

CMCDragonkai added the r&d:polykey:supporting activity Supporting core activity label Jul 10, 2023

CMCDragonkai mentioned this issue Oct 17, 2024

Structured formatting of errors when converted to string MatrixAI/js-errors#21

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate Structured Logging #3

Integrate Structured Logging #3

CMCDragonkai commented Dec 6, 2021 •

edited

Loading

CMCDragonkai commented Dec 6, 2021

CMCDragonkai commented Dec 15, 2021

CMCDragonkai commented Dec 15, 2021

CMCDragonkai commented Apr 28, 2022

CMCDragonkai commented Apr 28, 2022

CMCDragonkai commented May 9, 2022

CMCDragonkai commented May 9, 2022

CMCDragonkai commented May 23, 2022

CMCDragonkai commented Jun 17, 2022 •

edited

Loading

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 23, 2022

CMCDragonkai commented Jun 24, 2022

CMCDragonkai commented Jul 18, 2022

CMCDragonkai commented Jul 21, 2022

Integrate Structured Logging #3

Integrate Structured Logging #3

Comments

CMCDragonkai commented Dec 6, 2021 • edited Loading

Specification

Additional context

Tasks

CMCDragonkai commented Dec 6, 2021

CMCDragonkai commented Dec 15, 2021

CMCDragonkai commented Dec 15, 2021

CMCDragonkai commented Apr 28, 2022

CMCDragonkai commented Apr 28, 2022

CMCDragonkai commented May 9, 2022

CMCDragonkai commented May 9, 2022

CMCDragonkai commented May 23, 2022

CMCDragonkai commented Jun 17, 2022 • edited Loading

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 20, 2022

CMCDragonkai commented Jun 23, 2022

CMCDragonkai commented Jun 24, 2022

CMCDragonkai commented Jul 18, 2022

CMCDragonkai commented Jul 21, 2022

CMCDragonkai commented Dec 6, 2021 •

edited

Loading

CMCDragonkai commented Jun 17, 2022 •

edited

Loading