Stabilize structured logging #3359
Today: we don't really have structured logging today.
Future:
Open questions:
To make this even more concrete, I see a bug like #3603 as being squarely within the scope of this issue.
This is a great thread! structlog looks fantastic, and it would be a really good idea to move away from logbook -- do we think we can do that pre-1.0? Here's my braindump on structured logging:

Format of JSON log lines
This log line contains only a fragment of information related to the node it describes. I think that a better approach to structured logging would have contracted fields in each log line. These fields would represent the state of a specific entity. This would be simpler for clients to consume than a kitchen-sink style "extra" field.

Log/event names

The "message" in each log line is not semantically useful. The example above has a message like
Instead, it would be a lot better if we decoupled these "structured" log lines from typical debug/info-level logs that are intended for humans to read. Each structured log could have a semantically useful name (or message).

Node information completeness

Today, structured logs contain partial updates about a given node. In the example above, the log line shows when the node finished, but the information about when that node started is in a different log line.
This would make it possible for a consumer of the structured logs to know the state of a node at any point in time without needing to consume the entirety of a potentially very long log file.

Note: I am kind of on the fence about this one. I think it's probably a bad idea to assume that a client could "know" everything about a dbt run without consuming all of the structured logs for that run. Instead, I think that more complete structured logging would be good ergonomics for clients -- it would reduce the amount of "state building" that would need to happen on the client, and ultimately would give us more flexibility around how & when we log node information inside of dbt Core.

Completeness

There's a lot happening inside of dbt that is not emitted in structured logs today. There are some potentially slow actions, like building the adapter cache, that are not logged in such a way that clients can render progress information to a user. We should consider new types of structured logging around project parsing, run startup, running hooks (both on-run-start/end and pre/post-hooks), etc., that communicate the exact status of dbt Core as it operates on projects.

Materialization-based logging

It would be really good if logs could be emitted inside of materializations. This could help us communicate what's happening when a given model is running. Quick example of state transitions:
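As a rough illustration, state transitions for a single model might be emitted as structured events like the following; the event names and fields here are hypothetical, not actual dbt Core event types.

```python
# Hypothetical state-transition events for one model, one JSON object per line.
# Event names and fields are illustrative only, not real dbt Core event types.
import json
from datetime import datetime, timezone

def emit(name: str, **fields) -> None:
    """Write one structured event as a JSON line on stdout."""
    record = {"name": name, "ts": datetime.now(timezone.utc).isoformat(), **fields}
    print(json.dumps(record))

emit("model-compile-started", unique_id="model.my_project.my_model")
emit("model-build-started", unique_id="model.my_project.my_model", materialization="table")
emit("model-build-finished", unique_id="model.my_project.my_model", status="success")
```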
This would totally be a stretch goal, but I think the idea of materializations being less imperative and more "planned out" is probably well-aligned with our goals for model building into the future :)

That's it! Those are all of the thoughts in my head around structured logging. Lmk if there's anything else I can shine a light on!
I spoke with @iknox-fa about a plan of action to implement this. The way we understand it, structured logging in dbt will be a formal contract between dbt and several downstream stakeholders, who will rely on it programmatically. Because it is a user-facing contract, we want to code for the correctness concerns it raises, and make sure that all the use cases for it are actually good candidates to be solved by logging. In general, we believe we can reasonably implement a future-ready structured logging solution in dbt that satisfies our needs today, with the ability to move some use cases to a more scalable approach when needed.

Requirements
Questions
Concerns
Because of the scalability concerns with logging, specifically regarding IO and concurrent file access, we want to distinguish use cases that are well suited to logging, where the intent is to store information and maybe read it later (such as support), from use cases that need to read information as it is written for immediate use (such as displaying progress bars). These cases for immediate use are better suited to a server-based API and will likely need to be implemented as such to scale appropriately.

Implementation Proposal

Structured logging is a user-facing contract, and as such it should be versioned. Parsers can use this version to determine whether they understand the format written. It is on dbt developers to ensure that the version changes when structured logging changes.
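As a rough illustration of what versioned, typed events could look like, here is a minimal sketch; all of the names (LOG_VERSION, NodeFinished, fire_event) are hypothetical, not the eventual dbt Core API.

```python
# Illustrative only: typed, versioned events. The version travels with every
# serialized line so parsers can decide whether they understand the format.
import json
from dataclasses import asdict, dataclass
from typing import Union

LOG_VERSION = 1  # dbt developers bump this whenever the contract changes

@dataclass(frozen=True)
class NodeFinished:
    unique_id: str
    status: str
    execution_time: float

@dataclass(frozen=True)
class CacheBuildStarted:
    database: str

Event = Union[NodeFinished, CacheBuildStarted]

def serialize(event: Event) -> str:
    return json.dumps({
        "log_version": LOG_VERSION,
        "name": type(event).__name__,
        "data": asdict(event),
    })

def fire_event(event: Event) -> None:
    # Call sites construct typed events; mypy flags missing or misspelled fields.
    print(serialize(event))
```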
This approach has the advantage that mypy will be able to do the vast majority of the correctness work for us without relying on sufficient test coverage. Test coverage will still be required to make sure that the final JSON output matches what is expected for each event type, but that is easy to write unit tests for. A drawback is that this approach locks down call sites, so it might be less obvious to a developer how to trigger a log event.

Possibilities for the Future
@nathaniel-may love the thinking here!
I think current use-cases fall into the following buckets:
Tell me if I'm missing the mark in terms of "use cases" -- happy to follow up with different types of info if that would be more helpful!
If I were to break it down further, I'd say:
This is not in scope for dbt Core, but may be implemented by processes that wrap dbt Core in the future. Ideally, one of these processes could tail Core structured logs and disseminate these logs via pub/sub if desired
Additionally, structured log output is inconsistent. It is challenging for consumers of structured logs to disambiguate end states for node executions. Examples:
Ideally, there would be consistent flags/enums for success/skip/error conditions. This would simplify logic for consumers of these structured logs (see the sketch after this comment).

Think I hit on everything, but very happy to expand on any and all of this! Let me know what you think :)
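As an illustration, a consistent end-state enum might look like the following; the enum and its members are hypothetical, not dbt Core's actual status values.

```python
# Illustrative only: a single enum for node end states, so consumers read one
# well-known field instead of disambiguating several inconsistent ones.
from enum import Enum

class NodeStatus(str, Enum):
    SUCCESS = "success"
    ERROR = "error"
    SKIPPED = "skipped"

def end_state(log_line: dict) -> NodeStatus:
    """Hypothetical consumer-side helper reading a contracted 'status' field."""
    return NodeStatus(log_line["status"])
```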
Thanks for the feedback @drewbanin! This helps a lot. Regarding the "programmatic log processing" use case, it seems like the requirement is for Core to support a dbt client in reporting the state of a run to a user in real time. I don't think writing these events to a log file and then reading them back is actually the right way to achieve this goal, so I want to make sure that even if our initial solution is to stream logs, we have built the necessary plumbing to do it the right way when we need it. The type modeling for events on top of the logger in this proposal would allow us to send events to any number of destinations (potentially an event history in memory for clients to stream via a server layer) independent of the configured log level. If you think all of these events will always need to be logged, that's something I'll need some clarification around, since I'll want to guard against IO contention as best I can.
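To illustrate the "any number of destinations" idea, here's a minimal sketch (all names are hypothetical) in which events fan out to independent sinks, so an in-memory history for streaming doesn't depend on the file logger's level or IO path.

```python
# Illustrative only: events fan out to independent sinks. The file sink filters
# by level, while an in-memory history keeps everything for a streaming client.
import json
from collections import deque
from typing import Callable, Deque, List

Event = dict  # stand-in for the typed event classes discussed above
Sink = Callable[[Event], None]

history: Deque[Event] = deque(maxlen=10_000)  # bounded in-memory event history

def file_sink(event: Event) -> None:
    if event.get("level") != "debug":  # per-sink filtering, not global
        with open("dbt.log.jsonl", "a") as f:
            f.write(json.dumps(event) + "\n")

sinks: List[Sink] = [history.append, file_sink]

def fire_event(event: Event) -> None:
    for sink in sinks:
        sink(event)

fire_event({"name": "node-finished", "level": "info", "unique_id": "model.my_project.my_model"})
```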
What do you think the right way to achieve this goal is?
Generally, when streaming status updates to a client, I'd reach for a server endpoint: either polling or HTTP/2 streams. It handles concurrency better, doesn't rely on synchronous reads and writes to a file on disk, and can be flexible about the information returned to minimize network load. Some off-the-shelf frameworks like GraphQL provide some of this for you, because they're meant for this exact kind of pattern.
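A minimal sketch of the polling flavor, using only the standard library; the endpoint shape and the in-memory EVENTS list are illustrative assumptions, not anything dbt provides.

```python
# Illustrative only: a tiny polling endpoint serving events accumulated in
# memory, so a client can ask "what happened since sequence N?" without
# reading a log file on disk.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, List
from urllib.parse import parse_qs, urlparse

EVENTS: List[Dict] = []  # appended to by the process that runs dbt

class EventHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        query = parse_qs(urlparse(self.path).query)
        since = int(query.get("since", ["0"])[0])
        body = json.dumps({"next": len(EVENTS), "events": EVENTS[since:]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), EventHandler).serve_forever()
```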
ok - i buy that as an overall strategy! I think my point of contention is that I don't think dbt Core should actually do anything like that -- it should instead provide an API for letting a wrapping process tap into the logs and then implement polling / http2 streams / websockets / etc. This is actually a big challenge today that @kwigley and I have been discussing recently -- dbt Core's logger framework makes it very hard for a process that imports dbt to tap into the log stream. You can do it, but jankness abounds! My preference would be to:
Here's an example of the kind of jankness that's necessary today (more or less):

import logging

import logbook

import dbt
import dbt.logger as dbt_logger
GLOBAL_LOGGER = logging.getLogger(__name__)
GLOBAL_LOGGER.setLevel(logging.DEBUG)
stdout = logging.StreamHandler()
stdout.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stdout.setFormatter(formatter)
GLOBAL_LOGGER.addHandler(stdout)
logger = GLOBAL_LOGGER
json_formatter = dbt_logger.JsonFormatter(
format_string=dbt_logger.STDOUT_LOG_FORMAT
)
# This could be any kind of log handler -- file, memory, or otherwise
logs_redirect_handler = logbook.FileHandler(
filename=self.log_path,
level=logbook.DEBUG,
bubble=True,
filter=lambda x: x.channel.split(".")[0] == "dbt"
)
# If you don't do this, then Core logs to a file itself
dbt_logger.log_manager.set_path(None)
# Reach into dbt lib and mutate the log managers handler list!! This is bad
dbt_logger.log_manager.objects = [
handler
for handler in dbt_logger.log_manager.objects
if type(handler) is not logbook.NullHandler
]
# Create a new nested logging setup
handlers = [logbook.NullHandler(), logs_redirect_handler]
self.log_context = logbook.NestedSetup(handlers)
self.log_context.push_application()
### This is where you can actually invoke dbt and capture logs
dbt.do_the_thing()
### Pop the log handlers
self.log_context.pop_application()

A better approach IMO looks like:

import logbook

import dbt.logger as dbt_logger
# create whatever kind of log handler you want
logs_redirect_handler = logbook.FileHandler(
filename=self.log_path,
level=logbook.DEBUG,
bubble=True,
filter=lambda x: x.channel.split(".")[0] == "dbt"
)
# tell dbt to exclusively write logs to our handler, not its own internal handlers
dbt_logger.override_log_handler(logs_redirect_handler)

Again - the FileHandler is just an example - this could alternatively be a StreamHandler that buffers logs in memory and could be connected up to a polling service or HTTP/2 streams :)
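For the in-memory flavor, a sketch might look like this; override_log_handler is still the proposed, not-yet-existing API from the snippet above.

```python
import io

import logbook

import dbt.logger as dbt_logger

# Buffer dbt's logs in memory instead of writing them to a file; a polling
# service or HTTP/2 stream could then drain the buffer and push updates out.
log_buffer = io.StringIO()
memory_handler = logbook.StreamHandler(log_buffer, level=logbook.DEBUG, bubble=True)

# Proposed API from the comment above -- it does not exist in dbt Core today.
dbt_logger.override_log_handler(memory_handler)
```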
Ok great! It sounds like we agree on all the big points, and I'm glad we double checked. The confusion I had was around the terminology "logging", which to me implies that logs are going to some persistent storage to be read at an unknown future time. I would call this something like "event streaming", but I don't care to get pedantic about what we should call things here. The way I'm thinking about this, we should be able to mix and match log events and progress events in terms of which destination(s) and level of detail we want. If you have any thoughts on what kinds of flexibility should be built in from the start, I'd like to hear them. But otherwise I think we're on the same page now. Thanks for the detailed feedback, @drewbanin!
woohoo! ok, fantastic. I think my only thoughts re: flexibility are that it will be really hard to change / remove log events in the future. We might need to think hard about the contracts we're committing to up-front. Once people start triggering alerts in datadog, or building clients that consume these logs... we sort of lose the ability to mess with those contracts too much :) That said, we can always add things in the future, so my preference would be to start with the existing set of semantic logs that we publish. Those are (AFAIK) the places where we subclass
I totally agree with this! That's why I want to version the events separately from dbt, even if it's not something we want users to interact with. For example, MongoDB versions their replication protocol, but you never really have to know that.
Hey all, this looks like a wonderful feature that we'd love to use in the OpenLineage dbt integration - especially getting live updates on node state transitions. One question: will we get (some) standardized adapter responses the same way? We'd like to report information like row count and byte size written, which we now get from
@mobuchowski Glad to hear this will be of help! Adapter responses will continue to live on the |
dbt can output structured JSON logs via:
- dbt --log-format json (CLI)
- the logs parameter (RPC)

These logs include a lot of information similar to dbt's JSON artifacts, with the crucial difference being that logs are written/returned in real time, whereas JSON artifacts are only written at the end of an invocation.
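For example, a downstream consumer might parse these JSON log lines straight from a dbt invocation; a minimal sketch (the keys available on each line are not contracted, which is exactly the problem this issue describes):

```python
# Minimal sketch: run dbt with JSON-formatted logs and parse each line.
# The fields available on each line are not (yet) a stable contract.
import json
import subprocess

proc = subprocess.run(
    ["dbt", "--log-format", "json", "run"],
    capture_output=True,
    text=True,
)
for line in proc.stdout.splitlines():
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON output
    print(event.get("message", ""))
```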
As of v0.19, JSON artifacts are contracted and documented, but JSON logs are not contracted and barely documented. We should change this ahead of v1.0. There are other technologies and community members who want to build on top of stable logs (e.g. dbt Cloud IDE, #2915).
As I see it, the scope of this issue is to: