All Sourcegraph components can easily raise runtime errors for debugging #33240
There's currently some Sentry work in the code related to this, but last I asked nobody uses Sentry for backend errors: https://sourcegraph.slack.com/archives/C02UC4WUX1Q/p1651506189865219
Thought: how are errors handled today? Do we even need a tool like Sentry? Can't error aggregation and stack trace dumping be done with logging, i.e. treating all error-level logs as serious events? (The new logging library dumps stack traces.) It would be nice to not need a tool like Sentry, since that simplifies the story for customers and seems aligned with how errors are handled today for backend services (with a log-first and/or trace-oriented approach). cc @jhchabran
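As a rough illustration of the "treat all error-level logs as serious events" idea, here is a minimal sketch using plain zap rather than Sourcegraph's lib/log wrapper (whose hook API isn't shown here); the reportError function is a placeholder, not a real API:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// reportError is a placeholder for whatever error-aggregation backend gets
// plugged in (Sentry, a metrics counter, an in-product error store, ...).
func reportError(e zapcore.Entry) {
	// e.Message, e.Caller and e.Stack carry enough context to aggregate on.
}

func newLogger() (*zap.Logger, error) {
	// zap.Hooks runs for every log entry; only Error and above are treated
	// as serious events worth forwarding.
	return zap.NewProduction(zap.Hooks(func(e zapcore.Entry) error {
		if e.Level >= zapcore.ErrorLevel {
			reportError(e)
		}
		return nil
	}))
}
```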
Joe mentioned this previous initiative around Sentry: https://github.com/orgs/sourcegraph/projects/139
I entirely agree: Sentry should be a system that is added on top of the existing code as a way to visualize collected errors, nothing else. Thanks for bringing up Joe's board, I wasn't aware of it.
Did a first exploration this afternoon and brought @burmudar along. I'll keep going tomorrow :)

First principles

Surfacing an error

```go
logger.With(log.Error(err)).Error("description") // log an error
logger.With(log.Error(err)).Warn("description")  // log a transient error, which can be ignored by Sentry but not by the metrics, for example
logger.With(log.Error(err)).Info("description")  // log but do not instrument the error
```

Instrumenting error rates

Following the aforementioned first principle of errors being a signal, we typically want to be able to monitor the error rate of a particular group of errors. This is tangential to the issue, but worth mentioning, I believe. As an example, say a service has an HTTP API and you want to monitor the error rate of that API. You add a middleware on your handlers (like httptrace.go) where you log the errors with a bucket attribute identifying the group; see the sketch below.
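For illustration only, here is a minimal sketch of what such a middleware could look like, written against plain net/http and zap rather than the actual lib/log API; the "bucket" attribute name, the middleware shape, and the 5xx heuristic are assumptions, not the real implementation:

```go
package httpapi

import (
	"net/http"

	"go.uber.org/zap"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// errorRateMiddleware logs server errors with an explicit "bucket" attribute
// so that error rates can be aggregated per group (e.g. "httpapi").
func errorRateMiddleware(logger *zap.Logger, bucket string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		if rec.status >= 500 {
			logger.With(
				zap.String("bucket", bucket), // hypothetical attribute name
				zap.Int("status", rec.status),
				zap.String("path", r.URL.Path),
			).Error("http handler failed")
		}
	})
}
```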
This does not prevent you from logging the errors earlier in the code, with a different bucket or no bucket at all (in which case the error is aggregated into the total error rate). This lets users mix explicit error reporting with an error handler that catches errors bubbling up. It solves the problem of a root error handler not being aware of context-specific metadata that is only available at the original call site, while still being able to compute error rates properly. We don't want users to have to think about whether their error gets logged further down the line.

Capturing the surfaced errors

While we're on Cloud we can use Sentry, but we can't really do that on other instance types. So by plugging the monitoring system onto the logs, like a middleware, we can send the errors to an external system such as Sentry, or use a different approach such as storing them somewhere so we can display them in the UI, for example (purely hypothetical; I mention it because it was brought forward that getting customers to send us their logs can sometimes take a while due to internal processes on their side).

```go
syncLogs := log.Init(log.Resource{
Name: "wjh",
Version: version.Version(),
InstanceID: hostname.Get(),
}, mySentryErrorMiddleware, myMetricsErrorMiddleware)
```

So we could pass additional options to log.Init to plug in these error middlewares (a hypothetical sketch of such a middleware follows below).

API approach

At this stage, it makes more sense to just do a first iteration with the primitives provided by the logging API rather than creating new dedicated helpers.

About the observation package

None of the above is incompatible with the observation package, and using it together with it could simply result in passing additional metadata.

Discarded considerations

Annotating errors

In the context of having a root error handler such as the one in httptrace.go, it may sound appealing to add helpers that annotate errors with metadata which the logger could then parse and log as attributes. This way, you could annotate your errors and just pass them up to get a nicely annotated log line. But that introduces complexity in the error type we use and in how to reconcile this when errors are wrapped. I believe it's much simpler to avoid overloading the errors themselves; if we really want some form of annotation, we could decide to expose the necessary helpers from cockroachdb-errors. At this stage, we're better off keeping it simple and learning from the usage patterns that we'll see emerging.

Using the error message as the bucket name

It's really easy to change an error message, and messing up the error metrics just because you fixed a typo is dangerous. An explicit bucket name is more robust and clearer in its intentions.
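Following up on the log.Init sketch above, here is a minimal, purely hypothetical shape for mySentryErrorMiddleware. The func(zapcore.Entry) error signature is an assumption about what a log middleware/hook could look like, not the actual lib/log API; only the sentry-go and zapcore calls are real library APIs:

```go
package main

import (
	"log"

	"github.com/getsentry/sentry-go"
	"go.uber.org/zap/zapcore"
)

// mySentryErrorMiddleware forwards error-level log entries to Sentry.
// The signature is assumed for illustration, not taken from lib/log.
func mySentryErrorMiddleware(entry zapcore.Entry) error {
	if entry.Level < zapcore.ErrorLevel {
		return nil // only error-level entries count as serious events
	}
	sentry.CaptureMessage(entry.Message)
	return nil
}

func main() {
	// Example DSN placeholder; a real deployment would use its own.
	if err := sentry.Init(sentry.ClientOptions{Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0"}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// The middleware would then be passed to log.Init as in the comment above.
	_ = mySentryErrorMiddleware
}
```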
We're done. At this point, it's just a matter of continuously polishing this to unlock more actionability. I think it'd be better to close the current issue and wrap the remaining work into a bet framed in terms of business impact, or to just finish those items while we're on cooldown. I'll give this some thought and circle back.
Problem to solve
Overall, we have a low signal-to-noise ratio that makes debugging tedious. We can take this opportunity, as a secondary objective, to improve our sensitive-data redaction in error reporting.
Related problem: due to the lack of a unified reporting mechanism, diagnosing issues creates a lot of communication overhead, which can take days in some cases when those communications are subject to the strict internal policies in place at some companies.
Measure of success
Solution summary
We will propose an RFC with a set of conventions for runtime errors, actively seeking feedback from the biggest consumers of errors. We're likely to include documentation tasks and content-production work to improve onboarding on the general topic of errors.
We will audit the codebase to better understand the current state and identify possible sources of debt, and we will propose changes to the error-related packages to make it easier to follow the above conventions and to pass additional data.
Artifacts:
What specific customers are we iterating on the problem and solution with?
Internal Sourcegraph developers and SREs.
Impact on use cases
This effort contributes to the company-wide effort to improve Observability.
Delivery plan
Tracked issues
@bobheadxi: 3.00d
Completed: 3.00d
@jhchabran
Completed
#35879, #36569, #37231