All Sourcegraph components can easily raise runtime errors for debugging #33240
There's currently some Sentry work in the code related to this, but last I asked nobody uses Sentry for backend errors: https://sourcegraph.slack.com/archives/C02UC4WUX1Q/p1651506189865219
Thought: how are errors handled today? Do we even need a tool like Sentry? Can't error aggregation and stack trace dumping be done with logging, i.e. treating all error-level logs as serious events? (The new logging library dumps stack traces.) It would be nice to not need a tool like Sentry, since that simplifies the story for customers and seems aligned with how errors are handled today for backend services (with a log-first and/or trace-oriented approach). cc @jhchabran
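As a rough illustration of the "treat all error-level logs as serious events" idea, here is a minimal sketch using plain zap rather than Sourcegraph's lib/log wrapper (whose hook API isn't shown here); the reportError function is a placeholder, not a real API:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// reportError is a placeholder for whatever error-aggregation backend gets
// plugged in (Sentry, a metrics counter, an in-product error store, ...).
func reportError(e zapcore.Entry) {
	// e.Message, e.Caller and e.Stack carry enough context to aggregate on.
}

func newLogger() (*zap.Logger, error) {
	// zap.Hooks runs for every log entry; only Error and above are treated
	// as serious events worth forwarding.
	return zap.NewProduction(zap.Hooks(func(e zapcore.Entry) error {
		if e.Level >= zapcore.ErrorLevel {
			reportError(e)
		}
		return nil
	}))
}
```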
Joe mentioned this previous initiative around Sentry: https://github.com/orgs/sourcegraph/projects/139
I entirely agree: Sentry should be a system that is added on top of the existing code as a way to visualize collected errors, nothing else. Thanks for bringing up Joe's board, I wasn't aware of it.
Did a first exploration this afternoon and brought @burmudar along. I'll keep going tomorrow :)

First principles

Surfacing an error

```go
logger.With(log.Error(err)).Error("description") // log an error
logger.With(log.Error(err)).Warn("description")  // log a transient error, which can be ignored by Sentry but not by the metrics, for example
logger.With(log.Error(err)).Info("description")  // log but do not instrument the error
```

Instrumenting error rates

Following the aforementioned first principle of errors being a signal, we typically want to be able to monitor the error rate of a particular group of errors. This is tangential to the issue, but worth mentioning, I believe. As an example, say a service has an HTTP API and you want to monitor the error rate of that API. You add a middleware on your handlers (like httptrace.go) where you log the errors with a bucket attribute identifying the group; see the sketch below.
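For illustration only, here is a minimal sketch of what such a middleware could look like, written against plain net/http and zap rather than the actual lib/log API; the "bucket" attribute name, the middleware shape, and the 5xx heuristic are assumptions, not the real implementation:

```go
package httpapi

import (
	"net/http"

	"go.uber.org/zap"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// errorRateMiddleware logs server errors with an explicit "bucket" attribute
// so that error rates can be aggregated per group (e.g. "httpapi").
func errorRateMiddleware(logger *zap.Logger, bucket string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		if rec.status >= 500 {
			logger.With(
				zap.String("bucket", bucket), // hypothetical attribute name
				zap.Int("status", rec.status),
				zap.String("path", r.URL.Path),
			).Error("http handler failed")
		}
	})
}
```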
This does not prevent you from logging the errors earlier in the code, with a different bucket or no bucket at all (in which case the error is aggregated into the total error rate). This lets users mix explicit error reporting with an error handler that catches errors bubbling up. It solves the problem of a root error handler not being aware of context-specific metadata that is only available at the original call site, while still being able to compute error rates properly. We don't want users to have to think about whether their error gets logged further down the line.

Capturing the surfaced errors

While we're on Cloud we can use Sentry, but we can't really do that on other instance types. So by plugging the monitoring system onto the logs, like a middleware, we can send the errors to an external system such as Sentry, or use a different approach such as storing them somewhere so we can display them in the UI, for example (purely hypothetical; I mention it because it was brought forward that getting customers to send us their logs can sometimes take a while due to internal processes on their side).

```go
syncLogs := log.Init(log.Resource{
Name: "wjh",
Version: version.Version(),
InstanceID: hostname.Get(),
}, mySentryErrorMiddleware, myMetricsErrorMiddleware)
```

So we could pass additional options to log.Init to plug in these error middlewares (a hypothetical sketch of such a middleware follows below).

API approach

At this stage, it makes more sense to just do a first iteration with the primitives provided by the logging API rather than creating new dedicated helpers.

About the observation package

None of the above is incompatible with the observation package, and using it together with it could simply result in passing additional metadata.

Discarded considerations

Annotating errors

In the context of having a root error handler such as the one in httptrace.go, it may sound appealing to add helpers that annotate errors with metadata which the logger could then parse and log as attributes. This way, you could annotate your errors and just pass them up to get a nicely annotated log line. But that introduces complexity in the error type we use and in how to reconcile this when errors are wrapped. I believe it's much simpler to avoid overloading the errors themselves; if we really want some form of annotation, we could decide to expose the necessary helpers from cockroachdb-errors. At this stage, we're better off keeping it simple and learning from the usage patterns that we'll see emerging.

Using the error message as the bucket name

It's really easy to change an error message, and messing up the error metrics just because you fixed a typo is dangerous. An explicit bucket name is more robust and clearer in its intentions.
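Following up on the log.Init sketch above, here is a minimal, purely hypothetical shape for mySentryErrorMiddleware. The func(zapcore.Entry) error signature is an assumption about what a log middleware/hook could look like, not the actual lib/log API; only the sentry-go and zapcore calls are real library APIs:

```go
package main

import (
	"log"

	"github.com/getsentry/sentry-go"
	"go.uber.org/zap/zapcore"
)

// mySentryErrorMiddleware forwards error-level log entries to Sentry.
// The signature is assumed for illustration, not taken from lib/log.
func mySentryErrorMiddleware(entry zapcore.Entry) error {
	if entry.Level < zapcore.ErrorLevel {
		return nil // only error-level entries count as serious events
	}
	sentry.CaptureMessage(entry.Message)
	return nil
}

func main() {
	// Example DSN placeholder; a real deployment would use its own.
	if err := sentry.Init(sentry.ClientOptions{Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0"}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// The middleware would then be passed to log.Init as in the comment above.
	_ = mySentryErrorMiddleware
}
```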
We're done. At this point, it's just a matter of continuously polishing this to unlock more actionability. I think it'd be better to close the current issue and wrap the remaining work into a bet framed in terms of business impact, or to just finish those items while we're on cooldown. I'll give this some thought and circle back.
Problem to solve
Overall, we have a low signal-to-noise ratio that makes debugging tedious. We can take this opportunity, as a secondary objective, to improve our sensitive-data redaction in error reporting.
Related problem: due to the lack of a unified reporting mechanism, diagnosing issues creates a lot of communication overhead, which can take days in some cases when those communications are subject to the strict internal policies in place at some companies.
Measure of success
Solution summary
We will propose an RFC with a set of conventions for runtime errors, actively seeking feedback from the biggest consumers of errors. We're likely to include documentation tasks and content-production work to improve onboarding on the general topic of errors.
We will audit the codebase to better understand the current state and identify possible sources of debt, and we will propose changes to the error-related packages to make it easier to follow the above conventions and to pass additional data.
Artifacts:
What specific customers are we iterating on the problem and solution with?
Internal Sourcegraph developers and SREs.
Impact on use cases
This effort contributes to the company-wide effort to improve Observability.
Delivery plan
Tracked issues
@bobheadxi: 3.00d
Completed: 3.00d
@jhchabran
Completed
#35879, #36569, #37231