
feat: Add telemetry event for uncaught exceptions #203

Merged
1 commit merged into mainline from marofke/telemetry-updates on Mar 20, 2024

Conversation

marofke (Contributor) commented on Mar 12, 2024:

What was the problem/requirement? (What/Why)

We don't have a mechanism to tell whether one version of the worker is more error-prone than others.

What was the solution? (How)

Capture telemetry on uncaught exceptions so we can get an idea of the kinds of errors customers are hitting that we aren't handling properly. Uses the changes from aws-deadline/deadline-cloud#205.
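For context, the mechanism boils down to catching otherwise-unhandled exceptions at the top of the agent and reporting only the exception type. A minimal sketch of the pattern, assuming the helper from deadline-cloud#205 is importable as shown; the import path and the run_worker entrypoint are illustrative, not the agent's actual code:

```python
import logging

# Helper added in aws-deadline/deadline-cloud#205; this import path is an
# assumption for illustration.
from deadline.client.api import record_uncaught_exception_telemetry_event

_logger = logging.getLogger(__name__)


def run_worker() -> None:
    # Hypothetical stand-in for the worker agent's real entrypoint.
    raise RuntimeError("boom")


def main() -> None:
    try:
        run_worker()
    except Exception as e:
        # Log the full traceback locally, but send only the exception type
        # (e.g. "<class 'RuntimeError'>") as a telemetry event.
        _logger.critical(e, exc_info=True)
        record_uncaught_exception_telemetry_event(exception_type=str(type(e)))
        raise
```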

What is the impact of this change?

We get a better idea of whether customers are hitting unintended errors while using the worker.

How was this change tested?

  • Ran the agent locally and confirmed it wrote telemetry to the backend
  • Unit tests pass, using the changes from deadline-cloud mainline

Was this change documented?

Updated the README

Is this a breaking change?

No

@@ -180,6 +181,7 @@ def filter(self, record: logging.LogRecord) -> bool:
             raise
         else:
             _logger.critical(e, exc_info=True)
+            record_uncaught_exception_telemetry_event(exception_type=str(type(e)))
Contributor commented:

This is good for the main thread, but there are other threads in flight. An exception in one of those may not propagate to the main thread. It'd be worth discussing with @jusiskin to see where else the agent should be recording these events.
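For reference, one way the agent could later cover non-main threads is CPython's `threading.excepthook` (Python 3.8+), which fires when an exception escapes `Thread.run()` and would otherwise never reach the main thread's handler. A hedged sketch, not part of this PR; the telemetry import path is assumed:

```python
import logging
import threading

# Import path assumed for illustration; the helper comes from deadline-cloud#205.
from deadline.client.api import record_uncaught_exception_telemetry_event

_logger = logging.getLogger(__name__)


def _thread_excepthook(args: threading.ExceptHookArgs) -> None:
    # Exceptions that escape Thread.run() never propagate to the main
    # thread's try/except, so record them here instead.
    thread_name = args.thread.name if args.thread is not None else "<unknown>"
    _logger.critical(
        "Uncaught exception in thread %s",
        thread_name,
        exc_info=(args.exc_type, args.exc_value, args.exc_traceback),
    )
    record_uncaught_exception_telemetry_event(exception_type=str(args.exc_type))


# Install once at agent startup, before any worker threads are spawned.
threading.excepthook = _thread_excepthook
```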

marofke (Author) replied on Mar 19, 2024:

I did speak with Josh, and he figured this was a good place to start; we can expand on it in the future.

marofke (Author) added:

To quote Josh:

there are other places where we handle exceptions and try to be resilient. those we’d have to instrument as SMEs

So essentially, let's get this in for now, and we can improve on all of the nitty-gritty error cases when we have the time.

Contributor replied:

Works for me. Something's better than nothing.

@marofke force-pushed the marofke/telemetry-updates branch 3 times, most recently from 52a4142 to da2ac0c on March 19, 2024 at 18:48
@marofke marked this pull request as ready for review on March 19, 2024 at 19:21
@marofke requested a review from a team as a code owner on March 19, 2024 at 19:21
Signed-off-by: Caden Marofke <marofke@amazon.com>
@marofke merged commit 9a17a07 into mainline on Mar 20, 2024
12 checks passed
@marofke deleted the marofke/telemetry-updates branch on March 20, 2024 at 15:09
@marofke changed the title from "fix: Add telemetry event for uncaught exceptions" to "feat: Add telemetry event for uncaught exceptions" on Mar 20, 2024
baxeaz pushed a commit that referenced this pull request Mar 23, 2024
Signed-off-by: Caden Marofke <marofke@amazon.com>