feat: Add telemetry event for uncaught exceptions #203
Conversation
Force-pushed from c096d52 to f069fc5
@@ -180,6 +181,7 @@ def filter(self, record: logging.LogRecord) -> bool:
             raise
         else:
             _logger.critical(e, exc_info=True)
+            record_uncaught_exception_telemetry_event(exception_type=str(type(e)))
This is good for the main thread, but there are other threads in flight. An exception in one of those may not propagate to the main thread. It'd be worth discussing with @jusiskin to see where else the agent should be recording these events.
I did speak with Josh, and he figured this was a good place to start; we can expand on it in the future.
To quote Josh:
there are other places where we handle exceptions and try to be resilient. those we’d have to instrument as SMEs
So essentially, let's get this in for now, and we can improve on all of the nitty-gritty error cases when we have the time.
Works for me. Something's better than nothing.
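To illustrate the thread concern raised above: a minimal sketch, not part of this PR, of how exceptions raised on non-main threads could also be recorded, using `threading.excepthook` (Python 3.8+). The telemetry helper below is a stand-in for the one the diff calls; every other name is assumed for illustration.

```python
# Minimal sketch, not part of this PR: recording uncaught exceptions from
# non-main threads via threading.excepthook (Python 3.8+). The telemetry
# helper below is a stand-in for the real one this PR calls.
import threading


def record_uncaught_exception_telemetry_event(*, exception_type: str) -> None:
    """Stand-in for the real telemetry helper."""
    print(f"telemetry: uncaught exception {exception_type}")


_default_thread_excepthook = threading.excepthook


def _telemetry_thread_excepthook(args: threading.ExceptHookArgs) -> None:
    # Record the event, then fall through to the default hook so the
    # traceback is still reported as usual.
    record_uncaught_exception_telemetry_event(exception_type=str(args.exc_type))
    _default_thread_excepthook(args)


threading.excepthook = _telemetry_thread_excepthook

# An exception raised on a worker thread now also produces a telemetry event.
t = threading.Thread(target=lambda: 1 / 0)
t.start()
t.join()
```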
Force-pushed from 52a4142 to da2ac0c
Force-pushed from da2ac0c to fcd6af3
What was the problem/requirement? (What/Why)
We don't have a mechanism to tell whether one version of the worker is more error-prone than others.
What was the solution? (How)
Capture telemetry on uncaught exceptions so we can get an idea of the types of errors customers may be encountering that we're not handling properly. This uses changes from aws-deadline/deadline-cloud#205.
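For context, a rough sketch of the pattern this change introduces, using assumed names apart from `record_uncaught_exception_telemetry_event` (whose real implementation comes from the deadline-cloud change above): catch the otherwise-unhandled exception, log it at CRITICAL, and record a telemetry event keyed by the exception type.

```python
# Rough sketch, with assumed names, of the pattern this change introduces:
# when the worker hits an exception it was not written to handle, log it at
# CRITICAL and record a telemetry event keyed by the exception type.
import logging

_logger = logging.getLogger(__name__)


def record_uncaught_exception_telemetry_event(*, exception_type: str) -> None:
    """Stand-in for the helper provided by the deadline-cloud telemetry change."""
    print(f"telemetry: uncaught exception {exception_type}")


def do_work() -> None:
    """Hypothetical worker step that may raise."""
    raise RuntimeError("unexpected failure")


def main() -> int:
    try:
        do_work()
    except Exception as e:
        # Same call shape as the diff: only the exception type is recorded.
        _logger.critical(e, exc_info=True)
        record_uncaught_exception_telemetry_event(exception_type=str(type(e)))
        return 1
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```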
What is the impact of this change?
We have a better idea of whether customers are hitting unintended errors while using the worker.
How was this change tested?
Was this change documented?
Updated the README
Is this a breaking change?
No