Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not catch throwable #19231

Merged
merged 1 commit into from
Jul 4, 2016
Merged

Do not catch throwable #19231

merged 1 commit into from
Jul 4, 2016

Conversation

jasontedor
Copy link
Member

@jasontedor jasontedor commented Jul 3, 2016

Today throughout the codebase, catch throwable is used with reckless
abandon. This is dangerous because the throwable could be a fatal
virtual machine error resulting from an internal error in the JVM, or an
out of memory error or a stack overflow error that leaves the virtual
machine in an unstable and unpredictable state. This commit removes
catch throwable from the codebase and removes the temptation to use it
by modifying listener APIs to receive instances of Exception instead of
the top-level Throwable.

@jasontedor
Copy link
Member Author

This PR is blocked on a release of SecureSM that incorporates elastic/securesm#4, but I think that reviews can start, and the PR also helps clarify the intent of elastic/securesm#4.

@s1monw
Copy link
Contributor

s1monw commented Jul 3, 2016

I am not sure we should block this aweseom cleanup by the securesm changes. I'm likely -1 on having multiple exit points in the JVM but most of this PR is awesome. Can we detach the two?

@jasontedor jasontedor changed the title Die with dignity Do not catch throwable Jul 3, 2016
@jasontedor
Copy link
Member Author

Can we detach the two?

I've detached the uncaught exception handler changes and will open a second PR on top of master (for discussion) when this PR is integrated.

Today throughout the codebase, catch throwable is used with reckless
abandon. This is dangerous because the throwable could be a fatal
virtual machine error resulting from an internal error in the JVM, or an
out of memory error or a stack overflow error that leaves the virtual
machine in an unstable and unpredictable state. This commit removes
catch throwable from the codebase and removes the temptation to use it
by modifying listener APIs to receive instances of Exception instead of
the top-level Throwable.
@s1monw
Copy link
Contributor

s1monw commented Jul 4, 2016

LGTM lets get this in! I really wonder how we can keep this under control maybe we can add some forbidden API magic?

@agirbal
Copy link

agirbal commented Jul 11, 2016

Can't this change potentially make it harder on operations to deal with ES? I am thinking of case where you get a domino effect of nodes just dying, and the fewer nodes are up the more likely the others will die. Factor in the fact that with some customers it can take many minutes for ES to restart.
Back in the day I was working on a Saas, following many incidents of domino effects, we tried really hard to always salvage the JVM, would it be only to troubleshoot it. Upon certain exceptions (IO error from disk, OOM) it would enter a degraded state and report it to rest of cluster to stop receiving requests or just pass through only (i.e. client mode). From there one could flush its caches, check the instance and potentially put it back in normal state.

@nik9000
Copy link
Member

nik9000 commented Jul 11, 2016

If the JVM throws an OOM then salvaging it is likely to put it in an unexpected state. That is scary enough that it is worth crashing I think.

IO errors are still caught and salvaged.

I agree that we need to prevent these OOMs from happening and we do actively work on that.

@jasontedor
Copy link
Member Author

jasontedor commented Jul 11, 2016

Can't this change potentially make it harder on operations to deal with ES?

I think that it makes it easier because there is no safe recovery from virtual machine errors like OOM. In the past, we would silently discard these fatal errors leaving the JVM in an unpredictable state. In that case, operators should be restarting their instances, but they didn't have a way of reliably knowing if the JVM experienced such an error or not. With this change and #19272, we've taken this burden away from operators and removed a potential source of corruption and other resiliency issues.

Back in the day I was working on a Saas, following many incidents of domino effects, we tried really hard to always salvage the JVM, would it be only to troubleshoot it.

When the JVM throws an OOM it enters a questionable and unexpected state, the point is that it can not be salvaged.

From there one could flush its caches, check the instance and potentially put it back in normal state.

We can not reliably do this after the JVM has entered such a state, there are no guarantees that we can safely do anything at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants