
Empty catch block in watch stream event handling #593

Closed
dominykas opened this issue Feb 24, 2021 · 17 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dominykas
Contributor

dominykas commented Feb 24, 2021

While trying to understand some unrelated issues I spotted the following:

javascript/src/watch.ts

Lines 121 to 123 in 5988253

} catch (ignore) {
// ignore parse errors
}

It seems it got introduced in https://github.com/kubernetes-client/javascript/pull/289/files - but the reasoning is unclear to me. Can someone comment on this please?

I might be wrong, of course, but it feels to me that if unparseable messages are possible, then they need to be handled? And if they can't occur, then the try/catch might not be necessary, i.e. it is likely better for the app to crash and log the details for people consuming this than to fail silently?

More importantly, this will also silence user errors inside the callback, so if someone has a bug in their implementation, they would never see it?
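To illustrate the concern, here is a minimal hypothetical reproduction of the pattern (not the actual watch.ts code): because the catch wraps both the JSON.parse and the user callback, an exception thrown by the callback is swallowed as if it were a parse error.

```javascript
// Hypothetical sketch of the pattern under discussion: the try/catch
// wraps both the parse *and* the user callback.
function handleLine(line, callback) {
  try {
    const data = JSON.parse(line);
    callback(data.type, data.object);
  } catch (ignore) {
    // ignore parse errors -- but this also eats callback errors
  }
}

let seen = null;
handleLine('{"type":"ADDED","object":{}}', (type) => { seen = type; });
// A buggy consumer callback throws here, but the error never surfaces:
handleLine('{"type":"ADDED","object":{}}', () => { throw new Error('user bug'); });
```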

@dominykas
Contributor Author

OK, I can see there's an explicit test for that:

it('should ignore JSON parse errors', async () => {

Introduced as part of #250.

I'm still not sure whether that's the correct behavior? If the HTTP connection got terminated and sent incomplete JSON, shouldn't it fail? The original commit also says this:

An alternative approach to this commit would invoke the done callback with parse errors, but that could result in multiple invocations of the done callback.

It seems the done callback now does have a guard around multiple callbacks? So possibly that's a better approach?
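The guard could look roughly like this (a hypothetical sketch of a call-at-most-once wrapper, not the library's actual code):

```javascript
// A "call done only once" guard: the first invocation wins; later
// invocations (e.g. a parse error followed by the stream closing) are ignored.
function makeOnce(done) {
  let called = false;
  return (err) => {
    if (called) return;
    called = true;
    done(err);
  };
}

let calls = 0;
const doneOnce = makeOnce(() => { calls += 1; });
doneOnce(new Error('parse error')); // reported
doneOnce(null);                     // ignored: done has already fired
```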

@brendandburns
Contributor

If the network fails half-way through sending a response, is that really an error? My answer is 'no'.

The watch will get closed for other reasons (e.g. the network stream stopping) and I don't think we want to issue multiple done callbacks in that case.

Is this causing problems?

@dominykas
Contributor Author

Is this causing problems?

It's a silent catch, so truth be told I don't know whether it is or isn't causing issues. I was looking into some problems which could be ascribed to #559, but it may as well be this.

And more importantly - it also wraps around the callback() so it would silence any errors inside the consumer code - I think the callback() should definitely move out of the try block?

If the network fails half-way through sending a response, is that really an error? My answer is 'no'

Just so I understand this properly... what you're saying is that without this silent catch, it would call done with an error when parsing fails, which could (will?) happen before the connection actually closes, meaning that the user would see "failed to parse" instead of "network disconnected"?

@brendandburns
Contributor

brendandburns commented Mar 2, 2021

Regarding wrapping the user callback, that is a fair request. I guess it's arguable whether a user callback failing should kill the watch or not. My feeling is that a user should catch all their own errors (e.g. callback() should never throw). This is roughly akin to the way that Java works where if you throw in the middle of a thread Java just eats the exception. Taking out the watch because of an exception is a pretty extreme thing to do imho.

I suppose that we could always log any exceptions that come out of the callback, but keep running. That seems reasonable, would that work for you?

And yes, the sequence goes like this: half a JSON object is streamed over TCP, the TCP connection is severed, the half object fails to parse (since it's only half an object), and then, because TCP is severed, the network done is issued.
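This failure mode is easy to reproduce in isolation (the event shape below is made up for the example):

```javascript
// Half a JSON object arrives before the connection drops; JSON.parse
// throws on the truncated text.
const full = '{"type":"ADDED","object":{"metadata":{"name":"pod-1"}}}';
const truncated = full.slice(0, 20); // connection severed mid-object

let parseFailed = false;
try {
  JSON.parse(truncated);
} catch (err) {
  parseFailed = true; // without the silent catch, this would surface
}
```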

@dominykas
Contributor Author

My feeling is that a user should catch all their own errors (e.g. callback() should never throw).

Is "never throw" a reasonable expectation to ask of a user of a library? It's very hard to achieve in JS land, simply because it's very hard to implement a catch() that can't itself throw, short of keeping it empty (a no-no) or calling process.exit() directly.

There's also the question of responsibilities - if the userland code inside callback throws, then is it up to @kubernetes/client-node to decide what to do with that?

I think the advice to "crash on exceptions" has really stood the test of time in Node land: you can't trust the state of your application after an error, so it's best to restart from scratch. An uncaughtException and the process dying is not an unexpected outcome for someone writing code in Node?

One other way to address this would be to call the callback inside a setTimeout(), but that might break some expectations around sync/async behavior so that's probably not very desirable either.

In the end, I'd argue that the callback() should be called outside of the try/catch block. If it's just a log entry (or even an error event somewhere), then you won't see your app is broken unless you're looking.

And yes, the sequence goes like this: half a JSON object is streamed over TCP, the TCP connection is severed, the half object fails to parse (since it's only half an object), and then, because TCP is severed, the network done is issued.

OK, thanks for clarifying this. I'll try to think about this and I'll try to come back with some suggestions on different ways to achieve this goal if that's ok?

The things I'd like to check / ideas to explore (note to self, mostly):

  • Is this how Node streams work? i.e. if there's an incomplete packet, will it even emit a data event?
  • Is this how byline works? i.e. Node streams do not guarantee packet boundaries, iirc, so you could receive a data event without a newline - will it buffer it until the newline arrives? In that case, it's impossible that it would emit the data event with incomplete JSON (as long as the k8s API does not send incomplete JSON with a newline inside).
  • I saw there was some history with json-stream - maybe there are alternative modules that can handle this better? If my memory is right about boundaries not being guaranteed, a large chunk of JSON could also arrive via two data packets, so buffering and not emitting until a full message is received is probably a better approach in terms of how the stream itself should work? I vaguely recall there are modules that can do this...
  • Would changing the callback to an errback (i.e. standard callback with an err first) make sense? If the JSON.parse fails - you callback(parseErr) and then whatever is using the watcher can decide what to do? e.g. the informer could decide to reconnect instead of reporting that upstream?
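The errback idea from the last bullet could be sketched like this (handleLine and the event shape here are hypothetical, not the library's actual API):

```javascript
// Errback-style variant: parse errors go to the consumer as `err`, and
// the consumer (e.g. an informer) decides whether to reconnect or report.
function handleLine(line, callback) {
  let data;
  try {
    data = JSON.parse(line);
  } catch (parseErr) {
    callback(parseErr);
    return;
  }
  // The user callback runs *outside* the try/catch, so consumer bugs
  // are not swallowed as "parse errors".
  callback(null, data.type, data.object);
}

const events = [];
handleLine('{"type":"ADDED","object":{}}', (err, type) => events.push(err ? 'error' : type));
handleLine('{"type":"MODIF', (err) => events.push(err ? 'error' : 'ok'));
```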

Sorry for the wall of text / being this particular about it - just looking for the best approach here to avoid surprises long term.

@brendandburns
Contributor

I'm definitely open to proposals/ways to make this obey the principle of least surprise.

@brendandburns
Contributor

"Crash on exceptions" really assumes that your application is single threaded (and yes, I know that Node apps are single threaded, but in practice callbacks and promises create the illusion of multiple threads).

It's totally unclear to me that you want a watch crashing (especially if it's a background thread) to take down your entire application.

@brendandburns
Contributor

But I agree that making it explicit for the user, and not forcing the decision on them is a good thing. So perhaps just a parameter ("abort on exception"?)

@dominykas
Contributor Author

dominykas commented Mar 17, 2021

Just to make sure we're discussing the same thing - I'm explicitly talking about calling the callback() inside the try/catch (I still need to follow up the streams issue in more detail).

"Crash on exceptions" really assumes that your application is single threaded (and yes, I know that node apps are single threaded, but in practice callbacks and promises create the illusions of multiple different threads)

It's totally unclear to me that you want a watch crashing (especially if it's a background thread) to take down your entire application.

I'm not sure I agree with this. I can see how this might make sense in Java or .NET, but that's just not how Node applications are canonically written (or even should be)?

Libraries, including built-ins, should not attempt to handle errors in user code. That is strictly outside their boundaries? Take this for example:

> require('fs').readFile('./package.json', (err, res) => { throw new Error('here'); })
undefined
> Uncaught Error: here

readFile just can't possibly know what's inside the callback and simply rethrows the exception. It's an unfortunate fact that in JavaScript there is no clean way to handle such exceptions (aside from process.on('uncaughtException') in Node), but that's just how it works, and I would like to believe that the ecosystem adapted accordingly years ago? Browsers do the exact same thing with DOM events, etc.

So perhaps just a parameter ("abort on exception"?)

I don't think that it's worth it in the context of the callbacks (although worth pondering over as an option for the JSON.parse case).

The default assumed behavior in Node.js is that if user's callback throws, then libraries don't handle that. The user always has an option to explicitly wrap their own callbacks in a try/catch (although I'd personally very strongly advise against that, unless you know what you're doing).


I spent the last week rewriting a large chunk of our code to replace GoDaddy's kubernetes-client (which no longer seems to be maintained) with this library. I'm running unit tests against kind, and informers were a pleasure to work with, except for this try/catch... I stopped counting the times when my code (i.e. the code inside the callback) was broken and therefore throwing, and the tests just hung without any output...

@dominykas
Contributor Author

dominykas commented Mar 17, 2021

One more example from the Node.js world: if you have a stream (any stream from the core), and it emits an error event, and you do not have an error handler - it will crash the process. It's a developer error. I'd argue that the same should apply to Informer - should probably discuss that in a separate issue though.

But generally, silent failures are the worst.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@dominykas
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 16, 2021
@dominykas
Contributor Author

Can I open a PR with some suggestions to address this? We can continue the discussions there?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 14, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
