Unnecessary `error` events in `ListWatch` (cache/informer) #881

dominykas · 2022-09-19T10:05:00Z

Describe the bug
As of 0.17.0, there is an increased number of error events emitted from the ListWatch, which results in, if you follow the example code, in unnecessary restarts of the informer.

Client Version
0.17.1

Server Version
1.22.x

To Reproduce

Keep the informer open for long enough and observe the behavior:
- In 0.15.0, you will notice a lot of reconnections with calls to /list to repopulate the cache
- In 0.16.0, you will notice "freezes" (likely Informer example requires fix: infomer restarted after onerror event doesn't dispatch other events #795)
- In 0.17.0, you will notice a lot of error events emitted with ECONNRESET, happening after req.abort() is called, i.e. they're no longer relevant.

Expected behavior

The informer should quietly reconnect to the API, restart watches when necessary, repopulate cache by calling listFn() when neccessary

Example Code

Assuming it's reproducible with cache-example.js.

Environment (please complete the following information):

Likely not relevant.

Additional context

Here's my attempt to reconstruct what happened:

Attempt to start list watch from last resourceVersion #711 was merged and released as part of 0.16.0. It included caching improvements (🎉) which avoid calling listFn() when unnecessary.
The above PR has an issue, because it [likely?] resulted in new ERROR events getting emitted via the watch feed - they were not being handled. fix(cache): watch errors must call done handler #781 addressed this, by calling doneHandler() with an error.
- I believe that Informer example requires fix: infomer restarted after onerror event doesn't dispatch other events #795 describes the issue, as well as this comment on #596, as it reports using 0.16.3 and we've experienced similar problems when we've tried to upgrade to 0.16. The fix was included in 0.17.0
The fix in fix(cache): watch errors must call done handler #781 introduced additional calls to doneHandler(), and by virtue of that - to the _stop() method which calls this.request.abort();. Calling request.abort() occasionally results in an ECONNRESET (depending on who closes the socket first - the client or the server) emitted as request.on('error'). In the case of handling the ERROR event by aborting the existing request, this started happening a lot more often in 0.17.0.
- In <=0.15.x, this was likely being hidden by the fact that watch.ts ensures that done() is only called once - i.e. there's likely the stream close or some such event arriving before the request.on('error') is emitted with the ECONNRESET.

Proposed fix options

I haven't yet made up my mind on what is the best approach for the fix :) Internally, we'll probably monkeypatch the _stop() method with the first option until a proper fix is rolled out.

Option 1. Disable the error handler in the ListWatch._stop(), something along the lines of:

+           this.request.removeAllListeners('error');
+           this.request.on('error', () => { /* void */ })
            this.request.abort();
            this.request = undefined;

I don't like this, as it is using removeAllListeners, as it does not have access or knowledge of the doneHandler to only remove the specific handler that Watch.watch() creates. It's fine for my internal purposes, where we're sure we don't have any other error handlers, but might be less so for generic library code?

That said, it's definitely the easiest one and might just be the most practical one, depending on when the 1.x / switch to fetch() is coming?

Option 2. Wrap the req.abort() method in DefaultRequest to remove the error handlers. I don't particularly like this, as it still has the same issue with removeAllListeners not knowing about which specific handler to remove, let alone monkeypatching the abort().
Option 3. Wrap the req.abort() method in Watch.watch() to call off('error', doneCallOnce) for both - the request and the stream. Same as above, but avoids removeAllListeners in favor of an off for a specific callback.

Another downside of all of the three of the above is that they are effectively request specific - not sure if it will be relevant for the fetch() based approach - it is also technically a breaking change, since the RequestResult will require a new method to be exported (removeAllListeners) for people who provide their own requestImpl.

Option 4. Wrap the req.abort() and set an aborted variable inside the closure, and then ignore any request errors that come after it is set (we probably still need to handle stream closure as doneCallOnce(null)?). Same as above - I don't like monkeypatching the abort(), but at least there is no breaking change in the RequestResult.
Option 5. Refactor Watch.watch() in such a way that we can do a Watch.abort() or something like that, so that at least we avoid monkeypatching request.abort(). This is probably the least disruptive option, but it feels like it's a bit more work than the others...

Are there any better ideas for implementation? Happy to take a stab at a PR, if there's an agreement on the approach.

For longer term, I would think that maybe the Watch interface needs a re-haul anyways - it takes two callbacks and returns a promise - that alone feels like it should be an event emitter of its own?

The text was updated successfully, but these errors were encountered:

brendandburns · 2022-09-20T15:26:05Z

@dominykas thanks for the detailed exploration and options!

Given that this.request is private and is obtained via a call to Watch.watch(...) which doesn't expose any way to add additional callbacks to the request, I think that your first option is ok to implement.

Thanks!

dominykas mentioned this issue Sep 23, 2022

Ignore post-abort request errors #884

Merged

k8s-ci-robot closed this as completed in #884 Sep 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unnecessary `error` events in `ListWatch` (cache/informer) #881

Unnecessary `error` events in `ListWatch` (cache/informer) #881

dominykas commented Sep 19, 2022 •

edited

Loading

brendandburns commented Sep 20, 2022

Unnecessary error events in ListWatch (cache/informer) #881

Unnecessary error events in ListWatch (cache/informer) #881

Comments

dominykas commented Sep 19, 2022 • edited Loading

brendandburns commented Sep 20, 2022

Unnecessary `error` events in `ListWatch` (cache/informer) #881

Unnecessary `error` events in `ListWatch` (cache/informer) #881

dominykas commented Sep 19, 2022 •

edited

Loading