Timeouts can not reconnect if event in requested index has been cleared #31

Open

shannon opened this issue Aug 27, 2014 · 9 comments

shannon commented Aug 27, 2014

If there are no changes in the directory you are watching and the etcd index has changed more than 1000 times, reconnection will continuously fail. This can happen quite often when using Fleet, as it writes heartbeats frequently.

{ [Error: Connection error, reconnecting.]
  error: 
   { [Error: The event in requested index is outdated and cleared]
     errorCode: 401,
     error: 
      { errorCode: 401,
        message: 'The event in requested index is outdated and cleared',
        cause: 'the requested history has been cleared [1300446/1297481]',
        index: 1301445 } },
  reconnectCount: 1 }

{ [Error: Connection error, reconnecting.]
  error: 
   { [Error: The event in requested index is outdated and cleared]
     errorCode: 401,
     error: 
      { errorCode: 401,
        message: 'The event in requested index is outdated and cleared',
        cause: 'the requested history has been cleared [1300466/1297481]',
        index: 1301465 } },
  reconnectCount: 2 }


...

{ [Error: Connection error, reconnecting.]
  error: 
   { [Error: The event in requested index is outdated and cleared]
     errorCode: 401,
     error: 
      { errorCode: 401,
        message: 'The event in requested index is outdated and cleared',
        cause: 'the requested history has been cleared [1316622/1297481]',
        index: 1317621 } },
  reconnectCount: 12 }

I think the reconnection should use the index returned in the 401 error.

@stianeikeland
Owner

Oh, that's not good.

I wonder what the best way to solve this is. We could go directly for the index in the error, but then risk missing some updates.

It seems like fleet just does reconnects with index + 1 until it resyncs. That could cause thousands of requests, though:
https://github.com/bcwaldon/fleet/blob/8a269e880ee561ad1722e9de94d6ccb4e5656562/registry/event.go#L87-L91

I guess we could do (error.index - 999), that might work.
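
Rough sketch of what I mean (resyncIndex is just an illustrative helper, not anything in the library; 1000 is etcd's default event history size, and err.index is the current index reported in the 401 error body, as in the logs above):

// Sketch only: clamp the resync index to just inside etcd's retained
// event window (1000 events by default), based on the current index
// reported in the 401 error.
function resyncIndex(err) {
    return Math.max(err.index - 999, 0);
}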


shannon commented Sep 3, 2014

Sorry, I was away for the weekend.

Yeah, I wondered how they were dealing with it. I think your (-999) approach would work, but only if nothing happens before you get the next request out.

I almost feel as though there are two separate cases here: one where you don't care about the events in between and just need to get the watcher back in place, and one where you are trying to watch every event as it happens. The second case is sort of a lost cause if the watcher is already in this state. Maybe it's best to just make sure the connection is re-established and then emit a warning event of some kind?


tjanczuk commented Oct 2, 2014

We could go directly for the index in the error, but then risk missing some updates.

@stianeikeland Why would this approach miss any updates? If any changes had been made up to the index specified in the error, etcd would respond with those changes. It should therefore be safe to restart polling from that index.

Below is a workaround for the issue that I applied in my application. Note that if I detect that etcd provided a new index in the error response, I issue a new poll immediately by substituting the _watch method for _retry, which uses a backoff timer. Let me know if you think this approach has a conceptual issue.

watch.on('reconnect', function (error) {
    // Work around a bug in node-etcd where the watcher etcd index 
    // becomes so old it is out of scope of the history etcd maintains.
    // Advance the watch index based on information supplied from etcd in the 
    // error response.
    var newIndex;
    if (typeof error.error === 'object' 
        && typeof error.error.error === 'object'
        && !isNaN(error.error.error.index))
        newIndex = error.error.error.index;
    else if (typeof error.headers === 'object'
        && !isNaN(error.headers['x-etcd-index']))
        newIndex = +error.headers['x-etcd-index'];

    if (newIndex !== undefined) {
        // Instead of _retry with a backoff timer, make the
        // watcher issue the request immediately with _watch.
        watch.index = newIndex;
        var oldRetry = watch._retry;
        watch._retry = function () {
            watch._retry = oldRetry;
            watch._watch();
        };
    }
});

@stianeikeland
Owner

I've now added basic resync to trunk.

@tjanczuk I'm thinking about something like the following timeline:

  • client A connects and starts watching 'key' at index 50; the key is currently 'value 1'
  • lots of updates happen on the server, and the index reaches 2000
  • client A is disconnected because of network issues
  • client B updates 'key' to 'value 2'
  • client A's network issues are over; it reconnects and tries watching from 50 again (it has never received an index update)
  • the server says index 50 is outdated and cleared; the current index is 2100
  • client A retries, this time watching from 2100.

client A is now resynced, but missed the key='value 2' update.

Now I just emit a 'resync' event from the watcher, so that you can check manually for missed values if needed.

Maybe we could do the following on an outdated and cleared index:

  • Do a regular get and check the current key/value's index.
  • If our tracking index is lower than the current key/value index, emit the new value.
  • Watch from the error index.
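
Roughly, in code (just a sketch, not the actual watcher implementation; the get is the regular node-etcd get with an etcd v2-style node.value / node.modifiedIndex response, while emitValue and rewatch stand in for the watcher's own emit and long-poll logic, and errorIndex is the index from the 401 error):

// Sketch of the idea above, not the real watcher code.
function resyncAfterCleared(etcd, key, trackedIndex, errorIndex, emitValue, rewatch) {
    etcd.get(key, function (err, res) {
        if (err) return rewatch(errorIndex); // can't compare, just resume from the error index

        // If the key changed while we were out of sync, surface the missed value.
        if (res.node.modifiedIndex > trackedIndex) emitValue(res.node.value);

        // Resume the long poll from the index reported in the 401 error.
        rewatch(errorIndex);
    });
}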

Or am I overthinking things? :)

@stianeikeland
Owner

Published, could you try out version 2.1.5?


tjanczuk commented Oct 2, 2014

I think you are right; what I proposed above would not address the case of a race condition between a broken network connection and updates on the server.

I would suggest a slight optimization over what you describe above:

  1. If the response from the server contains a new etcd index in any shape or form, the next long poll should start listening from that index, without delay. This will be the case when there were no updates to the value in etcd before the server decided to gracefully terminate the long poll.
  2. If the long poll results in an error in which the server does not send a new etcd index (e.g. a broken TCP connection), node-etcd should do a regular get on the value being listened to, compare it to the last value seen, issue a change event if it differs, and then resume long polling from the etcd index in the get response.

As an aside, a useful feature in node-etcd would be a modified event that only fires when the value has actually changed.
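
For example, something like this wrapper (just a sketch of the idea; it assumes the watcher emits 'change' events carrying an etcd v2 node payload, and onModified is a hypothetical helper):

// Sketch of a "modified" convenience: only invoke the handler when the
// value actually changed, by remembering the last value seen.
function onModified(watcher, handler) {
    var lastValue;
    watcher.on('change', function (res) {
        if (res.node.value !== lastValue) {
            lastValue = res.node.value;
            handler(res); // fire only when the value actually changed
        }
    });
}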


tjanczuk commented Oct 2, 2014

Actually, I think I am overthinking it now. If there was an error due to network issues and the index has gone out of scope in the meantime, the next successful request will fail with a 401, and your case here should cover it: https://github.com/stianeikeland/node-etcd/blob/master/src/watcher.coffee#L71-L72

So, looking at your code, it seems everything should work correctly. The one slight optimization that could save a round trip to etcd is to also advance the index in the case of an empty long-poll response (https://github.com/stianeikeland/node-etcd/blob/master/src/watcher.coffee#L75-L76); this saves an occasional 401 for infrequently changing values.
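
Conceptually, something like this in the timeout handling (just a sketch; headers and watch.index are my assumptions about the watcher internals, not the actual code):

// Sketch: when a long poll times out with no event, advance the stored
// index from the X-Etcd-Index response header so a later reconnect does
// not start from an index that has since been cleared.
if (headers && !isNaN(headers['x-etcd-index'])) {
    watch.index = +headers['x-etcd-index'] + 1;
}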


tjanczuk commented Oct 2, 2014

Version 2.1.5 appears to perform well for me in a long-running, infrequently-changing scenario. Thanks for the quick turnaround!

@stianeikeland
Owner

Thanks! I also agree with your optimizations.
