WatchResponse.Canceled is not triggered to true #8231

Closed
mwf opened this issue Jul 7, 2017 · 18 comments · Fixed by #8283

@mwf

mwf commented Jul 7, 2017

Hi team!

There is some inconsistency between the clientv3 Watch documentation and the actual behavior!

The docs say:

    // Canceled is used to indicate watch failure.
    // If the watch failed and the stream was about to close, before the channel is closed,
    // the channel sends a final response that has Canceled set to true with a non-nil Err().
    Canceled bool

But in the case of revision compaction you receive a response with Canceled = false, and then the channel is unexpectedly closed!

Try this (assume revision 1 has already been compacted):

prefix := "foo"
watchCh := client.Watch(context.Background(), prefix, clientv3.WithPrefix(), clientv3.WithRev(1))
for resp := range watchCh {
	if resp.Canceled {
		log.Printf("watch channel is closing")
		panic(resp.Err())
	}
	if err := resp.Err(); err != nil {
		// there is some non-critical error without channel close, just log it
		log.Printf("error watching '%s': %s", prefix, err)
		log.Printf("canceled: %t", resp.Canceled)
		continue
	}
	// do something with resp.Events here
}
log.Println("exit")

The result will be:

2017/07/07 19:19:10 error watching 'foo': etcdserver: mvcc: required revision has been compacted
2017/07/07 19:19:10 canceled: false
2017/07/07 19:19:10 exit

According to the documentation, I expect Canceled to be set to true every time the response channel is about to be closed.

The problem exists for both etcd 3.1.x and 3.2.x clients.

What do you think about it?

@xiang90
Contributor

xiang90 commented Jul 8, 2017

@heyitsanthony

I think the comment is wrong? Should the user rely on the error instead? Or should we switch to always setting Canceled to true when there is an error?

@mwf
Author

mwf commented Jul 8, 2017

There is also the question: can there be errors which don't lead to the channel closing? If so, the Canceled flag is useful.
If any error closes the channel, then what is the purpose of the Canceled flag? The user can just check Err() and that's all.
Maybe the original thought was that there could be some "recoverable errors" that don't close the channel, but how should the user react to them? Just log and continue processing the event stream?

@heyitsanthony
Contributor

There aren't any errors that won't also close the watch, but it might be nice to have them in the future. The server doesn't set Canceled when closing the channel on compaction; it only sets the compact revision, so the client watch response won't set Canceled either.

It seems like the server should set both Canceled and the compact revision on watch compaction. This won't break any existing clients afaik, so that's probably the fix here.
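
Until that fix lands, a client-side workaround is possible: treat a non-zero CompactRevision the same as Canceled, since the compaction response currently arrives with Canceled = false. A minimal sketch, assuming an existing *clientv3.Client named client and a hypothetical handleEvent helper:

// Workaround sketch: a non-zero CompactRevision also means the channel is about to close.
watchCh := client.Watch(context.Background(), "foo", clientv3.WithPrefix(), clientv3.WithRev(1))
for resp := range watchCh {
	if resp.Canceled || resp.CompactRevision != 0 {
		log.Printf("watch closing: %v (compacted to rev %d)", resp.Err(), resp.CompactRevision)
		break
	}
	for _, ev := range resp.Events {
		handleEvent(ev) // hypothetical helper that applies the event
	}
}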

@mwf
Author

mwf commented Jul 10, 2017

BTW, Anthony, I've got another concern. You say it's the server that closes the channel on compact. When does it do this? We hit this issue a few days ago: we had some long-lived watches for service discovery and etcd compaction once per 24 hours, and suddenly the services saw their watch channels close - though I played around and could not reproduce the case.

If you start watching from some revision and then compact on the server, the watch channel is still alive and keeps receiving events! I couldn't get it closed by a compaction.

And what's the purpose of closing the channel on compact? I don't care whether the revision I start watching from has been compacted or not - I just need the kv updates starting from a particular revision.

So could you please describe the mechanics behind watch + compact in detail?

@mwf
Author

mwf commented Jul 11, 2017

@heyitsanthony ping :)

@xiang90
Contributor

xiang90 commented Jul 11, 2017

service discovery and etcd compaction once per 24 hours. And

It closes the connection if the watcher wants to watch (or is watching) compacted revisions.

@mwf
Author

mwf commented Jul 11, 2017

It closes the connection if the watcher wants to watch (or is watching) compacted revisions.

Yes, it's OK to close the connection with an error if you try to start a watch on a compacted revision. And it does so.
But as I said above, it also closes the connection after the watch has already been started.

  1. I could not reproduce it - I started a watch on rev. 3000, for example, then made some inserts, compacted the store to 3010, made another insert - and the client received the event.
  2. But the fact is that we received the etcdserver: mvcc: required revision has been compacted error while the watch had already been running for several days. We saw the errors in the logs. The watch was not reinitialized - in that case we would have seen additional log lines about the watch being restarted for some reason.

That's why I'm asking whether such a situation - a compaction error during an active watch - can actually occur. Is it intended behaviour or some kind of bug? It's doubly strange that we have been using this code for several months and only hit this behaviour now.

@mwf
Author

mwf commented Jul 11, 2017

Because it seems strange to me to close a watch just because the revision you started watching from has been compacted. What's the point?

@xiang90
Contributor

xiang90 commented Jul 11, 2017

Because it seems strange to me to close a watch just because the revision you started watching from has been compacted. What's the point?

can you reproduce this? if yes, then it is a bug.

That's why I'm asking whether such a situation - a compaction error during an active watch - can actually occur.

it can happen if your watcher is slow and lagging and you compact at the current head. if you believe this is not the case, either reproduce it or look into the code to fix it?

@mwf
Author

mwf commented Jul 11, 2017

Yep, I agree, it should be reproducible to be considered a bug. If I find a way to reproduce it, I'll tell you. It was only important for me to make sure that you don't consider it expected behaviour either :)

and you compact at the current head

But there is no other out-of-the-box way to compact, only at the current head with auto-compaction-retention. BTW, do you have any plans to make it more flexible? For example, another flag auto-compaction-grace=24 that cleans the revisions with a timestamp older than 24 hours, i.e. keeping the last 24 hours of history. Along with a retention of 24, the 24-48 hour period would be kept.
If you like the idea, I could create a separate issue for this and maybe implement it.

@xiang90
Contributor

xiang90 commented Jul 11, 2017

to clean the revisions with a timestamp older than 24 hours?

this is what etcd does now. it compacts to the revision from 24 hours ago, basically keeping 24 hours of history.

@mwf
Author

mwf commented Jul 11, 2017

Oh, great then. I thought it was just the time interval at which head compaction is triggered. The current documentation doesn't explain it well enough :)

@xiang90
Contributor

xiang90 commented Jul 11, 2017

@mwf maybe help us improve the documentation? thanks :P

@mwf
Author

mwf commented Jul 11, 2017

Easily :)
I'll check tomorrow that it really keeps the historical revisions, and if it does, I'll improve the doc.

@heyitsanthony heyitsanthony added this to the v3.3.0 milestone Jul 18, 2017
@heyitsanthony heyitsanthony self-assigned this Jul 19, 2017
heyitsanthony added a commit to heyitsanthony/etcd that referenced this issue Jul 19, 2017
heyitsanthony added a commit to heyitsanthony/etcd that referenced this issue Jul 25, 2017
visheshnp pushed a commit to visheshnp/etcd that referenced this issue Aug 3, 2017
@newhook

newhook commented Aug 22, 2017

I believe I received this error today:

Aug 21 19:14:34 staging-shard disco_daemonize[13226]: daemonize 2017/08/21 19:14:34.847623 etcd error: etcdserver: mvcc: required revision has been compacted

on a watch that had been active for several hours.

Here is the prior compaction in the etcd logs. I would have started the watch on revision 9.

2017-08-22 18:30:06.830156 N | compactor: Starting auto-compaction at revision 51
2017-08-22 18:30:06.831045 N | compactor: Finished auto-compaction at revision 51

So is this not expected behavior? The server is running v3.1.9; the client API is the latest code.

@heyitsanthony
Contributor

@newhook is there an easy way to reproduce this? I think what happened was: the watch was waiting for an event with rev >= 9, the kv store was compacted to rev=51, the watch disconnected, then reconnected and got a compaction error. Using WithProgressNotify() might help.
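
For reference, a minimal sketch of the WithProgressNotify() approach, assuming an existing client and a starting revision rev: the server periodically sends empty progress responses, so the watcher's known revision keeps advancing even when the key never changes, making a reconnect less likely to land behind the compaction.

// Sketch: request periodic progress notifications for an otherwise idle watcher.
watchCh := client.Watch(context.Background(), "xyz",
	clientv3.WithRev(rev+1),
	clientv3.WithProgressNotify())
for resp := range watchCh {
	if resp.IsProgressNotify() {
		// empty response; resp.Header.Revision reflects the current store revision
		continue
	}
	// handle resp.Events and resp.Err() as usual
}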

@newhook

newhook commented Aug 22, 2017

That makes sense; is that expected to occur? This is happening in a GCE private network, so I doubt there are underlying network issues.

I think I can reproduce it pretty easily.

If I don't use WithProgressNotify, what is the best way to fix this issue?

What this code is doing is:

  • fetch configuration data.
  • watch for changes on the data.
  • if the data changes, reconfigure.

So it's implemented like so:

  1. Fetch key xyz. Returns data, revision.
  2. Load configuration data using data.
  3. Watch key xyz starting at given revision+1.
  4. If the data changes, reload the configuration.

The revision here ensures that I don't miss any changes after (1) and before (3).

A couple of ideas to fix the problem:

  • Refetch the initial set of keys and then re-create the watcher with the new revision.
  • Re-create the watcher using the CompactRevision in the previous WatchResponse.

@heyitsanthony
Contributor

Refetch the initial set of keys and then re-create the watcher with the new revision.

This is how it's usually done. It also covers the case where the client is partitioned and there is a compaction, which isn't helped by WithProgressNotify.
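
A rough sketch of that refetch-and-rewatch loop, assuming an existing client and hypothetical applyConfig/applyChange helpers:

// Sketch: snapshot the data, watch from the next revision, and start over whenever the watch dies.
for {
	getResp, err := client.Get(context.Background(), "xyz")
	if err != nil {
		log.Printf("get failed: %v", err)
		time.Sleep(time.Second)
		continue
	}
	applyConfig(getResp.Kvs) // hypothetical: rebuild configuration from the snapshot

	rev := getResp.Header.Revision + 1 // watch from the next revision so nothing is missed
	for resp := range client.Watch(context.Background(), "xyz", clientv3.WithRev(rev)) {
		if err := resp.Err(); err != nil {
			log.Printf("watch error: %v", err) // e.g. the compaction error discussed above
			break
		}
		for _, ev := range resp.Events {
			applyChange(ev) // hypothetical: apply the delta to the configuration
		}
	}
	// the watch channel closed (compaction, etc.): refetch and re-create the watcher
}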
