Using watches and handling timeouts #92

Closed
Eric-Fontana-Bose opened this issue Jan 9, 2019 · 15 comments

@Eric-Fontana-Bose

I've been using the https://github.com/abonas/kubeclient library, but I like the flavor of this library better.
I'm a heavy user of the watch API, and when testing out this code:

begin
  MyLog.log.info "Starting watcher..."
  client.api('v1').resource('namespaces').watch do |watch_event|
    puts "type=#{watch_event.type} namespace=#{watch_event.resource.metadata.name}"
  end
  MyLog.log.info "Exited normally."
rescue Exception => e
  MyLog.log.error "Watcher error: #{e}"
end

MyLog.log.info "Finished watcher..."

After roughly 5 minutes the exception handler reports:

ERROR -- : Watcher error: end of file reached (EOFError)

I sort of expected this; if you run kubectl get namespaces -w, it will time out in about the same amount of time.

We are stuck on an older version of Kubernetes (1.8.9), and the Kube API server was getting hammered because the abonas client was not handling the server-side termination of the connection, which caused the API server to back up and affect the cluster.

What is the proper way to handle the timeout/EOFError?

@jakolehm
Contributor

jakolehm commented Jan 9, 2019

I have not seen EOFError with recent k8s-client/kubernetes versions. What if you set a timeout, for example watch(timeout: 600)?

jakolehm added the question (Further information is requested) label on Jan 9, 2019
@Eric-Fontana-Bose
Author

Changing the timeout does not help; it is the Kube API server that terminates the connection.

@jakolehm
Contributor

@kke any ideas?

@jnummelin
Contributor

Why not just catch EOFError and restart the watch?

K8s-client could of course handle that automatically internally...
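
A minimal sketch of what that looks like against the original example (client setup omitted; adjust the rescued error class to whatever your stack actually raises):

begin
  client.api('v1').resource('namespaces').watch do |watch_event|
    puts "type=#{watch_event.type} namespace=#{watch_event.resource.metadata.name}"
  end
rescue EOFError
  # The API server closed the long-lived connection; just start a new watch.
  retry
end

Note that without a resourceVersion the restarted watch replays all existing objects as ADDED events again, which is what the resourceVersion discussion further down is about.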

@kke
Contributor

kke commented Aug 8, 2019

If this is still an issue, feel free to reopen / report again.

kke closed this as completed on Aug 8, 2019
@vitobotta

Hi, I am playing with the watch feature. I got it working for my use case, but the watch quits on its own after a while. What is the best way to keep it alive? Thanks

@cben

cben commented Oct 18, 2019

[brain dump, not sure it's all relevant for k8s-client, HTH]

There are several scenarios for restarting watches.

  • You can restart from the last seen resourceVersion. This probably results in exactly-once delivery, most of the time.

    • A small complication for watching collections: the API takes a resourceVersion for the collection, but individual watch events only carry the individual object's resourceVersion. However, I asked some questions and was told that the last seen version of the last touched object inside the collection is fine to use as the version of the whole collection.
  • When that collection doesn't change much, OR when you resume a significant time later, the server might refuse to resume from that resourceVersion (IIRC the window depends on the etcd version, 5 min / 1000 events, but might include events from other collections).
    I think that gives you a 410 Gone HTTP status (?)
    Now you have 2 choices:

    • Watch from the current moment, without specifying a resourceVersion. There will be an unknown gap you've missed.

    • Get/List and watch from the fresh resourceVersion. Again there will be a gap, but you'll get a fresh state to track from...

I suspect a client lib could automatically retry from the last seen version, but when that's too old it had better surface the error to the caller?
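
A hedged sketch of that retry-then-recover flow in Ruby. The error class and its #code accessor (K8s::Error::API here) are my guess at how this gem surfaces HTTP errors, and handle_event is a placeholder for your own processing, so treat the whole thing as illustrative rather than as the gem's documented behaviour:

last_version = nil

begin
  client.api('v1').resource('pods', namespace: 'default').watch(resourceVersion: last_version) do |watch_event|
    handle_event(watch_event) # placeholder for your own processing
    last_version = watch_event.resource.metadata.resourceVersion
  end
rescue EOFError
  # Connection dropped: resume from the last seen version.
  retry
rescue K8s::Error::API => e
  raise unless e.code == 410
  # Too old to resume: choice 1 above, watch from "now" and accept the gap
  # (or re-list and start from the fresh list's resourceVersion, choice 2).
  last_version = nil
  retry
end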

Compare the Python client discussions kubernetes-client/python#972 and kubernetes-client/python-base#133, specifically the comment about the official Go client: kubernetes-client/python-base#133 (comment).

@vitobotta

Hi @cben! How do I restart from the last seen resourceVersion? At the moment I first save the list of the existing resources (Velero backups) in an array, and then when the watch starts I ignore the backups that already existed. But it sounds like it would be much better if I could restart the watcher from where it left off. How can I do this with this gem? Thanks!

@vitobotta

OK, I see that I can set the resourceVersion as a parameter for the watch method. But what value do I specify initially, so that I don't just get the whole list of the existing resources?

@jnummelin
Contributor

jnummelin commented Oct 18, 2019

@vitobotta Not really sure what you are trying to achieve, but usually these are handled with something like:

last_seen_resource = 0
begin
  client.api('v1').resource('pods', namespace: 'default').watch(resourceVersion: last_seen_resource) do |watch_event|
    puts "type=#{watch_event.type} pod=#{watch_event.resource.metadata.name}"
    last_seen_resource = watch_event.resource.metadata.resourceVersion
  end
rescue EOFError # or something a bit more specific maybe :)
  retry # makes the watch start again from the last seen resource
end

So yes, initially when your app starts, you need to get all the resources through the watch. If you are "syncing" the status with something external, your app needs to decide what to do in case the resource has already been seen in the past and possibly exists in the external thingy.

@vitobotta

I ended up doing a list first, getting the max resource version and using that. Seems to work. Thanks! :)

@cben

cben commented Oct 22, 2019

Small correction: if doing a list, you should use the whole list's resourceVersion.

Kubernetes devs are pretty adamant that a resourceVersion "MUST be treated as opaque"; it's a string. While so far it has been a number, you shouldn't assume that, shouldn't interpret it, and thus can't compute the "max" of several versions. But that's why any FooList response, in addition to each item having a resourceVersion, also has a top-level resourceVersion; use that for the initial watch.
(https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md)


But your instinct was a good one, pointing to a subtle distinction between list and watch:

  • list items are returned in arbitrary order, so you can't just take the version from the last item. You have to take the "logical max", which, as said above, you're not supposed to compute yourself but which is provided to you at the top level of the list.
    (In many cases several collections share a single version counter, so you may get a top-level version greater than that of any item in the list. That's OK; use the top-level version.)
  • watch events normally arrive in increasing order, sent as they happen, so you can take the last seen value from an item.
  • Annoyingly, when you start a watch without specifying an initial resourceVersion, it first lists all existing items, returning them as fake "ADD" events, and apparently that initial burst can come in arbitrary order :-(
    See the discussion "Old events from the past yielded due to remembered resource_version" kubernetes-client/python#819.

@vitobotta

Hi @cben! How do I get the resourceVersion for the list? I tried resource_version = k8s_client.api("velero.io/v1").resource(resource_type.to_s, namespace: velero_namespace).list.resourceVersion, but it gives me an undefined method error on the array. Thanks

@vitobotta

Found it! It's meta_list.metadata.resourceVersion, isn't it? I made that change and it seems to work: it no longer returns the existing resources as events when I start the watch, only new events. Thanks! :)
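
For reference, a sketch of that list-then-watch flow as described in this thread (the velero.io/v1 group, the 'backups' resource and velero_namespace are just the example from above; double-check meta_list against the gem version you run):

backups = k8s_client.api('velero.io/v1').resource('backups', namespace: velero_namespace)

# meta_list returns the collection with its top-level metadata, so the
# list's own resourceVersion can seed the watch instead of a per-item "max".
start_version = backups.meta_list.metadata.resourceVersion

begin
  backups.watch(resourceVersion: start_version) do |watch_event|
    puts "type=#{watch_event.type} backup=#{watch_event.resource.metadata.name}"
    start_version = watch_event.resource.metadata.resourceVersion
  end
rescue EOFError
  retry # resume from the last seen version instead of replaying existing backups
end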

@cben

cben commented Oct 23, 2019

Yes, that's the one I meant.
