Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new approach for plugin refresh #371

Closed
edaniszewski opened this issue Mar 2, 2020 · 3 comments
Closed

A new approach for plugin refresh #371

edaniszewski opened this issue Mar 2, 2020 · 3 comments

Comments

@edaniszewski
Copy link
Contributor

The current approach to plugin refresh is fairly naive. Synse notes that a plugin failed a request and marks it inactive. Later (either on a timer or when forced by user), it will attempt to reconnect/re-issue a command to the plugin to re-establish that it is "active".

I believe the premise of using active/inactive is still useful, as it does help to keep response times low for various requests, but the approach for having plugins be refreshed seems like it could use improvement.

My idea is that instead of having a retry all plugins at once on interval, retry them individually with backoff in a background task. Some details on this approach:

  • Only plugins that are configured should be retried. When defined explicitly in YAML, they will always be retried. When defined via a Discovery method, only plugins which are identified in the latest discovery will be retried (this prevents synse from continuing to try and retry plugin connections for a plugin that is not even there anymore).

  • Each plugin's retry runs in its own Task. Successful retry will put the plugin back into an active state and will terminate the task. The task will continue to run until either a retry succeeds, or the task is cancelled (e.g. if the plugin no longer shows up in discovery).

  • Retry will be done using an exponential backoff algorithm with jitter.

  • Additional data can be captured for the plugin, which can be returned on calls for plugin info. This data could include:

    • downtime: the total time a plugin has been inactive for (could count last and total)
    • uptime: the total time a plugin has been active for (could count last and total)
    • last_disconnect: the last time the plugin disconnected
    • disconnects: the number of times a plugin has transitioned from "active" to "inactive"
    • reconnects: the number of times a plugin has transitioned from "inactive" to "active"

    Not all of the above are data points that I feel are required, but they are data points that we could (and probably therefore should) capture.

This relates to:

@lazypower
Copy link
Contributor

I think this is a good first pass at resolving the issue. Collecting the additional data would also give us metrics to key off of for range based queries/alerts such as multiple plugins disconnecting in < 5m, or disconnects/reconnects inflating. Things like that, which would prompt us to take a look.

As it stands right now, its very much a silent failure and relies on humans to notice the plugin misbehaving. Step one feels like adding observability to it, and then going from there with a more informed decision by the data.

@MatthewHink
Copy link
Contributor

This makes sense to me. Perhaps consider a hard upper limit on exponential backoff since exponential can grow rapidly.

@edaniszewski
Copy link
Contributor Author

For sure.

I'll try to get started on this today. It is a fairly sizable change, so it'll take a few days probably, but I think in the end it'll be worth it both for performance and observability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants