A new approach for plugin refresh #371

edaniszewski · 2020-03-02T16:04:15Z

The current approach to plugin refresh is fairly naive. Synse notes that a plugin failed a request and marks it inactive. Later (either on a timer or when forced by user), it will attempt to reconnect/re-issue a command to the plugin to re-establish that it is "active".

I believe the premise of using active/inactive is still useful, as it does help to keep response times low for various requests, but the approach for having plugins be refreshed seems like it could use improvement.

My idea is that instead of having a retry all plugins at once on interval, retry them individually with backoff in a background task. Some details on this approach:

Only plugins that are configured should be retried. When defined explicitly in YAML, they will always be retried. When defined via a Discovery method, only plugins which are identified in the latest discovery will be retried (this prevents synse from continuing to try and retry plugin connections for a plugin that is not even there anymore).
Each plugin's retry runs in its own Task. Successful retry will put the plugin back into an active state and will terminate the task. The task will continue to run until either a retry succeeds, or the task is cancelled (e.g. if the plugin no longer shows up in discovery).
Retry will be done using an exponential backoff algorithm with jitter.
Additional data can be captured for the plugin, which can be returned on calls for plugin info. This data could include:
- downtime: the total time a plugin has been inactive for (could count last and total)
- uptime: the total time a plugin has been active for (could count last and total)
- last_disconnect: the last time the plugin disconnected
- disconnects: the number of times a plugin has transitioned from "active" to "inactive"
- reconnects: the number of times a plugin has transitioned from "inactive" to "active"
Not all of the above are data points that I feel are required, but they are data points that we could (and probably therefore should) capture.

This relates to:

Synse v3 - Devices drop out of scan, then reappear. #362

The text was updated successfully, but these errors were encountered:

lazypower · 2020-03-02T16:12:33Z

I think this is a good first pass at resolving the issue. Collecting the additional data would also give us metrics to key off of for range based queries/alerts such as multiple plugins disconnecting in < 5m, or disconnects/reconnects inflating. Things like that, which would prompt us to take a look.

As it stands right now, its very much a silent failure and relies on humans to notice the plugin misbehaving. Step one feels like adding observability to it, and then going from there with a more informed decision by the data.

MatthewHink · 2020-03-02T16:43:26Z

This makes sense to me. Perhaps consider a hard upper limit on exponential backoff since exponential can grow rapidly.

edaniszewski · 2020-03-02T16:48:38Z

For sure.

I'll try to get started on this today. It is a fairly sizable change, so it'll take a few days probably, but I think in the end it'll be worth it both for performance and observability.

edaniszewski added enhancement question labels Mar 2, 2020

edaniszewski mentioned this issue Mar 5, 2020

feat: update to plugin reconnect mechanics #372

Merged

edaniszewski closed this as completed Mar 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A new approach for plugin refresh #371

A new approach for plugin refresh #371

edaniszewski commented Mar 2, 2020

lazypower commented Mar 2, 2020

MatthewHink commented Mar 2, 2020

edaniszewski commented Mar 2, 2020

A new approach for plugin refresh #371

A new approach for plugin refresh #371

Comments

edaniszewski commented Mar 2, 2020

lazypower commented Mar 2, 2020

MatthewHink commented Mar 2, 2020

edaniszewski commented Mar 2, 2020