[inputs/modbus] Implement retry when slave is busy #7271

srebhan · 2020-04-02T13:01:25Z

Required for all PRs:

Signed CLA.
Associated README.md updated.
Has appropriate unit tests.

Currently, when the slave device is busy, e.g. because it is already serving a request from another client, we get an error and the measurement is lost for that gathering interval. On very busy devices this can lead to many lost measurements.

With this PR you can specify the number of retry attempts (and a wait time between them) to be performed if the device is busy. This gives a higher probability of getting a measurement from a busy device in each query interval.
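A minimal sketch of the retry idea, not the code from this PR: `errDeviceBusy` is a stand-in for the Modbus "server device busy" exception, and `readWithRetry` is a hypothetical helper name. It retries only on the busy error and sleeps before each retry.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDeviceBusy is an assumed stand-in for the Modbus "server device busy" exception.
var errDeviceBusy = errors.New("modbus: server device busy")

// readWithRetry calls read once and then up to `retries` more times,
// sleeping `wait` before each retry. Only busy errors are retried.
func readWithRetry(read func() ([]byte, error), retries int, wait time.Duration) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		if attempt > 0 {
			time.Sleep(wait)
		}
		data, err := read()
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, errDeviceBusy) {
			return nil, err // any other error is returned immediately
		}
		lastErr = err
	}
	return nil, fmt.Errorf("device still busy after %d retries: %w", retries, lastErr)
}

func main() {
	calls := 0
	// Simulated device: busy for the first two requests, then answers.
	read := func() ([]byte, error) {
		calls++
		if calls < 3 {
			return nil, errDeviceBusy
		}
		return []byte{0x00, 0x2a}, nil
	}
	data, err := readWithRetry(read, 3, 10*time.Millisecond)
	fmt.Println(data, err, calls)
}
```

With a retry count of 0 the helper behaves like a single plain read, so the retry stays opt-in.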

ssoroka

Looks pretty good. Will need to decide what to do if the RetriesWaitTime * Retries is too long. See comments.
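To make the concern concrete, here is a small sketch of the arithmetic. The function names are hypothetical, and the check only counts the sleep time between retries, not the duration of the requests themselves.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseRetryWait is the total time spent sleeping if every retry is used.
func worstCaseRetryWait(retries int, wait time.Duration) time.Duration {
	return time.Duration(retries) * wait
}

// fitsInterval reports whether the sleep time alone stays below the interval.
// Request round-trip time is deliberately ignored in this sketch.
func fitsInterval(retries int, wait, interval time.Duration) bool {
	return worstCaseRetryWait(retries, wait) < interval
}

func main() {
	interval := 10 * time.Second
	fmt.Println(worstCaseRetryWait(5, 500*time.Millisecond))      // 2.5s
	fmt.Println(fitsInterval(5, 500*time.Millisecond, interval))  // true
	fmt.Println(fitsInterval(30, 500*time.Millisecond, interval)) // false
}
```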

danielnelson · 2020-04-17T18:49:06Z

We have similar logic in the SNMP plugin, and it has been somewhat problematic. The big issues we have there are:

Total request time can be very hard to predict, since it's not clear how many requests will be made, so it is hard to choose the correct retry settings.
May cause intervals to be skipped, which can be worse than skipping, especially if polling multiple devices (less of a concern here, since this plugin connects to a single controller).
Can make Telegraf unresponsive to shutdown signals, as Telegraf waits for the plugin to complete Gather.
Results often get spread out over the interval more than expected. This is not necessarily an issue, as the timestamp is set correctly, but it is usually not desired and can make aggregation difficult.
Causes increased load on the device.

Some of these issues could be addressed by having a total Gather timeout.
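One way such a total timeout could look, as a sketch only: all retries of a gather cycle share one `context` deadline, so the worst-case duration is known up front. `gatherWithBudget`, `retryUntil` and `errDeviceBusy` are hypothetical names, not part of Telegraf or this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errDeviceBusy is an assumed stand-in for the Modbus "server device busy" exception.
var errDeviceBusy = errors.New("modbus: server device busy")

// retryUntil retries read on busy errors until it succeeds, fails with a
// different error, or ctx is done (deadline reached or cancelled).
func retryUntil(ctx context.Context, read func() error, wait time.Duration) error {
	for {
		err := read()
		if err == nil || !errors.Is(err, errDeviceBusy) {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gather budget exhausted: %w", ctx.Err())
		case <-time.After(wait):
		}
	}
}

// gatherWithBudget bounds all retries of one gather cycle by a total budget.
func gatherWithBudget(read func() error, budget, wait time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return retryUntil(ctx, read, wait)
}

func main() {
	alwaysBusy := func() error { return errDeviceBusy }
	err := gatherWithBudget(alwaysBusy, 50*time.Millisecond, 10*time.Millisecond)
	fmt.Println(err)
}
```

Because the wait is a `select` on the context, cancelling the same context on shutdown would also interrupt a pending retry rather than blocking until it finishes.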

We should consider if it would be enough to ignore ExceptionCodeServerDeviceBusy errors, continuing to request the remaining data instead of the current stop behavior, and leaving "retry" to be handled by Telegraf at the next interval.
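A sketch of this alternative, under the assumption that each request can be read independently: busy requests are skipped and reported, the remaining requests are still read, and any other error still aborts. The `request` type, `gatherSkippingBusy` and `errDeviceBusy` are hypothetical and only illustrate the control flow.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeviceBusy is an assumed stand-in for an ExceptionCodeServerDeviceBusy error.
var errDeviceBusy = errors.New("modbus: server device busy")

// request is a hypothetical named read operation.
type request struct {
	name string
	read func() (uint16, error)
}

// gatherSkippingBusy reads every request. Busy requests are skipped and
// listed in the second return value; any other error stops the loop.
func gatherSkippingBusy(reqs []request) (map[string]uint16, []string, error) {
	values := make(map[string]uint16)
	var skipped []string
	for _, r := range reqs {
		v, err := r.read()
		switch {
		case err == nil:
			values[r.name] = v
		case errors.Is(err, errDeviceBusy):
			skipped = append(skipped, r.name) // left for the next interval
		default:
			return values, skipped, fmt.Errorf("reading %q: %w", r.name, err)
		}
	}
	return values, skipped, nil
}

func main() {
	reqs := []request{
		{name: "voltage", read: func() (uint16, error) { return 230, nil }},
		{name: "power", read: func() (uint16, error) { return 0, errDeviceBusy }},
		{name: "current", read: func() (uint16, error) { return 3, nil }},
	}
	values, skipped, err := gatherSkippingBusy(reqs)
	fmt.Println(values, skipped, err)
}
```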

srebhan · 2020-04-18T17:07:27Z

@danielnelson while I do understand your concerns things here are a bit more complicated. First of all the "stop behavior" only applies for one device and one plugin-process, so influences on other devices are not to be expected. Furthermore, we use the information gathered to do some control-decisions and in this case a missing datum is one of the worst things that can happen. So while a jitter of 1 or 2 seconds is perfectly fine, having a whole missing interval is bad and we cannot reduce the interval too much as otherwise the concurring process will not get access to the device any longer.

So while I agree that the implementation is sub-optimal I think there are two possible solutions:

The plugin gets to know the interval and can take care to finish before the interval has passed. While this is also not optimal, as you need to trust the plugin to do the right thing, you at least allow the plugin to do the right thing. ;-)
The perfect solution is that you allow Gather() to return the wish to retry and handle the things in the agent. This way you can fix the shutdown as well as the missed interval problems...
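The second option could be sketched roughly as follows. Everything here is hypothetical: Telegraf's agent has no such mechanism in this PR, and `retryRequest` and `runInterval` are invented names used only to show the shape of the idea.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryRequest is a hypothetical error a plugin could return from Gather
// to ask the agent for another attempt after a delay.
type retryRequest struct {
	After time.Duration
	Cause error
}

func (r *retryRequest) Error() string { return fmt.Sprintf("retry in %s: %v", r.After, r.Cause) }
func (r *retryRequest) Unwrap() error { return r.Cause }

// runInterval is a hypothetical agent-side loop. It re-invokes gather while
// a retry is requested, the interval is not used up, and no stop was signalled.
func runInterval(gather func() error, interval time.Duration, stop <-chan struct{}) error {
	deadline := time.Now().Add(interval)
	for {
		err := gather()
		var rr *retryRequest
		if !errors.As(err, &rr) {
			return err // nil on success, or a non-retryable error
		}
		if time.Now().Add(rr.After).After(deadline) {
			return fmt.Errorf("interval exhausted: %w", err)
		}
		select {
		case <-stop:
			return errors.New("shutdown requested")
		case <-time.After(rr.After):
		}
	}
}

func main() {
	calls := 0
	gather := func() error {
		calls++
		if calls < 3 {
			return &retryRequest{After: 10 * time.Millisecond, Cause: errors.New("device busy")}
		}
		return nil
	}
	err := runInterval(gather, time.Second, make(chan struct{}))
	fmt.Println(err, calls)
}
```

Since the agent owns the loop, it can enforce the interval and react to shutdown in one place instead of trusting each plugin to do so.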

danielnelson

Okay, we can proceed with this solution. It seems to me that it comes down to how frequently you expect these errors to occur. If you get busy errors every interval, you need retries in the plugin, like this. If it only happens once an hour, it probably wouldn't be the end of the world to miss an interval.

> Furthermore, we use the information gathered to do some control-decisions and in this case a missing datum is one of the worst things that can happen.

I do suggest then that you don't take control decisions if all data is not present, just to be extra safe.

danielnelson · 2020-04-20T22:25:59Z

plugins/inputs/modbus/README.md

+ ## NOTE: Please make sure that the overall retry time (#retries * wait time)
+ ##       is always smaller than the query interval as otherwise you will get
+ ##       an "did not complete within its interval" warning.


Let's remove this note; I don't think it is the right advice. It may actually be desirable to configure the plugin to exceed the interval. This way you are more likely to get a response, and you avoid a situation where the device never fully replies in time.

Hmm, this note was requested earlier by @ssoroka:

> Agh. we might not be exposing that setting to the plugin. May not be possible.. Ok maybe just a note in the settings not to pick a value that would end up retrying longer than your interval.

Could you guys please resolve this conflict and tell me whether I should leave that comment in or not... ;-)

We can resolve this post merge.

srebhan · 2020-04-21T08:31:38Z

> I do suggest then that you don't take control decisions if all data is not present, just to be extra safe.

That's exactly what we do and that's why we need the retries. If another process (that might not be under our control) is reading by chance at the same time, we will never get any data without retrying.

Sven Rebhan added 4 commits April 2, 2020 13:30

Add support for retrying modbus reads if the queried device is busy.

59ff63d

Add tests for modbus retry.

642b943

Add README for modbus retry options.

8b5b6f1

Remove debug message in-order to be compatible to golang 1.12.

cd43cfd

ssoroka suggested changes Apr 9, 2020

plugins/inputs/modbus/modbus.go Outdated

plugins/inputs/modbus/modbus.go Outdated

Apply simplification of the retry logic according to suggestion of ssoroka. Co-Authored-By: Steven Soroka <ssoroka78@gmail.com>

f456b11

ssoroka approved these changes Apr 13, 2020

ssoroka reviewed Apr 13, 2020

plugins/inputs/modbus/modbus.go Outdated

Sven Rebhan added 2 commits April 14, 2020 19:44

Move sleep for modbus retry to the correct location.

fa3b86b

Add note that the overall retry time should not exceed the query interval.

5572249

reimda suggested changes Apr 16, 2020

plugins/inputs/modbus/modbus.go Outdated

Remove check for negative retry-time as requested by ssoroka.

501c45e

srebhan requested a review from reimda April 17, 2020 14:46

danielnelson reviewed Apr 20, 2020

reimda approved these changes Apr 21, 2020

danielnelson added this to the 1.15.0 milestone Apr 21, 2020

danielnelson added the area/modbus and feat labels Apr 21, 2020

danielnelson merged commit 1006c65 into influxdata:master Apr 21, 2020

srebhan deleted the modbus_retry branch April 21, 2020 18:47

idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020

Add retry when slave is busy to modbus input (influxdata#7271)

2ef0338

arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023

Add retry when slave is busy to modbus input (influxdata#7271)

6a08805
