[inputs/modbus] Implement retry when slave is busy #7271
Looks pretty good. Will need to decide what to do if the RetriesWaitTime * Retries is too long. See comments.
…oroka. Co-Authored-By: Steven Soroka <ssoroka78@gmail.com>
We have similar logic in the SNMP plugin, and it has been somewhat problematic. The big issues we have there are:
Some of these issues could be addressed by having a total Gather timeout. We should consider if it would be enough to ignore |
@danielnelson while I do understand your concerns, things here are a bit more complicated. First of all, the "stop behavior" only applies to one device and one plugin process, so other devices should not be affected. Furthermore, we use the information gathered to do some control-decisions and in this case a missing datum is one of the worst things that can happen. So while a jitter of 1 or 2 seconds is perfectly fine, missing a whole interval is bad, and we cannot reduce the interval too much, as otherwise the competing process will no longer get access to the device. So while I agree that the implementation is sub-optimal, I think there are two possible solutions:
Okay, we can proceed with this solution. It seems to me that it comes down to how frequently you expect these errors to occur. If you get busy errors every interval, you need retries in the plugin like this. If it only happens once an hour, it probably wouldn't be the end of the world to miss an interval.
Furthermore, we use the information gathered to do some control-decisions and in this case a missing datum is one of the worst things that can happen.
I do suggest then that you don't make control decisions unless all data is present, just to be extra safe.
## NOTE: Please make sure that the overall retry time (#retries * wait time)
##       is always smaller than the query interval as otherwise you will get
##       an "did not complete within its interval" warning.
Let's remove this note; I don't think it is the right advice. It may actually be desirable to configure the plugin to exceed the interval. This way you are more likely to get a response and avoid a situation where the device never fully replies in time.
Hmm, this note was requested earlier by @ssoroka:
Agh. we might not be exposing that setting to the plugin. May not be possible.. Ok maybe just a note in the settings not to pick a value that would end up retrying longer than your interval.
Could you guys please resolve this conflict and tell me whether I should leave that comment in or not... ;-)
We can resolve this post merge.
That's exactly what we do and that's why we need the retries. If another process (that might not be under our control) is reading by chance at the same time, we will never get any data without retrying.
Currently, when the slave device is busy, e.g. because it already serves a request of another client, we get an error and the measurement is lost for the gathering interval. This potentially leads to many lost measurements for very busy devices.
With this PR you can specify the number of retry attempts (and a wait time between them) to be performed if the device is busy. This increases the probability of getting a measurement from busy devices in each query interval.