str2str doesn't always reconnect to the caster when its ip address changes #166
Comments
I should get a chance to look into this sometime in the next week or two, but I don't have much expertise in IP network communications. If anyone else has a chance to dig into this and either provide more information or, even better, a pull request with a code fix, that would be very helpful.
Just an idea: I've never created an issue about this because the problem could be outside of RTKLIB, but from time to time, a str2str instance with serial input and local TCP output doesn't work correctly. In these cases, there is no str2str output on the terminal or in the log file created with the options
As long as the socket is open, you do not need to open a new one, hence you do not need to resolve the name.
This is indeed the behavior I observe, but why would it be a problem to redo a DNS request each time the TTL expires? The stream will not be cut, since in 99.9999% of cases the IP will be the same. However, it is the breakdown, the special case, the loss of remote administration, that can force us to abandon the "MASTER" caster. To do this, we must be able to "notify" all clients of the IP address change in the DNS without having to cut the stream. Without that, we can get a "split brain" (a term used in DRBD): we end up with clients that never saw the new IP address and are still on the old MASTER. Is that clearer?
A good rule of thumb is to never do anything that is not needed; it will cause problems one way or another. Don't forget that if you are planning for a scenario where you have lost your server AND the network layer (FW, LB, WAF if any, ...), you should assume you have also lost your DNS.
I am not very familiar with this part of the RTKLIB code (I've focused more on the GNSS algorithm side of things), but as far as I can tell, RTKLIB is not explicitly trying to resolve the DNS address itself and is relying on the calls to the operating system to do this. This might explain why some systems are able to resolve the change and others are not. I am open to implementing a solution if anyone has a specific suggestion but otherwise I don't believe I can resolve this on my own.
I think the easiest way to handle this is to resolve the name via DNS again whenever the connection drops. Note the available IPs and compare them to the previously connected IP. If there is one available that isn't the one you're on, fail over to that new IP. I think it would also be smart, on initial connection, to add some sort of latency check for all the IPs that a DNS name resolves to, and to choose the one with the lowest latency by default. Otherwise, you've basically only got random and round-robin as alternatives for selecting IPs. The latency check would also be good for determining whether a server is down. But not all hosts support this, so there would need to be an option to disable it as well.
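The re-resolve-and-compare step described above could look roughly like this. It is only a sketch in Python (not RTKLIB's C), and `resolve_ips` and `pick_failover_ip` are hypothetical helper names, not existing RTKLIB functions. The latency check is left out.

```python
import socket


def resolve_ips(host, port):
    """Ask the OS resolver for the current addresses of host (no caching here)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    ips = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in ips:  # keep resolver order, drop duplicates
            ips.append(ip)
    return ips


def pick_failover_ip(resolved_ips, current_ip):
    """Choose the address to reconnect to after a connection drop.

    Prefer the first resolved address that differs from the one we were
    connected to. If the name still resolves only to the current address,
    retry it. Return None if nothing resolved.
    """
    for ip in resolved_ips:
        if ip != current_ip:
            return ip
    return current_ip if current_ip in resolved_ips else None
```

On a connection drop, the caller would run `pick_failover_ip(resolve_ips(host, port), current_ip)` and reconnect to the result. Whether the OS resolver returns fresh data depends on local caching, which this sketch does not control.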
I tracked the issue down a little further. When I add latency/packet errors between str2str and the caster, it starts to show something similar to what we had in December. I think the main cause is here: Lines 1088 to 1089 in d0b5993
Now, I do not know exactly what happened in December. Were some select calls OK? I do not know. Do some select calls fail on an ordinary day? I do not know. So what would be the trigger to decide to kill the connection there? X failures in a row? A failure rate over a short duration?
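To make the two candidate triggers concrete, here is a small sketch that combines them: drop the connection after N failures in a row, or after too many failures inside a sliding time window. It is written in Python for brevity, the class name `FailureTrigger` is hypothetical, and the default thresholds are arbitrary placeholders rather than tuned values.

```python
from collections import deque


class FailureTrigger:
    """Decide when repeated select/recv failures justify closing the socket."""

    def __init__(self, max_consecutive=5, window_s=30.0, max_in_window=10):
        self.max_consecutive = max_consecutive  # failures in a row
        self.window_s = window_s                # sliding window length, seconds
        self.max_in_window = max_in_window      # failures allowed in the window
        self.consecutive = 0
        self.recent = deque()                   # timestamps of recent failures

    def record(self, ok, now):
        """Record one outcome at time `now`; return True if we should reconnect."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        self.recent.append(now)
        # Forget failures that have fallen out of the window.
        while self.recent and now - self.recent[0] > self.window_s:
            self.recent.popleft()
        return (self.consecutive >= self.max_consecutive
                or len(self.recent) >= self.max_in_window)
```

The consecutive counter catches a dead link quickly, while the window catches a link that is flaky but occasionally succeeds. Which of the two matches the December incident is exactly the open question above.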
Hi !
Last night, the Centipede caster, which received the signal from about 500 base stations, was disconnected (security problem in the datacenter). A backup server was started with another caster instance, and the DNS entry was updated.
Most base stations run RTKBase, which uses str2str. (Big! big! thank you to @tomojitakasu, @rtklibexplorer et al. for this tool.)
Only half of the base stations have reconnected to the new caster on their own; the others, including mine, are still trying to send the RTCM stream to the IP address that isn't working. str2str outputs messages like these:
If I ping caster.centipede.fr from the base station, the correct IP address is returned, so the DNS propagation is OK.
Even more strange: there is a base station with 2 GNSS receivers (F9P + Mosaic X5) and one str2str instance for each receiver, which send the RTCM streams to 2 mount points on the same caster. One str2str instance has switched to the new caster, but not the other one.
It's not the first time I've noticed this problem with str2str.
As a workaround, I could write a tool to parse the str2str output and restart it in case of too many 'recv error' messages, but I think it would be better to update str2str to better manage this problem.
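For reference, the workaround could be as small as the sketch below. It assumes the error lines contain the text 'recv error' as quoted above; the exact str2str message format is not checked here, and `should_restart` and `supervise` are hypothetical names. Restarting is expected to help only because a fresh process has to open a new connection to the caster.

```python
import subprocess


def should_restart(lines, pattern="recv error", threshold=5):
    """Return True once `threshold` lines in a row contain `pattern`."""
    streak = 0
    for line in lines:
        streak = streak + 1 if pattern in line else 0
        if streak >= threshold:
            return True
    return False


def supervise(cmd, pattern="recv error", threshold=5):
    """Run cmd and restart it whenever the error streak reaches threshold."""
    while True:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        # Reads the child's output line by line until it ends or the streak hits.
        restart = should_restart(proc.stdout, pattern, threshold)
        proc.terminate()  # no effect if the child has already exited
        proc.wait()
        if not restart:
            break  # the child exited on its own; leave that to the caller
```

`supervise` takes the usual str2str command line as a list of arguments. Any other output line between two errors resets the streak, so the threshold may need adjusting to what str2str actually prints.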