[Ask About Cluster Design] #28
Hi Riandy, yes, that's indeed the case. I have not implemented failover yet, just clustering. Failover is straightforward with one caveat: when a server fails it's usually restarted, so the sessions would be migrated twice. Is failover important for your use case? If so, I can look into it.
Hmm, I don't quite understand this, Gene. Would you please elaborate on the idea of failover in Tinode?
Yes, we're trying to use Tinode to handle a high number of connections. There will be times when we need to turn off a server, and that could take several minutes to complete. Since Tinode uses consistent hashing, while this is happening many users whose topics are bound to that server won't be able to access them. So a failover mechanism is important in our case.
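To make that failure mode concrete, below is a minimal Go sketch of topic-to-node assignment with a consistent-hash ring. The `Ring` type, node names, and hash function are illustrative assumptions, not Tinode's actual implementation; the point is only that topics hashed to a dead node stay unreachable until the ring is rebuilt without it.

```go
// Minimal consistent-hash ring sketch (illustrative, not Tinode's code).
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring maps keys to nodes: a key belongs to the first node whose hash
// is clockwise from the key's hash, wrapping around the ring.
type Ring struct {
	hashes []uint32          // sorted node hashes
	nodes  map[uint32]string // node hash -> node name
}

func NewRing(names ...string) *Ring {
	r := &Ring{nodes: map[uint32]string{}}
	for _, n := range names {
		h := hash32(n)
		r.hashes = append(r.hashes, h)
		r.nodes[h] = n
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NodeFor returns the node that owns the given topic.
func (r *Ring) NodeFor(topic string) string {
	h := hash32(topic)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.hashes[i]]
}

func main() {
	topics := []string{"T1", "T2", "T3", "T4", "T5", "T6"}

	full := NewRing("A", "B", "C")
	for _, t := range topics {
		fmt.Println(t, "->", full.NodeFor(t))
	}

	// Without failover, topics owned by a dead node are simply
	// unreachable. With failover, a new ring is built without the dead
	// node and its topics are rehashed onto the survivors.
	degraded := NewRing("A", "C") // B is down
	for _, t := range topics {
		fmt.Println(t, "->", degraded.NodeFor(t))
	}
}
```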
Suppose there are three servers A, B, C; three user sessions U1, U2, U3; and six topics T1..T6. U1 is connected to A, U2 to B, U3 to C. A handles topics T1 and T2, B handles T3 and T4, C handles T5 and T6.
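A rough sketch of that layout:

```
U1 ---- A [T1, T2]
U2 ---- B [T3, T4]
U3 ---- C [T5, T6]

A ... B ... C
```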
The dotted line ... means in-cluster topic forwarding. Suppose server B goes down.
1(a) U2's connection is lost. U2 must reconnect to another server. Tinode does not provide a facility for that; you need something in front of the cluster, like HAProxy or nginx (see the example config after this list).
2(a) U2's connection is lost, the same as 1(a). Technically it's possible to avoid the disruption in 2(b) and 2(c), but it's a bit tricky.
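For example, a minimal nginx configuration that could sit in front of the cluster and balance WebSocket connections across the three nodes might look like the sketch below. The hostnames are placeholders (6060 is Tinode's usual default port, but verify against your deployment), and HAProxy can be set up equivalently:

```nginx
# Illustrative nginx front for a three-node cluster (placeholder names).
upstream tinode_cluster {
    least_conn;
    server tinode-a:6060;
    server tinode-b:6060;
    server tinode-c:6060;
}

server {
    listen 80;

    location / {
        proxy_pass http://tinode_cluster;
        # WebSocket upgrade headers are required for live connections.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 90s;  # keep idle WebSockets open
    }
}
```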
I understand that you want to use it in production. My question was more about how important it is to avoid any disruption. Suppose you have 5 servers and restart one of them. 20% of your users will have to be reconnected to other servers anyway, with or without failover. Without failover, 20% of topics will become unavailable for the duration of the outage. If the server is down for only a few minutes, that may be an acceptable disruption.
Please take a look: #30
Hello Gene, sorry for the delay; I needed to take on other tasks first. I'll take a look at #30 today.
Hello Gene, I encountered two likely bugs while exploring the failover feature. The first is an endless-rehashing bug and the second is a leadership-competition bug between nodes. Both result in frozen topic interaction on all nodes. Here are the steps I used to produce the endless-rehashing bug:
You can see how I reproduce the bug here: To reproduce the leadership-competition bug, simply start nodes 1, 2, and 3 quickly one after another. You can see how I reproduce the bug here: Let me know your opinion. Thanks
Looking into it. Thanks.
It should work now: 12017b2
Hello Gene, I found another two bugs. The first one is super tricky: sometimes it happens, sometimes it doesn't, but it happens a lot of the time. Suppose the following node-topic bindings exist, with users who have never had any p2p connection with each other (I used new users to ensure this property):
When Node 1 dies and failover occurs, suppose the result is the following:
When User C tries to initiate a p2p connection with User A, it would get The second bug occurred when I deleted a p2p subscription whose topic is bound to another node and then recreated it. For example: User B successfully creates a p2p connection with User C, then User B deletes the subscription, then recreates it. The possible outcomes are the following three cases:
Let me know your opinion. Thanks
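A side note on why p2p topics are sensitive here, assuming the p2p topic name is derived deterministically from the two participants' user IDs (the exact scheme below is a hypothetical illustration, not Tinode's actual one): the topic hashes to a cluster node independent of where either user's session lives, so create/delete/recreate must be forwarded to the hosting node, and stale state there can produce the inconsistent outcomes described above.

```go
// Hypothetical sketch: deriving a p2p topic name from two user IDs.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// p2pName builds a deterministic topic name by sorting the two IDs,
// so both participants derive the same name regardless of who initiates.
func p2pName(uid1, uid2 string) string {
	ids := []string{uid1, uid2}
	sort.Strings(ids)
	return "p2p" + strings.Join(ids, "")
}

func main() {
	topic := p2pName("usrC", "usrA")
	fmt.Println(topic)
	// The node hosting this topic is chosen by hashing the topic name
	// (see the ring sketch above), not by where usrA's or usrC's
	// sessions are connected; operations on it are forwarded in-cluster.
}
```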
Looking into it. Thanks.
I just fixed one bug which probably caused all of the above, specifically line 379:
Looking into it. Thanks.
The first bug is fixed with 0b68257. The second one is clear; I can make it go away, but I want to make it right, so it will probably take me a bit of time to refactor some code.
No problem, Gene. I think it's better for the project (y).
I think it's fixed now: de54966. If you don't mind, please file bugs separately instead of adding to this thread. Thanks!
Hello Gene,
I have a question regarding the current cluster design. I found that a user's topics always stick to a certain cluster node. Why?
In the event of a node failure, this will mean that users whose topics are stuck to that node won't be able to access their data or be contacted by others, right?