Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packet losses in reliable RPCs #79487

Closed
TheYellowArchitect opened this issue Jul 14, 2023 · 11 comments · Fixed by #80293
Closed

Packet losses in reliable RPCs #79487

TheYellowArchitect opened this issue Jul 14, 2023 · 11 comments · Fixed by #80293

Comments

@TheYellowArchitect
Copy link

TheYellowArchitect commented Jul 14, 2023

Godot version

4.1 stable

System information

Godot v4.1.stable - Artix Linux #1 SMP PREEMPT_DYNAMIC Thu, 03 Nov 2022 21:10:08 +0000 - Vulkan (Forward+) - dedicated NVIDIA GeForce GTX 1050 Ti (nvidia; 520.56.06) - AMD Ryzen 5 2600 Six-Core Processor (12 Threads)

Issue description

I apologize in advance for the wall of text, you will lose 10 minutes of your life to even understand what's going on.
Context: I was making an online fully turn-based tactical card game (no timers)

Over the past months I have noticed some packet losses here and there, on reliable RPCs. This is a critical bug because it causes a desync I cannot catch (or must make a bloated wrapper to detect desyncs and resync, hence wasting months of work). Yet, these packet drops were not rare as they happened on less than 4 minutes of gameplay to get 1 packet loss (1 move desynced ggwp), as displayed in the screen recordings below.

So, the server sends an RPC to a client, but the message never arrives to client and ofc a desync between server and client(s) is achieved. I would love to debug it but the 2 debugging methods (console log printing + watchdog variables) are both bugged and I am also a pleb at wireshark.

Now, I tried to strip down my project heavily to get it to MRP form, I did my best, and focused exclusively on the scenario of the very start of the game, where enemy deck shows his second deck to the other player, where the cards are placed on the appropriate deck slot. (i apologize for the boring context, but its required for the screen recording of the bug)

The annoying thing and why I am writing this wall of text is that this bug's MRP is not deterministic. It just happens once in around 40 use-cases with this MRP, so I doubt anyone will figure out the root of the problem, but with the following screen recordings, its at least proven that packets do get dropped on reliable RPCs.
See start and skip to 2:40
Skip to 3:21
(The print messages shown^ are out of order because of another bug with RPCs)

Ideally, I would have provided an MRP or use-case scenario where I automate RPCs via timer and basically count how many were sent from the client, and arrived to server, but I thought of this too late, and did it all by hand.

Note that I am using localhost, without any packet jumbler of sorts so it makes no sense packets drop. Ofc, I have also tested on a real network via UPNP/Port Forwarding, and packets drop there too.

Steps to reproduce

See both videos above, and also endure that you have to run the below scenario around 40 times to get a visible packet loss. Spam F5/F8.

  1. Debug -> Run 3 Instances
  2. Host with one of the instances and ignore this window
  3. Join with player1
  4. Join with player2

At this step, the old player must have the cards of the second deck of the opponent on the board. If the slot where these cards go is empty, then that's a packet loss as displayed via the print messages.

Minimal reproduction project

somewhat-rare-packet-loss.zip

@Sauermann Sauermann added this to the 4.2 milestone Jul 14, 2023
@TheYellowArchitect
Copy link
Author

cc @Faless

@TheYellowArchitect
Copy link
Author

TheYellowArchitect commented Jul 15, 2023

I am baffled. Today, the uploaded (M)RP seems to have no packet loss for me when I do ~40 reproductions. Anyway, I suspect its the RPC channels conflicting with each other. Because my original project kept having this bug today, and on conversion of all RPCs to 0, I could not replicate this bug. I will playtest on a real network to see if desync happens. I will update if I see packet loss on this MRP again or my original project converted to RPC channels 0.

@Faless
Copy link
Collaborator

Faless commented Jul 16, 2023

Today, the uploaded (M)RP seems to have no packet loss for me when I do ~40 reproductions. Anyway, I suspect its the RPC channels conflicting with each other.

Keep in mind, that reliable RPC order is enforced per channel, i.e. sending reliable message A on channel 1, and reliable message B on channel 2, might result in B being received before A (that's the whole point of channels).

So if your game makes assumptions on them being received in order even when using different channels, you will experience unexpected behaviors.

It's really hard to tell if this is the case given the size of the reproduction project posted, though I can see it uses multiple channels, so this might be something to look into.

It seems really weird that ENet is dropping reliable packets, as that's supposed to cause a disconnection, and I honestly have never experienced something like that.

@TheYellowArchitect
Copy link
Author

TheYellowArchitect commented Jul 18, 2023

Keep in mind, that reliable RPC order is enforced per channel, i.e. sending reliable message A on channel 1, and reliable message B on channel 2, might result in B being received before A (that's the whole point of channels).

So if your game makes assumptions on them being received in order even when using different channels, you will experience unexpected behaviors.

I made the channels like this:
1 -> Core game logic (exclusively reliable RPCs)
2 -> Game chat (exclusively unreliable RPCs)
3 -> Enemy hovering UI (e.g. what card he is looking at, or what tile, exclusively unreliable RPCs)

There are no RPCs at channel 0, because this is a turn-based game and all core game logic requires reliable RPCs, whereas channel 0 is a hybrid of both reliable and unreliable from what I know.

Now, the curious thing is that the project's multiplayer peer has a channel variable with 0 as the default (it doesn't change), yet why did my project work and send RPCs without problems, when no RPCs are at 0? Shouldn't a disconnect happen, or the RPC to not be sent?
By extension, is there a guide on channels? Do I have to set multiple multiplayer peers per player to utilize multiple channels? Most importantly, what is the bandwidth cap for each channel (after all, this is the main reason to use channels - to not clog your reliable RPCs from unreliable RPCs of movement and other stuff), or at least the recommended setup by bandwidth?

That said, the bug shown in the videos above definitely shows a packet drop, so assuming the channels are the cause, they somehow conflict. I even have a MultiplayerSynchronizer somewhere in there, which I recently removed in my original project (whose RPCs I converted all to channel 0), so that may be the culprit too.

I have playtested for 1 hour on my original project on a real network without any packet losses. Before I did the aforementioned changes (RPC channels to 0, and removed MultiplayerSynchronizer under HandCard), packet losses occured around every 5 minutes on a real network. But I wish to do 1 more playtest to confirm that the RPC channels were somehow the cause of the packet losses. By weekend I should achieve this, and confirm

@TheYellowArchitect
Copy link
Author

TheYellowArchitect commented Jul 23, 2023

The (M)RP attached here is my original project stripped down heavily, cleaned off all art, and most systems are gone. Obviously, it had the packet loss bug aforementioned.

By moving all RPC channels to 0 (instead of having 3), it works perfectly, there are no packet losses. I tested my original project (which had packet losses every ~3 minutes) on a real network for an 1 hour session and another 1 hour, so it's safe to say the bug was caused by having 3 channels.
Yet, I still don't understand the roots of this bug.

Theoritically, this issue can be closed, but to be satisfied I would like to learn about the ENet channels, since I want to utilize them, yet I cannot... What is the recommended setup? When you use more than 1 channel, should you spawn 1 more peer? (since the peer has a channel variable which is by default 0)
Also, when you have reliable RPCs and unreliable RPCs, unreliable RPCs will likely clog up the bandwidth since they flood the channel without acks, and it would be unwise for an overflowed channel (basically connection lost there) to stop all reliable RPCs (hence the seperation of channels)
Is there some resource/guide explaining these? For example, when it is wise to use more channels? (which I assume is some bandwidth value per channel)
Also, the packets sent from/to Multiplayer Synchronizer&Spawner, are on which channel?
With the above answered, freely close this issue.


Off-topic: When an RPC function runs, e.g. rpc("function_inside_this_class", first_parameter_value), what does the packet contain? Does it contain a string path of the entire object to call the specific function? If so, that is extremely bloated for some projects, and I would request to at least confirm if the header packets are often more than the packet content, or they are mapped to functions or enumerated somehow

@Faless
Copy link
Collaborator

Faless commented Jul 23, 2023

I would like to learn about the ENet channels

For ENet specifically, refer to the official ENet documenation: http://enet.bespin.org/Features.html

You can read about multiplayer channels in godot in the official documentation: https://docs.godotengine.org/en/latest/classes/class_multiplayerpeer.html#class-multiplayerpeer-property-transfer-channel

In general, as stated in the Godot documentation, the default channel (0) is transparently handled as 3 different channels, depending on the transfer mode (unreliable/reliable/unreliable_ordered).

Also, the packets sent from/to Multiplayer Synchronizer&Spawner, are on which channel?

They are all on the default channel (0).
I'm hesitant to add an option for it, as the chance of getting the wrong ordering is high if people start messing with channels without understanding them. But we can discuss this in a proposal I think.

When an RPC function runs, e.g. rpc("function_inside_this_class", first_parameter_value), what does the packet contain? Does it contain a string path of the entire object to call the specific function?

No, it contains an ID, up to 16 bits. This is the reason why both clients and server must define the same RPCs.

@TheYellowArchitect
Copy link
Author

TheYellowArchitect commented Jul 24, 2023

@Faless

This is the World node (it exists in the uploaded MRP of the OP)
peer
Once I start the scene, and click on the remote tab, even though the peer variable does not have the @export annotation, it shows on the node (while a bug, I find it handy/useful)
peer2
I click the peer and it says RPC 0 (not true) and unreliable (probably not true) ((if it's true, that's a worse bug than the OP))
peer3

Some feedback on the above. On the inspector, the peer should have sub-properties on the rpc channels, kinda like an array or dictionary has items within. So you can have multiple channels, instead of showing one hardcoded (idk if its just an inspector bug or also in the ENet binder)
So each item has a channel int and a dropdown. Though for channel 0 might as well select all of them?
The above suggestion of seperation of channels in a singular peer applies if you need only 1 multiplayer peer for any amount of channels. Even now, I still don't know if I need to have more than 1 peer per player to have more than 1 channel.

As for feedback related to OP: My RPCs using channels 1,2,3 should not have been sent at all, and there should be a warning at least. I genuinely have no idea how they even sent (perhaps the existing multiplayer spawner/synchronizer which uses channel 0 somehow affected them? idk but I can make an MRP if its really a mystery)

For ENet specifically, refer to the official ENet documenation: http://enet.bespin.org/Features.html

I am aware of it, but that is just function API, there is no best practices or anything. For example, on the link you refer https://docs.godotengine.org/en/latest/classes/class_multiplayerpeer.html#class-multiplayerpeer-property-transfer-channel , me and anyone reading it, is instantly convinced to spread messages to multiple channels (like I did in the OP)
Yet, on the text Refer to the specific network API documentation (e.g. ENet or WebRTC) to learn how to set up channels correctly. there is no hyperlink, and on what I did find http://enet.bespin.org/structENetChannel.html there is no info on how to set up channels correctly, just barebones API docs (showing function name, parameter names and their types), (with the only exception in that website being http://enet.bespin.org/Tutorial.html)
For example, what stops me from opening 5 channels? I believe that is a bad practice, but I cannot find any info on RPC channels. Likewise on "should I have more than 1 peer per player, in order to support multiple channels?"
And surely I believe it must have been hard work to decipher these yourself and bind them to Godot, so I ask if you had stumbled upon any resources like a forum or videos? Anything helps

They are all on the default channel (0).
I'm hesitant to add an option for it, as the chance of getting the wrong ordering is high if people start messing with channels without understanding them. But we can discuss this in a proposal I think.

I don't think they should change either. 0 is good for that. Very niche use-case in a real game, in an already complex system. If someone gets one synchronizer on channel 0, and another on channel 1, who can guarantee bugs will not come up? Especially given channels 0+ support exclusively 1 ordering type (e.g. unreliable) which means lots more development and polishing for an extremely niche use-case. And that is aside of the misc UI work to alert/notify the developer accordingly. If you feel like it, open a proposal but I think that while the high-level multiplayer docs are yet unfinished, it would be wasteful to put effort into such a niche use-case.

No, it contains an ID, up to 16 bits. This is the reason why both clients and server must define the same RPCs.

Is there any documentation or link I can read on the generation of this ID? For example, does this ID refer to the node or the function or a combination of both? Especially given RPCs are based on a nodepath which can be made realtime, this does seem interesting (e.g. if you select a single node which is at the root of all your scenes, and only that receives RPCs, could it be optimized? Alternatively, when nodes move constantly on the tree hierarchy, how does this ID still preserve itself within 16 bits?)

@Mathis-Z
Copy link

Mathis-Z commented Aug 3, 2023

I ran into the same issue and created a minimal reproduction project: Demo.zip

There are two shell scripts in the project folder that you can use on linux to simulate certain network conditions. I set it to 10% packet loss. This way the issue is always reproducable.

@TheYellowArchitect
Copy link
Author

TheYellowArchitect commented Aug 4, 2023

Outstanding. You made an actual MRP, deterministic and all, out of nowhere. And it's simple and clean. I didn't need to use either shell scripts. I expected it to be a lot harder to automate this bug reproduction (I would have tried via timestamp lol), hence I kept delaying it. And if Faless answered me about RPC channels, I would have closed this issue, as I would think the packet loss is a result of bugged channels...

With the above MRP, the tag "Needs Testing" is no longer true.

@Mathis-Z
Copy link

Mathis-Z commented Aug 4, 2023

An interesting observation I made with my MRP is that

  • @rpc("any_peer", "call_local", "reliable", 0) produces no error (as expected)
  • @rpc("any_peer", "call_local", "unreliable", 0) produces both missed rpcs and rpcs in changed order (as expected)
  • @rpc("any_peer", "call_local", "unreliable_ordered", 0) produces missed rpcs but no rpcs in incorrect order (as expected)
  • @rpc("any_peer", "call_local", "reliable", 1) produces missed rpcs (topic of this issue) but no rpcs in incorrect order (weird)
  • @rpc("any_peer", "call_local", "unreliable", 1) produces missed rpcs but also no rpcs in incorrect order (very weird)
  • @rpc("any_peer", "call_local", "unreliable_ordered", 1) produces missed rpcs but no rpcs in incorrect order (as expected)

So basically, when I use channel 1 all RPCs behave "unreliable_ordered" no matter what transport mode I actually specified.

@Faless
Copy link
Collaborator

Faless commented Aug 5, 2023

@Mathis-Z thank you for making the MRP, I've opened #80293 which fixes the issue.

The transfer flags were not being set when using custom channels, resulting in the default ENet behavior (unreliable ordered).

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants