Significant delay or missing messages on 4G in community (without re-login or app closure) #21172
Comments
I've been looking into this issue and I thought I should share some notes on what I've been seeing.
Summary
Logs
Notable logs that happen in sequence
Here's a snippet of the kind of logs that I'm seeing in the app geth.log
Log Snippet
|
updating after some analysis:
it would help to know which messages were missed. if you have any messageID, contentTopic, time and date, i can look in more detail at what exactly is happening. |
Hey @chaitanyaprem thanks for the analysis, great stuff 🙌
I've attempted to reproduce this glitch again, but this time I reproduced it using an iOS simulator to send messages to another iOS device. This was done because it was easier for me to access the logs of the simulator and I was having some issues running the dev build of the desktop app. If it's important to reproduce this glitch lmk and I can try again. That being said, it did take me a while to finally capture the same glitch, but I was able to get a copy of the logs from the sending device and the receiver device so we can compare them. More details here: #21172 (comment)
To respond to your question about the disconnecting peers when backgrounding the app:
Supposedly when we background the application, the operating system may "idle" it while it waits in the background. This could have an effect on network activity or network resources, but I cannot say for certain. Perhaps @ilmotta or @igor-sirotin may have some more context about this behaviour.
What I've been able to notice, though, is that this issue seems to mainly occur when the app returns to the foreground (the app was reopened) and the network then reconnects. So potentially on a low or missing data connection (mobile LTE) the app may go through a connection state change. When that happens, perhaps there's some logic that handles resources based on whether the connection is offline or online, but I'm not sure, so I'd need to do some more digging through the code to find logic related to connection state. I do know that there is logic for checking the connection state when the app goes from the background to the foreground, but I'm not sure if there's any logic for re-examining the connection state when the app is already running. This seems to be an important detail since I'm mainly able to reproduce this issue when reconnecting to the network after opening the app.
To respond to your question about missing messages:
I do notice from the logs that we are attempting to retrieve missing messages, and it seems we've configured status-go to retrieve these missing messages, but I haven't found any conditions for skipping the retrieval based on the connection state atm. I'll need to look deeper into the codebase to verify what's happening, but I think we're set up to retrieve missing messages. |
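As a rough illustration of the open question above (re-examining the connection state while the app is already running, not only on background to foreground), here is a minimal Go sketch. `ConnTracker`, `NewConnTracker`, and `Update` are hypothetical names and are not part of status-go; the callback stands in for whatever would trigger a missing-message check after coming back online.

```go
package main

import (
	"fmt"
	"sync"
)

// ConnTracker is a hypothetical helper (not part of status-go) that remembers
// the last known connectivity state and detects offline -> online transitions.
type ConnTracker struct {
	mu          sync.Mutex
	online      bool
	onReconnect func()
}

// NewConnTracker starts in the given state; onReconnect runs on every
// offline -> online change.
func NewConnTracker(initiallyOnline bool, onReconnect func()) *ConnTracker {
	return &ConnTracker{online: initiallyOnline, onReconnect: onReconnect}
}

// Update records the new state and returns true if it was an
// offline -> online transition (the callback is invoked in that case).
func (c *ConnTracker) Update(online bool) bool {
	c.mu.Lock()
	reconnected := !c.online && online
	c.online = online
	cb := c.onReconnect
	c.mu.Unlock()

	// Run the callback outside the lock so it may call back into the tracker.
	if reconnected && cb != nil {
		cb()
	}
	return reconnected
}

func main() {
	fetches := 0
	tracker := NewConnTracker(true, func() { fetches++ })
	tracker.Update(false) // network lost
	tracker.Update(false) // repeated offline report, no transition
	tracker.Update(true)  // network restored
	fmt.Println("missing-message checks triggered:", fetches)
}
```

The point of the sketch is only that the check is driven by the state transition itself, so it fires whether the change happens on foregrounding or while the app is already running.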
Here's some updated logs and context from reproducing this glitch:
Logs from the device: ios-device-logs.zip Unfortunately I was only able to extract the |
Some follow-up notes:
|
@seanstrom FYI, could be related: status-im/status-go#5659 |
Thanks a lot for detailed tests and analysis, it helped me look at specific areas in the logs.
this could be a reason why
while going through the logs, i noticed that this could happen because filter subscriptions are not successful due to peers that are not reachable and take time to stabilize. will analyze further to see if there is some other issue. but in the meantime i have added some simple logic to remove bad peers if dialing them fails twice, and included it in #21458. can you try to simulate the issue with this version?
this is odd, wondering what might be the reason for this content-topic not being chosen for missing messages. maybe @richard-ramos has an idea. |
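The "remove bad peers if dial fails twice" idea mentioned above could look roughly like the following Go sketch. This is not the code from #21458; `DialTracker` and its methods are hypothetical, and counting only consecutive failures (resetting on success) is an assumption made for the example.

```go
package main

import "fmt"

// maxDialFailures mirrors the "dial fails twice" rule from the comment.
// The name and the consecutive-failure semantics are assumptions.
const maxDialFailures = 2

// DialTracker counts consecutive dial failures per peer ID.
type DialTracker struct {
	failures map[string]int
}

// NewDialTracker returns an empty tracker.
func NewDialTracker() *DialTracker {
	return &DialTracker{failures: make(map[string]int)}
}

// RecordFailure notes a failed dial and returns true once the peer has
// reached the limit and should be dropped from the candidate set.
func (d *DialTracker) RecordFailure(peerID string) bool {
	d.failures[peerID]++
	if d.failures[peerID] >= maxDialFailures {
		delete(d.failures, peerID) // forget the peer once it is evicted
		return true
	}
	return false
}

// RecordSuccess resets the counter so that only consecutive failures count.
func (d *DialTracker) RecordSuccess(peerID string) {
	delete(d.failures, peerID)
}

func main() {
	tracker := NewDialTracker()
	fmt.Println(tracker.RecordFailure("peerA")) // first failure: keep the peer
	fmt.Println(tracker.RecordFailure("peerA")) // second failure: evict the peer
}
```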
@chaitanyaprem is the filter for this content topic ephemeral? those are ignored for retrieving missing messages: https://github.com/status-im/status-go/blob/e611b1e5131fff706151661d15cc2904c20a7f71/protocol/transport/transport.go#L699C19-L699C24 |
ah, that might make sense...i have no idea though. |
It seems that a contentTopic filter is marked as Ephemeral here: https://github.com/status-im/status-go/blob/3179532b645549c103266e007694d2c81a7091b4/protocol/messenger_store_node_request_manager.go#L284 I've checked around and added some debug logs for a contentTopic being Ephemeral, and in this case it doesn't seem that the contentTopic for the community channel is considered ephemeral. |
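To make the ephemeral-filter behaviour discussed above concrete, here is a small Go sketch of skipping ephemeral filters when collecting content topics for the missing-messages check. The `Filter` struct is a simplified, hypothetical stand-in (the real status-go type has more fields), `topicsForMissingMessages` is an invented name, and the content topic strings are placeholders.

```go
package main

import "fmt"

// Filter is a simplified, hypothetical stand-in for a transport filter.
type Filter struct {
	ContentTopic string
	Ephemeral    bool
}

// topicsForMissingMessages returns the content topics that would be checked
// for missing messages, skipping any filter marked as ephemeral.
func topicsForMissingMessages(filters []Filter) []string {
	topics := make([]string, 0, len(filters))
	for _, f := range filters {
		if f.Ephemeral {
			continue // ephemeral filters are ignored for this check
		}
		topics = append(topics, f.ContentTopic)
	}
	return topics
}

func main() {
	filters := []Filter{
		{ContentTopic: "topic-community-channel", Ephemeral: false},
		{ContentTopic: "topic-ephemeral", Ephemeral: true},
	}
	fmt.Println(topicsForMissingMessages(filters))
}
```

If a community channel's filter were wrongly flagged as ephemeral, its topic would silently drop out of this list, which is why the flag was worth ruling out.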
I've reproduced the issue again, and this time the logs will focus on:
So far it seems like there are some issues with reconnecting to the peer after the network connection is changed from offline to online.
From the ios-device logs (the receiver device)
|
oh, that is interesting. |
I think I found the cause although I'm working on 21452 / 21394, but they might be related, I'll create the status-go PR later @seanstrom @chaitanyaprem |
I don't think this fixes the missing messages as that is caused due to an issue explained here. Also the root-cause for the issue is not peer connectivity, because store nodes are pre-configured and missing messages check will try to connect to them if we are not connected to the nodes. |
what I found is not the root cause, but might be related. I tried your draft PR, it does reduce the delay for re-sending messages once the user is back online. but it won't fix 21452 / 21394
what if the user logs in without the network enabled and enables it after login? from my debugging, it doesn't seem to work the way you mentioned. @chaitanyaprem |
hmm, interesting..do you have logs where you tested this in scenarios 21452 and 21394? would like to take a look to see what is actually happening.
this is indeed an interesting scenario, but i am surprised that store nodes were not connected automatically. |
stop depending only on logs, start debugging 🚀 status-go |
Sorry, the link I provided is outdated, this should be the latest. @chaitanyaprem |
@seanstrom could you validate my changes and see if the issue still occurs? I would like these improvements to go in as well as they help in various other scenarios. |
@qfrank thank you for these status-go changes: status-im/status-go#6153 🙏
I had some time today to test those changes, and I was still able to reproduce the issue with the missing messages not being retrieved when receiving a message after coming back online (while having been offline for over a few minutes). Your changes could definitely solve some issues with re-connecting to peers after coming back online, though from my testing peers are re-connecting (on @chaitanyaprem branch) after a small window of time (roughly less than a minute). But based on the logs, there are some issues with store node requests failing because deadlines have been exceeded.
Atm, I'm looking at this code: https://github.com/status-im/status-go/blob/18469e98e67fc82a3f5da078f6aae9e35d0d9f9e/vendor/github.com/waku-org/go-waku/waku/v2/api/missing/missing_messages.go#L190-L211 I think this might be where I would attempt to debug next, but I haven't started that process yet. I'm wondering if there are some weird errors there with how we're passing
I had collected the logs from @qfrank branch, but it seems my logs had rolled over and became inconsistent, so I'll re-collect them soon. But for now, here are my recent logs from reproducing the issue on @chaitanyaprem branch. In @chaitanyaprem branch I was still able to reproduce the issue by sending a message to a device that has very recently come back online after being offline for over a few minutes. In this case the missing message-id: receiver-device-dec-2-2024.zip |
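For readers following the "deadlines exceeded" observation above, here is a minimal Go sketch of how a request bounded by `context.WithTimeout` ends up with `context.DeadlineExceeded`, and how that can be told apart from an explicit cancellation with `errors.Is`. `queryStore`, `classify`, and `runQuery` are hypothetical helpers that only simulate a slow store node; this is not the go-waku code linked above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryStore is a hypothetical stand-in for a store node request; it only
// simulates a request that needs `latency` to complete.
func queryStore(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// classify distinguishes a timeout from an explicit cancellation.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.DeadlineExceeded):
		return "deadline exceeded"
	case errors.Is(err, context.Canceled):
		return "canceled"
	default:
		return "other error"
	}
}

// runQuery issues the simulated request with the given timeout and
// reports the outcome as a string.
func runQuery(timeout, latency time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return classify(queryStore(ctx, latency))
}

func main() {
	// The simulated store node is slower than the timeout allows.
	fmt.Println(runQuery(20*time.Millisecond, 200*time.Millisecond))
	// The simulated store node answers well within the timeout.
	fmt.Println(runQuery(200*time.Millisecond, 20*time.Millisecond))
}
```

Logging which of the two cases occurred is one way to separate slow or unreachable store nodes from requests that were cancelled by the caller.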
Could you probably get me logs that are not rolling over? Would like to see what is happening since you start the test. |
@chaitanyaprem ah sorry about that, I re-ran the tests on my side and extracted some fresh logs. Let me know if these are better 🙏 In this case, missing-message id: receiver-device-logs-v3-dec-2-2024.zip |
I ran the code locally with status-desktop and realized that i needed to add more logic to the missing-message-verifier code for restart to make it work properly. I did that and pushed a few fixes into the same PR now. Do note that some more go-waku changes got pulled in now, but those should not affect these flows. @seanstrom Can you please retest? I am hoping it will work now 🤞 @qfrank can you test your scenario with this new build, especially the missing_messages related issue? I think i have figured out why it did not work. |
If the scenario you meant was keep offline before login, and enable network after login, I just tried with your PR, it didn't fix the issues 21452 / 21394 @chaitanyaprem you can see that messages but PR fixed this kind of issues |
Can you share the logs? Odd that missing messages after coming back from offline did not work. Also hope you have tested with my latest pushed version. |
I think we can use PR as a quick fix if we get good news from mobile QA? @chaitanyaprem |
Yup, I checked out your branch |
Emm.. I guess maybe it's because |
if you are talking about |
Oh, sorry, I checked out your branch and then did a rebase to include one of my commit which fixed one issue of backend server, so the latest commit was not the same with |
will take a look at them..can you also help by indicating the messageID of the message that was not received? |
for message `90`: 0x0a1b88182c29157a4b56c1f2e7d4b40daf1eafc07e11da55b2cab3220005cf8f |
…restored from disabled state (#6153) Fix: - status-im/status-mobile#21452 - status-im/status-mobile#21394 Might also fix part (missing messages) of status-im/status-mobile#21172 Related mobile PR status-im/status-mobile#21730
Okie dokie, I've re-run the tests using @chaitanyaprem latest changes (pulled the branch today), and those changes were also rebased on top of the latest develop since we've recently merged status-im/status-go#6153
During my testing, I noticed that missing-messages retrieval would usually work: instant messaging between the clients would re-connect in less than a minute, and missing messages would be fetched in roughly a minute. I would say that the system seems much more robust than where we started.
However, I was still able to reproduce the issue with missing messages not being retrieved after toggling the network connection multiple times on the receiver device. I had to try multiple times to get this result, so it feels harder to reproduce but still possible.
That being said, here are the logs from my tests and some useful bits of info for debugging: contentTopic: Snippet from the receiver logs:
|
One thing I noticed in the sender logs is that it's having issues with making store node requests too:
|
That is really good to hear.
Went through the logs and noticed that |
@seanstrom i added some logs to help identify why store query context is getting cancelled. I had also rebased with latest develop branch in status-go. can you rerun the test with updated PR and share the logs? |
@seanstrom i found another issue and pushed a fix in status-go which was causing some of the logs which i had noticed before. Can you please retest with latest pushes and share the results? |
Follow up: #20730
Problem
Users experience delayed or missing messages in a community on both iOS and Android devices when reconnecting from offline to online (4G).
On iOS: There is a noticeable delay in receiving messages when switching from offline to online (4G).
On Android: Messages either do not arrive at all (waited up to 10 minutes) or only a portion of the messages is received immediately. In 6 tests conducted: 4 cases of missing messages and 2 cases of partial delivery.
Reproduction
Expected behavior
Messages should be received promptly after the user goes back online, even if the app was in the background.
Actual behavior
iOS: Messages are delayed upon reconnecting to the 4G network after being offline.
Android: Either no messages are received or only a subset of messages arrives instantly. The issue was observed in 4 out of 6 tests.
Additional Information
Comments:
Examples of tests on Android (4G enabled):
13:53 - device is offline
13:55 - device returns online
13:55 - 10 messages are sent from the desktop app
14:05 - no messages have arrived in the mobile app
14:06 - device is offline
14:07 - device returns online
14:09 - 10 messages are sent from the desktop app
14:09 - 5 out of 10 messages are received
14:21 - the remaining 5 messages still haven't arrived
Android:
Mobile log:
logcat.zip
Desktop log:
geth.zip
iOS:
screencast.2024-09-04.12-51-23.mp4
Mobile log:
logs.zip
Desktop log:
geth.log