fix: grpc timeout segment data loss #116
Conversation
From my understanding, the timeout should cause the segments to be abandoned while keeping the agent's memory cost as low as possible.
Well, there is already a mechanism to reject segments when the queue is full: in agent/__init__.py, archive() explicitly checks the queue and rejects the segment if it is full. All I am trying to do is, when the queue is not full, not lose the segments that have already made it into the queue.
OK, this is another case, but also a tricky case in a production environment.
If you want the new segments to take precedence, then that could look like this:

```python
def archive(segment: 'Segment'):
    if __queue.full():
        __queue.get()
        logger.warning('the queue is full, the oldest segment was abandoned')

    __queue.put(segment)
```

The benefit here is that if you are not running the queue full, then you will not lose segments at all.
I know we could dequeue, but still, I highly recommend you take a look at the Satellite project. It provides a much better way to resolve this case by leveraging the file system.
Well, we will probably eventually be looking at that, but for now it is overkill for the small problem I am tasked with, which is just not to drop sporadic segment data on silly timeouts. If you don't want this fix in your codebase it's no problem, I will just wind up doing it in our own fork. So do you want me to add the dequeueing and keep this, or should I just do it on our own side?

Apart from that, like I said, I still don't understand why the grpc connection stopped timing out. The old code would time out constantly regardless of the heartbeat, and this code works more like I expect was the intention, which is that it does not actually time out as long as the server is up. It would be good to have an explanation for this, since even if I don't add this fix to your codebase you could limit data losses just by fixing the constant timeout problem.
I will leave the decision to @kezhenxu94 and you. As long as the memory cost stays limited by the queue size, I am fine with either re-pushing the segment into the queue if there is an available slot, or expiring the oldest data in the queue.
OK, well I am also curious about @kezhenxu94's opinion. Specifically, I haven't looked at how this gRPC works under the hood, so maybe he can tell me if this new code isn't doing something incredibly stupid like creating a new connection for each segment...
@tom-pytel I'm +1 to put the segments back into the queue when it is not full and abandon the oldest segments if it's full. As for the timeout issue, I had a quick glance at the code and my guess (I don't have too much time ATM to look thoroughly and test myself) is that the reason may be the changes from
On the other hand, it's not that different whether we abandon the oldest segments or the latest segments when the queue is full; maybe abandon the latest segments so that we don't need to
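To make the two options discussed above concrete, here is a minimal sketch of both behaviours, assuming the agent's bounded `__queue` is a standard `queue.Queue` and `logger` is the agent logger; the `resend` helper and the queue size are illustrative, not the agent's actual API.

```python
import logging
import queue

logger = logging.getLogger(__name__)
__queue = queue.Queue(maxsize=10000)  # assumed bounded segment queue


def archive(segment):
    """Enqueue a new segment; if the queue is full, abandon the oldest one."""
    if __queue.full():
        try:
            __queue.get_nowait()  # make room by dropping the oldest segment
            logger.warning('the queue is full, the oldest segment was abandoned')
        except queue.Empty:
            pass
    __queue.put(segment)


def resend(segment):
    """Put a segment that failed to send back into the queue, but only if
    there is a free slot, so the agent's memory cost stays bounded."""
    if not __queue.full():
        __queue.put(segment)
    else:
        logger.warning('the queue is full, the failed segment was abandoned')
```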
OK, just did a quick experiment, and it seems to be the case. I intentionally changed to a shorter timeout: the code before this PR throws timeout exceptions constantly, while the code after this PR doesn't.

```diff
diff --git a/skywalking/client/grpc.py b/skywalking/client/grpc.py
index 734d3bb..81de589 100644
--- a/skywalking/client/grpc.py
+++ b/skywalking/client/grpc.py
@@ -55,4 +55,4 @@ class GrpcTraceSegmentReportService(TraceSegmentReportService):
         self.report_stub = TraceSegmentReportServiceStub(channel)

     def report(self, generator):
-        self.report_stub.collect(generator)
+        self.report_stub.collect(generator, timeout=3)
```
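For reference, the timeout surfaces on the client side as a `grpc.RpcError` with status `DEADLINE_EXCEEDED`, so a caller can tell a deadline failure apart from other errors and decide whether to re-queue the unsent segments. This is a generic gRPC sketch, not the agent's actual reporter loop:

```python
import grpc


def report_once(stub, generator, timeout=3):
    """Call the client-streaming collect RPC with a deadline; return True on
    success and False when the deadline expired, so the caller can re-queue."""
    try:
        stub.collect(generator, timeout=timeout)
        return True
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            # the segments already pulled from the generator may not have
            # reached the collector, so they are candidates for re-queuing
            return False
        raise  # any other gRPC error is unexpected here
```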
This can be changed very easily in code, so you could leave it as it is and change it if needed at some point. It could even be made configurable at runtime if you want.
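If the deadline were ever made configurable rather than hard-coded, a sketch could look like this; the `SW_GRPC_REPORT_TIMEOUT` environment variable and the module-level constant are assumptions, not existing agent options:

```python
import os

# assumed knob: 0 (or unset) means "no per-call deadline", anything else is seconds
GRPC_REPORT_TIMEOUT = float(os.getenv('SW_GRPC_REPORT_TIMEOUT', '0')) or None


def report(report_stub, generator):
    # gRPC treats timeout=None as "no deadline", so both modes go through one call
    report_stub.collect(generator, timeout=GRPC_REPORT_TIMEOUT)
```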
Now for some thoughts. I changed from a continuous send to sending one segment at a time because, with a continuous send, I could not be guaranteed to know where the error happened and would not know which segment(s) to resend. For example, does the grpc library go one by one, pulling then sending to the stream, or does it build up blocks of segments before trying to stream them? How efficient is it to send one segment at a time with a client-streaming function? What is the overhead of setting up a stream per segment? Do you have a single unary segment-send function defined in your protocol which could do this more efficiently, or is it only streams?
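To make the trade-off in those questions concrete, here is a rough sketch of the two calling patterns being compared: one long-lived client-streaming call fed by a generator, versus one streaming call per segment. The stub and queue names are placeholders rather than the agent's exact code:

```python
def report_streaming(stub, segment_queue):
    """One collect() call for many segments: cheap per segment, but if the call
    fails it is unclear which segments drawn from the generator were delivered."""
    def generator():
        while True:
            yield segment_queue.get()  # blocks until a segment is available
    stub.collect(generator())


def report_per_segment(stub, segment_queue):
    """One collect() call per segment: each call sets up its own stream, so a
    failure maps to exactly one known segment, at the cost of extra round trips."""
    while True:
        segment = segment_queue.get()
        stub.collect(iter([segment]))  # still a streaming RPC, just with one item
```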
The streaming itself returns a generator of responses, so that may be a clue (not sure).
I think we only provide a stream for this case (segment).
OK, well then don't merge until I dig a little deeper to see if it can be made more efficient.
Any thoughts about introducing one in the future?
OK, so I was uncomfortable with the idea of sending segments one by one using a streaming call, so I did it a different way. There are now two mechanisms to prevent data loss here.
This looks better than the previous method. Thanks |
Please review this thoroughly, because all I wanted to do was resend segments which failed to send due to a grpc timeout. But unexpectedly the behavior of the grpc timeout changed: it doesn't time out anymore, the heartbeat seems to keep it alive now. I don't understand why this behavior changed, which is why I am requesting a thorough look at this.