Stalled Publication/Subscription #281
Comments
The snd-pos and rcv-hwm are sitting at position 1,627,656,896, yet the subscription has only consumed up to 2,176 (the rcv-pos). How big is the term length? Are any other settings changed from the defaults? Also, you mention this is low traffic, but did you really send something like 1.5 GB of data on that channel?
On stream 1 it's just a few heartbeat messages of a few hundred bytes each per second. It gets stuck almost immediately (only a few messages get through). Reproducing is not deterministic (it happens in roughly 1 in 10 runs). On stream 2 it's a lot more messages: 1.5 GB on the sender side. How can that be possible? Term sizes are not changed, I'm using the defaults. All settings are default except for a few increased timeouts.
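The specific timeouts that were increased are not shown here. Purely as an illustration of where such settings live, here is a minimal sketch of raising a couple of liveness/driver timeouts on an embedded media driver; the chosen timeouts and values are hypothetical, not the reporter's configuration:

```java
import io.aeron.Aeron;
import io.aeron.driver.MediaDriver;

import java.util.concurrent.TimeUnit;

public class EmbeddedDriverWithTimeouts
{
    public static void main(final String[] args)
    {
        // Hypothetical values: the actual settings used in this report were not captured.
        final MediaDriver.Context driverCtx = new MediaDriver.Context()
            .clientLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(20))
            .imageLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(20));

        try (MediaDriver driver = MediaDriver.launchEmbedded(driverCtx);
             Aeron aeron = Aeron.connect(new Aeron.Context()
                 .aeronDirectoryName(driver.aeronDirectoryName())
                 .driverTimeoutMs(TimeUnit.SECONDS.toMillis(20))))
        {
            // ... create publications/subscriptions here ...
        }
    }
}
```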
The positions don't make a lot of sense, honestly. Are you sure that all drivers and clients are using the same 1.0.1 version? It might be worth a shot to use the latest snapshot/head as well, just to see.
@mjpt777 anything come to mind for you?
I think it would be worth looking at the publication and image buffers if you still have them. Also, what are the first 30 counters, so we can see the aggregate picture? @tmontgomery I think it might be worth putting the Aeron version in the AeronStat output.
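For reference, one way to dump the counters programmatically (the same values AeronStat displays) is via the client's CountersReader. A minimal sketch, assuming a recent Aeron client and a default Aeron directory:

```java
import io.aeron.Aeron;
import org.agrona.concurrent.status.CountersReader;

public class DumpCounters
{
    public static void main(final String[] args)
    {
        try (Aeron aeron = Aeron.connect())
        {
            final CountersReader counters = aeron.countersReader();

            // Print counter id, type, current value, and label (e.g. snd-pos, rcv-hwm, sub-pos).
            counters.forEach((counterId, typeId, keyBuffer, label) ->
                System.out.printf("%3d: type=%d value=%,d label=%s%n",
                    counterId, typeId, counters.getCounterValue(counterId), label));
        }
    }
}
```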
I'm sure about the version number. We're running this on fresh EC2 instances from our sbt build, and we have been using 1.0.1 for a long time. Here are all the files: https://dl.dropboxusercontent.com/u/14579199/test35.zip
If the positions are correct, then looking at the images is going to be mostly fruitless. The hwm and sub-pos, for example, would be many term buffers apart, surely violating the cleaning assumptions.
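To make the gap concrete: assuming the default 16 MiB term length (the reporter confirms defaults above), the two reported positions are roughly 97 terms apart, far outside any valid window.

```java
public class PositionGap
{
    public static void main(final String[] args)
    {
        final long termLength = 16 * 1024 * 1024;  // default Aeron term length: 16 MiB
        final long sndPosAndHwm = 1_627_656_896L;  // reported snd-pos / rcv-hwm
        final long subPos = 2_176L;                // reported sub-pos / rcv-pos

        // Prints 97: the consumer position trails the high-water mark by ~97 full terms,
        // so the counter values cannot both be genuine.
        System.out.println((sndPosAndHwm - subPos) / termLength);
    }
}
```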
The stats show only 20 MB sent/received, so the high position value is definitely off for some reason. The only reason I can think of for this occurring would be corruption of the header in some way that infers a different position. This seems to be happening as the message is put into the term, since the snd-pos and pub-lmt are also corrupted. A possibility might be setting the reservedValue with an incorrect offset, or a tryClaim that has messed with the offset incorrectly.
node-1
node-2
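For context on the reservedValue/tryClaim theory: the intended tryClaim pattern is to write only within the claimed region and then commit (or abort). A minimal sketch of that pattern, with a hypothetical send method, purely to illustrate where a misused offset could corrupt the term:

```java
import io.aeron.Publication;
import io.aeron.logbuffer.BufferClaim;
import org.agrona.MutableDirectBuffer;

public final class TryClaimExample
{
    private final BufferClaim bufferClaim = new BufferClaim();

    public boolean send(final Publication publication, final byte[] message)
    {
        final long result = publication.tryClaim(message.length, bufferClaim);
        if (result > 0)
        {
            try
            {
                // Write only within [offset, offset + length) of the claimed region.
                final MutableDirectBuffer buffer = bufferClaim.buffer();
                buffer.putBytes(bufferClaim.offset(), message);
            }
            finally
            {
                bufferClaim.commit(); // or bufferClaim.abort() on failure
            }
            return true;
        }
        return false; // back pressure, not connected, etc.
    }
}
```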
I don't see how the rcv-pos or sub-pos can get so far behind the rcv-hwm. There are no flow control over- or under-runs, so this would have been rejected. It is as if something corrupted the counters. The image should have the latest contents.
@mjpt777 agreed. The only way snd-pos increments is via sending data, and since the byte counts show there is no way that much data was sent, the snd-pos, hwm, and pub-lmt must be invalid somehow.
For the Publication logbuffer, data is 0 until offset 2,176 (position 2,176), which happens to be the sub-pos and rcv-pos. For a publication, this is very unusual. For the Image logbuffer, there is data up to position 2,176, then it is 0. One possibility is that snd-pos was corrupted. If it got corrupted, then it would update pub-lmt with a bad value. Also, sending would use that position for scanning and might send junk (depending on where it was). That wouldn't quite explain why rcv-hwm is at that value, though. Whatever hit snd-pos could hit rcv-hwm and corrupt it separately... but it is a separate machine and file. The zeroed data in the publication might be explained by the snd-pos; cleaning might do that if snd-pos was corrupted.
@patriknw are the stat values being read (or written to) by the app? Anything that could touch the cnc file?
@mjpt777 BTW, the rcv-hwm is incremented on a heartbeat regardless of overrun. So it would be possible for an incoming heartbeat message to push the rcv-hwm up to some value, especially if the snd-pos was corrupted. The snd-pos is used to fill the heartbeat.
We should probably filter that out as an overrun.
Probably. But it's an orthogonal issue. The corruption of snd-pos is problematic.
Looking at the code, I think we should validate that status messages for flow control are in a valid range.
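A rough sketch of the kind of range check being suggested, purely illustrative and not Aeron's actual driver code (names are hypothetical):

```java
// Illustrative only: a simplified range check for an incoming status message (SM).
final class StatusMessageValidator
{
    /**
     * Accept an SM consumption position only if it refers to data the sender actually
     * produced, i.e. it is not ahead of snd-pos and not more than a term behind it.
     */
    static boolean isValidSmPosition(final long smPosition, final long senderPosition, final int termLength)
    {
        return smPosition <= senderPosition && senderPosition - smPosition <= termLength;
    }
}
```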
Sure. We can validate that the position is one that was sent. I assume you mean on the Publication side, processing an SM. Although I'm not seeing how a bad SM could affect the snd-pos as observed. The Publication logbuffer has data from 2,176 to 317,632 and is 0 afterward. A bad SM could allow more to be sent, but I don't see how it could change snd-pos directly.
There is no counter for NAKs that are filtered for being outside of a termLength. Most of the NAKs received in the stats are filtered because snd-pos is off by so much.
Something else that is "odd" is that there is no rcv-pos on node-1 for the image with sessionId 434696915. The image logbuffer is there, though, and it has data up to the hwm, which is 170,432. Now, it could be that the snapshot was taken just at the moment the image was being closed and the rcv-pos had already been removed. But it is odd that it is not there. Edit: a supposition is that perhaps the allocation of counters used slot 24 for both snd-pos and rcv-pos. However, this wouldn't account for the actual values and how they are handled. Also, all allocation and freeing of counters is done on the driver conductor thread and appears to normally work just fine.
I don't use reservedValue or tryClaim, only offer and poll. I read the stats from a separate process with the aeron-samples AeronStat. We are not touching the cnc file, apart from polling it periodically for log errors using (a copy of) AeronErrorLog. The media driver is started embedded, if that matters. Could it be related to the fact that I continue to offer (in a busy-wait loop) even though I get NOT_CONNECTED? The returned positions from offer on node1 look good (to me).
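For reference, the described busy-wait offer pattern might look roughly like the following sketch; it is illustrative only, handling the standard Publication.offer return codes, and is not the reporter's actual code:

```java
import io.aeron.Publication;
import org.agrona.concurrent.UnsafeBuffer;

public final class OfferLoop
{
    /**
     * Keep offering until the message is accepted. NOT_CONNECTED, BACK_PRESSURED and
     * ADMIN_ACTION are retried, mirroring the busy-wait loop described above.
     */
    public static long offerBlocking(final Publication publication, final UnsafeBuffer buffer, final int length)
    {
        long position;
        while ((position = publication.offer(buffer, 0, length)) < 0)
        {
            if (position == Publication.CLOSED)
            {
                throw new IllegalStateException("publication is closed");
            }
            Thread.yield(); // retry on NOT_CONNECTED, BACK_PRESSURED, ADMIN_ACTION
        }
        return position; // new stream position after the message was appended
    }
}
```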
@patriknw in the test, does the application create and release/close any other Publications or Subscriptions not shown in the stats?
I didn't use any filters apart from type=[1-9] with AeronStat. However, node1 was trying to send to node2 before node2 had been started; after a while we give up on that and close the Publication. I see in the logs that this happened in this test run, at 03:38:57. node2 was started at 03:39:21 (clocks are in sync to within 1 second). So there was one Publication before the things you see in the stats here.
* important bug fix in Aeron: aeron-io/aeron#281
Version 1.0.1
I'm debugging a case where `sub.poll` suddenly doesn't receive any more messages, and it stays that way. On the sending side `pub.offer` is successful (positive return value, connected=true, closed=false). Only two nodes.

node1 tries to send to node2, but node2 is not started yet. ~30 seconds later node2 is started and they successfully exchange a few messages (I have application-level logs for that), and then node2 stops receiving more messages in `sub.poll`. `onFragment` in the FragmentAssembler is not invoked even though I repeatedly call `poll`. Those messages are below MTU size. The systems are rather loaded in this scenario but not overloaded, and the load is stopped after a while.
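For reference, the subscriber side described above (a FragmentAssembler wrapping the application handler, driven by repeated poll calls) might look roughly like this sketch; it is illustrative only, not the reporter's code:

```java
import io.aeron.FragmentAssembler;
import io.aeron.Subscription;
import io.aeron.logbuffer.FragmentHandler;
import org.agrona.concurrent.BackoffIdleStrategy;
import org.agrona.concurrent.IdleStrategy;

public final class PollLoop
{
    public static void run(final Subscription subscription)
    {
        // onFragment is only invoked with fully reassembled messages.
        final FragmentHandler onFragment = (buffer, offset, length, header) ->
            System.out.printf("received %d bytes on session %d%n", length, header.sessionId());

        final FragmentAssembler assembler = new FragmentAssembler(onFragment);
        final IdleStrategy idleStrategy = new BackoffIdleStrategy();

        while (!Thread.currentThread().isInterrupted())
        {
            final int fragmentsRead = subscription.poll(assembler, 10);
            idleStrategy.idle(fragmentsRead); // back off when nothing was read
        }
    }
}
```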
AeronStat from node1 172-31-10-77:
AeronStat from node2 172-31-8-204:
The problematic session is 1287205435. The other streams seem to progress. I kept it running for minutes after the stall.
Stream 1 is our control stream and it's low traffic, a few messages per second on this stream.
I have all files, if you need more information.