
unexpected heartbeat behavior #19

Closed
liamstask opened this issue Feb 16, 2016 · 4 comments

Comments

@liamstask
Contributor

Given a trivial program like https://gist.github.com/liamstask/e76f07ed86d4345ac6ac (abbreviated to omit the headers; let me know and I can send the complete source), I see the following output:

missed some seqs: 99 : 2421, interval: 3119
missed some seqs: 2519 : 4849, interval: 3002
missed some seqs: 4947 : 7285, interval: 3008
missed some seqs: 7383 : 9602, interval: 3002
missed some seqs: 9700 : 11918, interval: 3001
missed some seqs: 12016 : 14247, interval: 2999
missed some seqs: 14345 : 16578, interval: 3001
missed some seqs: 16676 : 18924, interval: 3005
missed some seqs: 19022 : 21301, interval: 3000
missed some seqs: 21399 : 23649, interval: 3003
missed some seqs: 23747 : 26057, interval: 2998
missed some seqs: 26155 : 28518, interval: 3004
missed some seqs: 28616 : 30892, interval: 3003
missed some seqs: 30990 : 33252, interval: 2998
missed some seqs: 33350 : 35624, interval: 3006
missed some seqs: 35722 : 37962, interval: 2999
missed some seqs: 38060 : 40310, interval: 3004
missed some seqs: 40408 : 42692, interval: 2998
missed some seqs: 42790 : 45107, interval: 3008
missed some seqs: 45205 : 47434, interval: 3000
missed some seqs: 47532 : 49874, interval: 2999
missed some seqs: 49972 : 52303, interval: 3000
missed some seqs: 52401 : 54794, interval: 2999
missed some seqs: 54892 : 57190, interval: 3001
missed some seqs: 57288 : 59516, interval: 3003
missed some seqs: 59614 : 61847, interval: 3006
missed some seqs: 61945 : 64204, interval: 3002
missed some seqs: 64302 : 66554, interval: 3000
missed some seqs: 66652 : 68938, interval: 3004
missed some seqs: 69036 : 71381, interval: 3002
missed some seqs: 71479 : 73775, interval: 2996
missed some seqs: 73873 : 76161, interval: 3002

The interval appears to correspond to the heartbeat period. For example, if I set pa.times.heartbeatPeriod.seconds = 5; in the talker's PublisherAttributes instead of using the default of 3 seconds, the reported interval changes to approximately 5 seconds.

I would not expect the heartbeat to affect the delivery of data packets in this way. Is this a bug, or do I have something misconfigured?

@richiprosima
Contributor

Hi Liam,

I've tested your example. What you are seeing is normal behavior; let me try to explain it.

The data are not being lost; they are not being published in the first place. If you check the return value of the write function, you will see that many writes fail. This is because the writer history is full. A sample is removed from the writer history only once it has been acknowledged by all readers. That acknowledgment happens when the writer sends a HEARTBEAT announcing which data it has sent, and each reader responds with an ACK announcing which data it has received.

In your example you are sending data very fast: 1000 samples per second. The default heartbeat period is 3 seconds, so for three seconds the writer does not send the next heartbeat, and the readers do not acknowledge anything. Because the writer history holds 100 samples, only 100 of the 3000 samples written in those 3 seconds are actually sent. There are two solutions:

  • Increase the writer history:

```cpp
/* Talker::init() */
// ...
pa.topic.historyQos.kind = KEEP_ALL_HISTORY_QOS;
pa.topic.historyQos.depth = 5000;
pa.topic.resourceLimitsQos.max_samples = 5000;
pa.topic.resourceLimitsQos.allocated_samples = 5000;
// ...
```

  • Decrease the heartbeat period in the writer and the heartbeat response delay in the reader:

```cpp
/* Talker::init() */
// ...
pa.topic.resourceLimitsQos.allocated_samples = 100;
// 10 milliseconds period (fraction is in units of 2^-32 s: 42949673 / 2^32 ≈ 0.01 s)
pa.times.heartbeatPeriod.seconds = 0;
pa.times.heartbeatPeriod.fraction = 42949673;
pa.qos.m_reliability.kind = RELIABLE_RELIABILITY_QOS;
// ...

/* Listener::init() */
// ...
sa.qos.m_reliability.kind = RELIABLE_RELIABILITY_QOS;
// 10 milliseconds response delay
sa.times.heartbeatResponseDelay.seconds = 0;
sa.times.heartbeatResponseDelay.fraction = 42949673;
// ...
```

@liamstask
Contributor Author

@richiprosima Thank you for the quick response; that makes sense.

Does this mean that it's possible for subscribers with very long heartbeat intervals to prevent delivery of data to other subscribers on the same topic, by forcing the publisher's history to fill up?

@JaimeMartin
Member

Hi Liam,

The heartbeat period is a parameter on the publishing side, so it is common to all subscribers. You are right: a very long heartbeat period can fill your history if you are writing too fast.

The solution is to increase your history buffer, decrease the heartbeat period, or use a combination of both, depending on your use case.

@liamstask
Contributor Author

Great. I had not noticed that the default heartbeatResponseDelay is ~116 ms, which I imagine should usually be OK without additional configuration. Thanks, I will close this out.


3 participants