Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[14921] Lose message since WriterProxy::changes_from_writer_low_mark_ is used in StatefulReader::begin_sample_access_nts() #2760

Closed
1 task done
Barry-Xu-2018 opened this issue Jun 20, 2022 · 11 comments
Labels
need more info Issue that requires more info from contributor

Comments

@Barry-Xu-2018
Copy link
Contributor

Barry-Xu-2018 commented Jun 20, 2022

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

No message loss

Current behavior

Some messages loss

Steps to reproduce

This can be reproduced by our codes. But I cannot paste these codes.
So I will describe found the problem based on my debug log output.

QoS setting:

  • history: KEEP_LAST_HISTORY_QOS
  • durability: TRANSIENT_LOCAL_DURABILITY_QOS
  • reliability: RELIABLE_RELIABILITY_QOS

At first, StatefulReader::change_received() receive messages (sequence number from 21 to 346 in turn). At this time, WriterProxy::changes_from_writer_low_mark_ is 0 and these sequence numbers are added to WriterProxy::changes_received_.

Then, a message with sequence number 1 is received. WriterProxy::changes_from_writer_low_mark_ is increased to 1. And WriterProxy::NotifyChanges() is called to send notification. ReadTakeCommand::add_instance() is called to take a message and remove taken message.

From sequence number 1 to 19, the process works well.
While receiving a message with sequence number 20, the problem occurs.
In StatefulReader::change_received(), WriterProxy::received_change_set() is called

if (a_change->is_fully_assembled())
{
ret = prox->received_change_set(a_change->sequenceNumber);
}

After receiving a message with sequence number 19, changes_from_writer_low_mark_ is 19 and the sequence number of current message is 20. So execute below code

if (changes_from_writer_low_mark_ + 1 == seq_num)
{
changes_from_writer_low_mark_ = seq_num;
cleanup();
}

The problem occurs at cleanup()

void WriterProxy::cleanup()
{
ChangeIterator chit = changes_received_.begin();
// Jump over all consecutive received changes starting on the next to low_mark
while (chit != changes_received_.end() && *chit == changes_from_writer_low_mark_ + 1)
{
chit++;
changes_from_writer_low_mark_++;
}
// Remove all those changes
changes_received_.erase(changes_received_.begin(), chit);
}

At this time, changes_received_ include sequence number from 21 to 346.
So changes_from_writer_low_mark_ will be increased from 20 to 346.

While taking message by ReadTakeCommand::add_instance(), StatefulReader::begin_sample_access_nts() is used to check if this message is future message based on changes_from_writer_low_mark_ which is gotten by WriterProxy::available_changes_max();

if (reader_->begin_sample_access_nts(change, wp, is_future_change))

Since changes_from_writer_low_mark_ is big (346), message is always taken and removed. Note that we want to take message with sequence number 21, but it will take a message with other sequence number and remove it.

Fast DDS version/commit

v2.6.0

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

Default configuration, UDPv4 & SHM

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

@Barry-Xu-2018 Barry-Xu-2018 added the triage Issue pending classification label Jun 20, 2022
@Barry-Xu-2018
Copy link
Contributor Author

About solution, whether seq = wp->available_changes_max(); can be replaced by wp->last_notified() (last_notified() is new interface and used to export WriterProxy::last_notified_) ?

bool StatefulReader::begin_sample_access_nts(
CacheChange_t* change,
WriterProxy*& wp,
bool& is_future_change)
{
const GUID_t& writer_guid = change->writerGUID;
is_future_change = false;
if (matched_writer_lookup(writer_guid, &wp))
{
SequenceNumber_t seq;
seq = wp->available_changes_max();
if (seq < change->sequenceNumber)
{
is_future_change = true;
}
}
return true;
}

@MiguelCompany MiguelCompany changed the title Lose message since WriterProxy::changes_from_writer_low_mark_ is used in StatefulReader::begin_sample_access_nts() [14921] Lose message since WriterProxy::changes_from_writer_low_mark_ is used in StatefulReader::begin_sample_access_nts() Jun 20, 2022
@MiguelCompany
Copy link
Member

@Barry-Xu-2018 I have several questions on this:

  1. Does this happen on a topic with keys?
  2. You say you are using KEEP_LAST_HISTORY_QOS. What is the configured depth?
  3. Is the message with sequence number 20 ever returned?

@MiguelCompany MiguelCompany added need more info Issue that requires more info from contributor and removed triage Issue pending classification labels Jun 20, 2022
@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Jun 21, 2022

@MiguelCompany

  1. Does this happen on a topic with keys?

Yes.

  1. You say you are using KEEP_LAST_HISTORY_QOS. What is the configured depth?

Depth is 1.

  1. Is the message with sequence number 20 ever returned?

Yes.
My description has a mistake. After a message with sequence number 21, take a message with other sequence number.

I add log at here

history_.remove_change_sub(change, it);

Output:
[Fast-DDS][add_instance] remove_change_sub : seq_no = 339
[Fast-DDS][add_instance] remove_change_sub : seq_no = 102
[Fast-DDS][add_instance] remove_change_sub : seq_no = 126
[Fast-DDS][add_instance] remove_change_sub : seq_no = 174
[Fast-DDS][add_instance] remove_change_sub : seq_no = 230
...

@MiguelCompany
Copy link
Member

@Barry-Xu-2018

What you are seeing is expected behavior. Let me try to explain it.

Using KEEP_LAST with depth=1 means that, for each instance (i.e. for each different key), only the last sample is kept in the history.

Then, when calling DataReader::take(), as indicated in the documentation, the returned collection is a ‘list’ where samples belonging to the same data instance are consecutive. The order of the instances is not specified, only the order between the samples on the same instance is guaranteed.

I guess that the sequence numbers shown on your debug output correspond to different instances, and they are removed because they are being returned to the user on a take call (i.e. (added && take_samples) is true)

@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Jun 21, 2022

@MiguelCompany

Thanks for your explanation.
I understand a little more deeply.

I modified DDS/HelloWorldExample to reproduce the problem.
Please look at Barry-Xu-2018@6e91807 (in branch topic-order-issues-with-key-topic)
And I make a script run.sh to execute because this problem cannot be easy to be reproduced 100%. (In my environment, 1/7)
Please look at README.txt on how to run and how to check result.

Subscriber log without log --- sub.log
Subscriber log with error --- sub_fail.log

@MiguelCompany
Copy link
Member

@Barry-Xu-2018 Again. I do not see any unexpected behavior.

As we explain on the documentation, you can see the topic as a single DB table, where each instance is a row. The samples taken from the DataReader are subsequent updates to one or more rows.

When there are several samples ready to be taken (which is what you are seeing on sub_fail.log), the order of the instances is not specified. In the case of Fast DDS, they will be ordered by their instance handle (which in your example is the value of the index field).

If what you're trying to achieve is to receive the samples in exactly the same order in which they are added to the DataWriter, that is the purpose of the PRESENTATION QoS ordered_access flag. This is, unfortunately, not currently implemented on Fast DDS.

If using keys on your topic is not a hard requirement, you could change to a non-keyed topic and increase the depth field on the HISTORY QoS.

@Barry-Xu-2018
Copy link
Contributor Author

@MiguelCompany

When there are several samples ready to be taken (which is what you are seeing on sub_fail.log), the order of the instances is not specified.

Yes. The order of message with different key cannot be guaranteed.
I have a question.
Why subscriber cannot receive all sent messages ? (In sub_fail.log, about more than 50 messages are received. The number of sent message is 100)

@MiguelCompany
Copy link
Member

@Barry-Xu-2018 As on_data_available() is not required to be called once per sample, you need to take all possible samples from the reader.

See my changes over your branch for a solution

@Barry-Xu-2018
Copy link
Contributor Author

@MiguelCompany

As on_data_available() is not required to be called once per sample, you need to take all possible samples from the reader.

See my changes over your branch for a solution

Thank you for your support.
After applying your change, message loss doesn't occur.

@Barry-Xu-2018
Copy link
Contributor Author

@MiguelCompany

I have another question.
After receiving message (these message should be removed in remove_change_sub(), NotifyChanges is called contiously.
Is it expected behavior ?

[Fast-DDS][add_instance] loop change : seq_no = 9
[Fast-DDS][add_instance] remove_change_sub : seq_no = 9
id:9, index:491 RECEIVED
[Fast-DDS][add_instance] loop change : seq_no = 8
[Fast-DDS][add_instance] remove_change_sub : seq_no = 8
id:8, index:492 RECEIVED   <== Here, all message are recevied. And then NotifyChanges  is called contiously.
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 9
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 10
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 11                                                [Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 12
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 13
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 14
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 15
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 16
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 17
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 18
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 19
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 20
[Fast-DDS][NotifyChanges] start to get change GUID 01.0f.ed.8d.87.41.52.c6.01.00.00.00|0.0.1.2 : seq_no = 21
...

Test code is located at https://github.com/Barry-Xu-2018/Fast-DDS/tree/topic-order-issues-with-key-topic.

@MiguelCompany
Copy link
Member

@Barry-Xu-2018 NotifyChanges is just finishing the loop to keep the consistency of the WriterProxy.

We have a refactor of this on the roadmap so that NotifyChanges will call a new method on the RTPSReaderListener, so we can notify all changes at the same time. This way, we will get rid of that loop.

fujitatomoya added a commit to fujitatomoya/Fast-DDS that referenced this issue Jun 24, 2022
  eProsima#2760

Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
EduPonz pushed a commit that referenced this issue Jul 1, 2022
#2760

Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need more info Issue that requires more info from contributor
Projects
None yet
Development

No branches or pull requests

2 participants