SharedFuture from async_send_request never becomes valid #2039

Closed

jefferyyjhsu opened this issue Nov 1, 2022 · 19 comments
Labels
help wanted Extra attention is needed

Comments

@jefferyyjhsu
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04.1 LTS (osrf/ros:humble-desktop docker image)
  • Installation type:
    • from source
  • Version or commit hash:
  • DDS implementation:
    • both fastrtps and cyclonedds
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Build and run the unit test below

#include <gtest/gtest.h>
#include <rclcpp/rclcpp.hpp>

#include <std_srvs/srv/set_bool.hpp>

#include <string>

using namespace std::chrono;

class TestRegularService : public ::testing::Test
{
public:
    void SetUp() override
    {
        rclcpp::init(0, nullptr);

        m_node_executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();

        // Create a node
        auto opt = rclcpp::NodeOptions();
        auto node = rclcpp::Node::make_shared("server_node", opt);
        m_nodes.push_back(node);
        m_node_executor->add_node(node->get_node_base_interface());
        m_service = rclcpp::create_service<std_srvs::srv::SetBool>(
            node->get_node_base_interface(),
            node->get_node_services_interface(),
            "example_service",
            [node](const std::shared_ptr<std_srvs::srv::SetBool::Request> request, 
                const std::shared_ptr<std_srvs::srv::SetBool::Response> response) {
                static unsigned int serial_num = 1;
                (void)request;
                RCLCPP_INFO(node->get_logger(), "Received service client request... Sending response: %d", serial_num);
                response->success = true;
                response->message = std::to_string(serial_num++);
            },
            rclcpp::ServicesQoS().get_rmw_qos_profile(),
            nullptr);

        node = rclcpp::Node::make_shared("client_node", opt);
        m_nodes.push_back(node);
        m_node_executor->add_node(node);
        m_client = node->create_client<std_srvs::srv::SetBool>("example_service");

        

        m_node_thread = std::thread(std::bind([this]()
        {
            this->m_node_executor->spin();
        }));
    }

    ~TestRegularService()
    {
        if (rclcpp::ok())
        {
            rclcpp::shutdown();
        }
        m_node_thread.join();
    }

    std::vector<std::shared_ptr<rclcpp::Node>> m_nodes;
    std::shared_ptr<rclcpp::executors::SingleThreadedExecutor> m_node_executor;
    rclcpp::ServiceBase::SharedPtr m_service;
    rclcpp::Client<std_srvs::srv::SetBool>::SharedPtr m_client;
    std::thread m_node_thread;

};

TEST_F(TestRegularService, test_regular_service)
{
    auto request = std::make_shared<std_srvs::srv::SetBool::Request>();
    request->data = true;

    while (!m_client->wait_for_service(1s)) {
        if (!rclcpp::ok()) {
            RCLCPP_ERROR(m_nodes.front()->get_logger(), "Interrupted while waiting for the service. Exiting.");
            return;
        }
        RCLCPP_INFO(m_nodes.front()->get_logger(), "service not available, waiting again...");
    }
    unsigned int counter = 1;
    do {
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Sending request: %u", counter);
        auto result = m_client->async_send_request(request);
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Waiting for response: %u", counter);
        auto answer = result.get();
        ASSERT_TRUE(answer->success);
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Got response: %s", answer->message.c_str());
    } while (++counter < 1000000);
}

int main(int argc, char **argv)
{
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}

Expected behavior

The unit test sends an async request and later retrieves a valid result with SharedFuture::get(). The test should repeat this process 1,000,000 times without issue.

Actual behavior

The unit test hangs randomly, regardless of the underlying DDS implementation.
Example snippet:

Start testing: Oct 31 20:35 UTC
----------------------------------------------------------
71/105 Testing: test-regular-ros2-service
71/105 Test: test-regular-ros2-service
Command: "/usr/bin/python3.10" "-u" "/ros2_humble/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml" "--package-name" "rclcpp" "--output-file" "/ros2_humble/build/rclcpp/ament_cmake_gtest/test-regular-ros2-service.txt" "--command" "/ros2_humble/build/rclcpp/test/rclcpp/test-regular-ros2-service" "--gtest_output=xml:/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml"
Directory: /ros2_humble/build/rclcpp/test/rclcpp
"test-regular-ros2-service" start time: Oct 31 20:35 UTC
Output:
----------------------------------------------------------
-- run_test.py: invoking following command in '/ros2_humble/build/rclcpp/test/rclcpp':
 - /ros2_humble/build/rclcpp/test/rclcpp/test-regular-ros2-service --gtest_output=xml:/ros2_humble/build/rclcpp/test_results/rclcpp/test-regular-ros2-service.gtest.xml
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestRegularService
[ RUN      ] TestRegularService.test_regular_service
[INFO] [1667248559.329850983] [server_node]: Sending request: 1
[INFO] [1667248559.329986353] [server_node]: Waiting for response: 1
[INFO] [1667248559.330659463] [server_node]: Received service client request... Sending response: 1
[INFO] [1667248559.330995192] [server_node]: Got response: 1
[INFO] [1667248559.331019237] [server_node]: Sending request: 2
[INFO] [1667248559.331041372] [server_node]: Waiting for response: 2
[INFO] [1667248559.331233045] [server_node]: Received service client request... Sending response: 2
[INFO] [1667248559.331454604] [server_node]: Got response: 2
[INFO] [1667248559.331478869] [server_node]: Sending request: 3
[INFO] [1667248559.331500609] [server_node]: Waiting for response: 3
[INFO] [1667248559.331623666] [server_node]: Received service client request... Sending response: 3
[INFO] [1667248559.331833534] [server_node]: Got response: 3
[INFO] [1667248559.331858362] [server_node]: Sending request: 4
[INFO] [1667248559.331879729] [server_node]: Waiting for response: 4
[INFO] [1667248559.331984129] [server_node]: Received service client request... Sending response: 4
[INFO] [1667248559.332344850] [server_node]: Got response: 4
.......
.......
.......
[INFO] [1667248559.887715670] [server_node]: Sending request: 2704
[INFO] [1667248559.887733614] [server_node]: Waiting for response: 2704
[INFO] [1667248559.887767825] [server_node]: Received service client request... Sending response: 2704
[INFO] [1667248559.887899850] [server_node]: Got response: 2704
[INFO] [1667248559.887913152] [server_node]: Sending request: 2705
[INFO] [1667248559.887931926] [server_node]: Waiting for response: 2705
[INFO] [1667248559.887967492] [server_node]: Received service client request... Sending response: 2705
[INFO] [1667248559.888097595] [server_node]: Got response: 2705
[INFO] [1667248559.888119401] [server_node]: Sending request: 2706
[INFO] [1667248559.888141074] [server_node]: Waiting for response: 2706
<end of output>
Test time =  60.06 sec
----------------------------------------------------------
Test Failed.
"test-regular-ros2-service" end time: Oct 31 20:36 UTC
"test-regular-ros2-service" time elapsed: 00:01:00
----------------------------------------------------------

End testing: Oct 31 20:36 UTC

gtest =  60.06 sec*proc

@alsora
Collaborator

alsora commented Nov 1, 2022

Using the intra-process services and clients from #1847, it's possible to hot-fix the issue for single-process applications (i.e. enabling IPC on all client/server pairs).
However, the root cause of this issue is still unclear.

@llapx
Contributor

llapx commented Nov 4, 2022

It hangs at result.get(). I updated the code to use rclcpp::spin_until_future_complete to wait for the incoming response, and it works well; the code looks like below:

...
        node = rclcpp::Node::make_shared("client_node", opt);
        m_nodes.push_back(node);
        //m_node_executor->add_node(node);
        m_client = node->create_client<std_srvs::srv::SetBool>("example_service");
...
    do {
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Sending request: %d" , counter);
        auto result = m_client->async_send_request(request);
        RCLCPP_INFO(m_nodes.front()->get_logger(), "Waiting for response: %d" , counter);
        if (rclcpp::spin_until_future_complete(m_nodes.back(), result) ==
            rclcpp::FutureReturnCode::SUCCESS)
        {
            auto answer = result.get();
            RCLCPP_INFO(m_nodes.front()->get_logger(), "Got response: %s", answer->message.c_str());
        }
...

@iuhilnehc-ynos
Collaborator

The case makes sense to me.

The reason future.get() blocks is that the promise never has set_value() called on it in handle_response.

If the round trip (request from client to service, and response from service back to client) completes before the sequence_number is put into pending_requests_, then get_and_erase_pending_request in handle_response cannot find any record of that sequence_number.

I think that

std::lock_guard<std::mutex> lock(pending_requests_mutex_);
should be taken before calling rcl_send_request.
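
To make the race concrete, here is a minimal, self-contained analogue (a sketch only, not rclcpp's actual implementation; the names pending_requests, handle_response and the int payload are made up for illustration). If the response handler can run before the promise has been stored in the pending map, set_value() is never called and the returned future never becomes ready. Registering the promise while still holding the mutex that the handler takes (which is what moving the lock before rcl_send_request achieves) closes that window:

#include <cstdint>
#include <future>
#include <iostream>
#include <map>
#include <mutex>
#include <thread>

std::mutex pending_requests_mutex;                      // analogue of pending_requests_mutex_
std::map<int64_t, std::promise<int>> pending_requests;  // sequence number -> promise

// Analogue of Client::handle_response(): look up the promise for the sequence
// number and fulfil it. If the entry is missing, the promise is never completed
// and the corresponding future.get() hangs forever -- the failure mode above.
void handle_response(int64_t sequence, int value)
{
    std::lock_guard<std::mutex> lock(pending_requests_mutex);
    auto it = pending_requests.find(sequence);
    if (it != pending_requests.end()) {
        it->second.set_value(value);
        pending_requests.erase(it);
    } else {
        std::cout << "response for unknown request " << sequence << " dropped\n";
    }
}

int main()
{
    std::thread responder;
    std::future<int> future;
    {
        // Analogue of the fixed async_send_request(): hold the mutex while "sending",
        // so the response handler cannot run before the promise is registered.
        std::lock_guard<std::mutex> lock(pending_requests_mutex);
        future = pending_requests[1].get_future();
        // "Send" request 1: the responder thread stands in for the middleware
        // delivering the response; it blocks on the mutex until we release it.
        responder = std::thread(handle_response, 1, 42);
    }
    std::cout << "got " << future.get() << "\n";  // always completes with this ordering
    responder.join();
    return 0;
}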

@llapx
Contributor

llapx commented Nov 7, 2022

@iuhilnehc-ynos
Thanks for your tip. I have created PR #2044 to fix this issue.

@fujitatomoya
Collaborator

@llapx #2044 looks good to me; as @iuhilnehc-ynos mentioned, this could happen in theory.

Just in case, can you make sure #2039 (comment) is the cause of this issue, using gdb or additional logging messages? I think we want to know the root cause before fixing the problem.

@llapx
Contributor

llapx commented Nov 8, 2022

@fujitatomoya

can you make sure #2039 (comment) is the cause of this issue

Confirmed, and it works well (the test case passed, no hang happened).

using gdb or additional logging messages?

Not yet, I will give it a try.

@jefferyyjhsu
Contributor Author

@llapx, I cherry-picked your fix commit ece7a0eb1b1a656aa2241d71a06c3904f4bcb448 to humble but couldn't get the same result. The example still locks up for me. Please see the attached gdb backtrace. Thanks!
gdb.txt

@llapx
Contributor

llapx commented Nov 9, 2022

@jefferyyjhsu

I retested it on the humble branch, and it works well (on both the latest rolling and humble branches):

...
100: [INFO] [1667963086.523220110] [server_node]: Received service client request... Sending response: 999999
100: [INFO] [1667963086.523367484] [server_node]: Got response: 999999
100: [       OK ] TestRegularService.test_regular_service (285302 ms)
100: [----------] 1 test from TestRegularService (285302 ms total)
100: 
100: [----------] Global test environment tear-down
100: [==========] 1 test from 1 test suite ran. (285302 ms total)
100: [  PASSED  ] 1 test.
100: -- run_test.py: return code 0
100: -- run_test.py: inject classname prefix into gtest result file '/work/rclcpp_issues_2039/build/rclcpp/test_results/rclcpp/test_regular_service.gtest.xml'
100: -- run_test.py: verify result file '/work/rclcpp_issues_2039/build/rclcpp/test_results/rclcpp/test_regular_service.gtest.xml'
1/1 Test #100: test_regular_service .............   Passed  285.49 sec
                                               
The following tests passed:
        test_regular_service

100% tests passed, 0 tests failed out of 1

Label Time Summary:
gtest    = 285.49 sec*proc (1 test)

Total Test time (real) = 286.08 sec
Finished <<< rclcpp [8min 4s]

Summary: 1 package finished [8min 4s]

My test env:

ros:humble on docker
rclcpp humble branch with latest commit ae8b033

@jefferyyjhsu
Contributor Author

Hi @llapx,
I did some more tests today. It turns out that the test now passes when using cyclonedds. However, it still fails with fastrtps.
Do you happen to be using cyclonedds as well?

@llapx
Contributor

llapx commented Nov 10, 2022

@jefferyyjhsu

I used the default environment of the latest ros:humble docker image (with some extra packages installed for compiling rclcpp):

root@csc:/work/rclcpp_issues_2039# dpkg -l | grep fastrtps
ii  ros-humble-fastrtps                               2.6.1-1jammy.20220620.173333            amd64        *eprosima Fast DDS* (formerly Fast RTPS) is a C++ implementation of the DDS (Data Distribution Service) standard of the OMG (Object Management Group).
ii  ros-humble-fastrtps-cmake-module                  2.2.0-2jammy.20220520.001034            amd64        Provide CMake module to find eProsima FastRTPS.
ii  ros-humble-rmw-fastrtps-cpp                       6.2.1-2jammy.20220620.175641            amd64        Implement the ROS middleware interface using eProsima FastRTPS static code generation in C++.
ii  ros-humble-rmw-fastrtps-shared-cpp                6.2.1-2jammy.20220620.175144            amd64        Code shared on static and dynamic type support of rmw_fastrtps_cpp.
ii  ros-humble-rosidl-typesupport-fastrtps-c          2.2.0-2jammy.20220520.005614            amd64        Generate the C interfaces for eProsima FastRTPS.
ii  ros-humble-rosidl-typesupport-fastrtps-cpp        2.2.0-2jammy.20220520.005259            amd64        Generate the C++ interfaces for eProsima FastRTPS.
root@csc:/work/rclcpp_issues_2039# dpkg -l | grep cyclone

@iuhilnehc-ynos
Collaborator

@jefferyyjhsu, thanks for the confirmation letting me know there is still an issue with rmw_fastrtps. I did the test yesterday, and I can reproduce the issue using rmw_fastrtps on both humble and rolling (I didn't use docker to test).

From #2039 (comment), I think that the issue for rmw_cyclonedds_cpp is fixed by #2044 (which is, of course, an rclcpp bug).

I am afraid that another bug exists in Fast-DDS, but it needs to be confirmed by eProsima.
I'd like to share my understanding below.

WaitSetImpl::wake_up() (https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L159) does not hold WaitSetImpl::mutex_, but it is called from https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/ConditionNotifier.cpp#L58.

The wait at https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L131 might therefore lose a condition notification while the condition_variable is waiting on fill_active_conditions.

For example, fill_active_conditions at https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/WaitSetImpl.cpp#L123 returns ret_val(false), and the notification is then triggered at https://github.com/eProsima/Fast-DDS/blob/cafd896e0e786e1af49f8f953e0843cc10780d29/src/cpp/fastdds/core/condition/StatusConditionImpl.cpp#L90 before the waiter actually blocks.

A possible solution:

diff --git a/src/cpp/fastdds/core/condition/WaitSetImpl.cpp b/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
index 73390e35d..ecd74b735 100644
--- a/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
+++ b/src/cpp/fastdds/core/condition/WaitSetImpl.cpp
@@ -68,7 +68,7 @@ ReturnCode_t WaitSetImpl::attach_condition(
             // Should wake_up when adding a new triggered condition
             if (is_waiting_ && condition.get_trigger_value())
             {
-                wake_up();
+                cond_.notify_one();
             }
         }
     }
@@ -156,6 +156,7 @@ ReturnCode_t WaitSetImpl::get_conditions(
 
 void WaitSetImpl::wake_up()
 {
+    std::lock_guard<std::mutex> guard(mutex_);
     cond_.notify_one();
 }
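
To illustrate the pattern in isolation (a minimal sketch, not Fast-DDS code; triggered, waiter and the function names are made up for illustration): if the notifier does not take the waiter's mutex, notify_one() can fire in the window between the waiter evaluating its predicate (false) and actually blocking inside wait(), and that notification is lost. Taking mutex_ before notifying, as in the patch above, means the notifier either runs before the waiter checks the predicate (which then sees the trigger) or after the waiter is already blocked (which then receives the notification):

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mutex_;                    // analogue of WaitSetImpl::mutex_
std::condition_variable cond_;        // analogue of WaitSetImpl::cond_
std::atomic<bool> triggered{false};   // analogue of a condition's trigger value

// Analogue of WaitSetImpl::wait(): predicate checked under the lock, then block.
void waiter()
{
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [] { return triggered.load(); });
}

// Buggy wake-up: the notification may fire after the waiter evaluated the
// predicate (false) but before it blocked inside wait(); the signal is lost
// and the waiter can sleep forever.
void wake_up_buggy()
{
    triggered = true;
    cond_.notify_one();
}

// Fixed wake-up (as in the patch above): acquiring mutex_ first cannot overlap
// with that window -- either the waiter has not checked the predicate yet and
// will see triggered == true, or it is already blocked and gets the notification.
void wake_up_fixed()
{
    triggered = true;
    std::lock_guard<std::mutex> guard(mutex_);
    cond_.notify_one();
}

int main()
{
    std::thread t(waiter);
    wake_up_fixed();   // with wake_up_buggy() this join could hang
    t.join();
    return 0;
}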

Could you help double-check whether the issue still happens after applying the above patch to Fast-DDS?

@MiguelCompany @fujitatomoya @llapx Do you have any suggestions?

@MiguelCompany
Contributor

@iuhilnehc-ynos I think you may be right. The fix seems an easy one, but adding a regression test might be difficult. I'll try to think of something ...

@jefferyyjhsu
Contributor Author

Thanks @iuhilnehc-ynos, I tested the proposed fix and the example now passes reliably.

@fujitatomoya
Collaborator

@iuhilnehc-ynos #2039 (comment) is deep, but I think the proposed change makes sense and actually fixes the problem reported by this issue.

@fujitatomoya
Collaborator

@iuhilnehc-ynos can you make a PR for Fast-DDS?

@iuhilnehc-ynos
Collaborator

@fujitatomoya

Based on #2039 (comment), I think @MiguelCompany will create a PR with a regression test for it by himself.

@MiguelCompany
Contributor

@iuhilnehc-ynos Please open the PR, and we'll add the regression test to it.

@clalancette added the help wanted (Extra attention is needed) label on Dec 1, 2022
@MarcoMatteoBassa

MarcoMatteoBassa commented Jan 2, 2023

Just a note: this seems to be the same issue reported in osrf/docker_images#628.
It seems to be fixed when using osrf/ros2:nightly with the default DDS, but not on the rolling docker image yet.

@fujitatomoya
Collaborator

I think we can close this, since the fixes (rclcpp #2044 and the corresponding Fast-DDS patch) are merged now. Please reopen this issue, or open another one, if anything is missing.
