SimpleActionClient sometimes not subscribe result topic #119

Open
k-sawa opened this issue Nov 5, 2018 · 10 comments

k-sawa commented Nov 5, 2018

I posted this issue on ROS Answers but did not get a clear answer. I think it is a new bug in ROS, so I am raising an issue here. The original post is here.

Reproduction procedure

I have simple reproduction code here; it performs an action call in a loop. In my environment, waitForResult() times out because the client fails to subscribe to the result topic, after roughly 500 to 10,000 tries. I'm running ROS Kinetic, actionlib 1.11.13, ros-comm 1.12.14, and Ubuntu 16.04.5 LTS AMD64, with kernel 4.15.0.

Possible Sources of Error

To investigate further, I checked result_pub_.getNumSubscribers() in publishResult() in action_server_impl.h. The number of subscribers is one on the first execution and then becomes two. When a timeout occurred, the number of subscribers on the result topic had decreased. Perhaps, when running in the loop, one of the subscribers belongs to the previous client and the other to the current client.
I suppose it is a bug where the subscriber sometimes fails to register, so it may occur not only in the loop but also on the first execution.

Result of rostest: catkin_make run_tests

[ INFO] [1541405691.737194538]: getNumSubscribers() = 1
[ INFO] [1541405691.737274368]: setSucceeded [1]
[ INFO] [1541405692.039136920]: getNumSubscribers() = 2
[ INFO] [1541405692.039179377]: setSucceeded [2]
[ INFO] [1541405692.339483435]: getNumSubscribers() = 2
[ INFO] [1541405692.339545305]: setSucceeded [3]
...
[ INFO] [1541405847.010855785]: getNumSubscribers() = 2
[ INFO] [1541405847.010914241]: setSucceeded [511]
[ INFO] [1541405847.311017810]: getNumSubscribers() = 2
[ INFO] [1541405847.311081457]: setSucceeded [512]
[ INFO] [1541405847.611815542]: getNumSubscribers() = 1
[ INFO] [1541405847.611892230]: setSucceeded [513]
testSimpleClient ... ok
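
The drop from 2 to 1 subscribers at try 513 in the log above can be spotted mechanically in longer runs. The following is a minimal sketch, assuming the log format shown above; parseSubscriberCount and findSubscriberDrops are hypothetical helper names and are not part of actionlib or roscpp.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Hypothetical helper: extracts N from a line containing
// "getNumSubscribers() = N". Returns false for any other line.
bool parseSubscriberCount(const std::string& line, int& count) {
  static const std::string key = "getNumSubscribers() = ";
  const std::size_t pos = line.find(key);
  if (pos == std::string::npos) return false;
  try {
    count = std::stoi(line.substr(pos + key.size()));
  } catch (const std::exception&) {
    return false;  // key found, but no number followed it
  }
  return true;
}

// Hypothetical helper: returns the 0-based indices of the log lines whose
// subscriber count is lower than the previously reported count.
std::vector<std::size_t> findSubscriberDrops(const std::vector<std::string>& lines) {
  std::vector<std::size_t> drops;
  bool have_previous = false;
  int previous = 0;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    int current = 0;
    if (!parseSubscriberCount(lines[i], current)) continue;
    if (have_previous && current < previous) drops.push_back(i);
    previous = current;
    have_previous = true;
  }
  return drops;
}
```

Feeding the captured test output to findSubscriberDrops line by line would flag only the lines where the count falls, such as the getNumSubscribers() = 1 line that precedes setSucceeded [513].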

k-okada commented Nov 8, 2018

Thank you for the detailed report. I just noticed that your example code works if we change it to:

  ros::init(argc, argv, "TestClient");
  // The second argument of SimpleActionClient is "spin_thread", so we do not need this:
  // http://docs.ros.org/melodic/api/actionlib/html/classactionlib_1_1SimpleActionClient.html#a47b97ba81c538372b6f128ed8b285fbc
  // boost::thread spin_thread(&spinThread);

  // We may not need to instantiate the ActionClient on every loop iteration
  actionlib::SimpleActionClient<control_msgs::FollowJointTrajectoryAction> ac("follow_joint_trajectory_action", true);
  ac.waitForServer();
  while (ros::ok())
  {
    ac.sendGoal(goal);
...
  }
  ros::shutdown();
  // spin_thread.join();

Do you have any good reason to start your spin_thread and initialize ac every time?

k-sawa commented Nov 8, 2018

Thank you for testing and commenting on my code.
My concern is whether the topic subscription error can also occur at node initialization.
Creating an actionlib client instance on every loop iteration is not the essence of this issue.
Would it be better to test by repeatedly launching and exiting the test node?

k-sawa closed this as completed Nov 8, 2018

k-sawa commented Nov 8, 2018

Sorry, I accidentally closed this, so I am reopening it.

k-sawa reopened this Nov 8, 2018

k-okada commented Nov 8, 2018 via email

k-sawa commented Nov 9, 2018

There is no good reason to initialize the action client on every loop iteration in normal use.
I wrote the looping test code to investigate the initial subscription behavior of the action client.

k-sawa commented Nov 15, 2018

After additional investigation, I think this is a reconnection issue in the action client.

I wrote another reproduction program that repeatedly launches both the action server and the client, and the issue did not reproduce in 13,000 tries. I then looped only the launch and exit of the action client while keeping the action server running, and the issue reproduced twice in 8,609 tries. The code is here.

Reconnection of the action client can be expected to occur occasionally in practice, for example because of the respawn option in a launch file.

@fujitatomoya
Contributor

+1

@jschleicher
Contributor

This issue is probably what currently makes the tests flaky; see the unstable build for PR #158:

14:12:55 [Testcase: testtest_cpp_simple_client] ... ERROR!
14:12:55 ERROR: max time [60.0s] allotted for test [test_cpp_simple_client] of type [actionlib/actionlib-exercise_simple_client]

With additional debug output, it seems like the server receives the goal, but the test node doesn't get the result.

ndepal commented Aug 5, 2021

I have a roscpp node that has an actionlib SimpleActionClient. During startup, my node calls ac.waitForServer(). This sometimes hangs indefinitely and sometimes takes several minutes, even though the ROS node serving the action is already up.

The debug output of the ros.actionlib.ConnectionMonitor logger of the client node shows:

isServerConnected: Client has not yet connected to feedback topic of server

Which is printed here.

Doing a rostopic echo /my_action/feedback does show the server node as the publisher as well as the client node as a subscriber.

Doing a rostopic pub /my_action/feedback ... will get the ac.waitForServer() to complete right away.

Environment

I am running this on ROS melodic. Both server and client are C++ nodes.
roscore, the server, and the client each run inside their own Docker container and communicate over a bridged Docker network. I can rosnode ping each node from every other container, and publishing/subscribing works just fine, so I know the communication between the nodes is fine in principle. The only things that hang are the ac.waitForServer() calls.

I have never seen this issue when running the nodes outside of Docker.

ndepal commented Aug 6, 2021

I think I found out what was causing the issue for me: my Docker containers had a much higher nofile ulimit than the host. Setting it to 1024 (which is what the Ubuntu host OS also has) seems to resolve the issue.

I found the solution thanks to #93 (comment).

I'm guessing something like moby/moby#38814 is happening:

In particular, RLIMIT_NOFILE, a number of open files limit, which is set to 2^20 (aka 1048576), causes a slowdown in a number of programs, as they use the upper limit value to iterate over all potentially opened file descriptors

See also ros/ros_comm#1122
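
For anyone who wants to check or cap the limit from inside a process instead of through the container configuration, here is a minimal sketch using the POSIX getrlimit and setrlimit calls. It assumes Linux; capSoftNofileLimit is a hypothetical helper name, and the value 1024 is simply the one reported above, not a recommendation from actionlib or ros_comm.

```cpp
#include <sys/resource.h>

#include <cassert>
#include <cstdio>

// Hypothetical helper: lowers the soft RLIMIT_NOFILE to at most `cap`,
// leaving the hard limit untouched. It never raises the limit.
// Returns false if the limit could not be read or changed.
bool capSoftNofileLimit(rlim_t cap) {
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0) {
    std::perror("getrlimit");
    return false;
  }
  if (limit.rlim_cur <= cap) return true;  // already at or below the cap
  limit.rlim_cur = cap;                    // soft limit only
  if (setrlimit(RLIMIT_NOFILE, &limit) != 0) {
    std::perror("setrlimit");
    return false;
  }
  return true;
}

// Prints the current soft and hard limits so they can be compared with the host.
void printNofileLimit() {
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) == 0) {
    std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
                static_cast<unsigned long long>(limit.rlim_cur),
                static_cast<unsigned long long>(limit.rlim_max));
  }
}
```

Calling capSoftNofileLimit(1024) early in a process, before anything that iterates over file descriptors, should have a similar effect for that process and its children as lowering the container limit. With Docker, the same limit can be set per container with the --ulimit nofile=1024:1024 option of docker run; the comment above does not say which method was used.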
