forked from autowarefoundation/autoware.universe
Commit
feat: add autoware_node_death_monitor package for monitoring node crashes
Signed-off-by: Kyoichi Sugahara <kyoichi.sugahara@tier4.jp>
1 parent c5f0a24, commit 8ef11c7
Showing 7 changed files with 526 additions and 0 deletions.
@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.14)
project(autoware_node_death_monitor)

find_package(autoware_cmake REQUIRED)
autoware_package()

ament_auto_add_library(${PROJECT_NAME} SHARED
  src/autoware_node_death_monitor.cpp
)

rclcpp_components_register_node(${PROJECT_NAME}
  PLUGIN "autoware::node_death_monitor::NodeDeathMonitor"
  EXECUTABLE ${PROJECT_NAME}_node)

ament_auto_package(INSTALL_TO_SHARE
  config
  launch
)
@@ -0,0 +1,73 @@
# autoware_node_death_monitor

This package provides a monitoring node that detects ROS 2 node crashes by analyzing `launch.log` files, rather than subscribing to `/rosout` logs.

---

## Overview

- **Node name**: `autoware_node_death_monitor`
- **Monitored file**: `launch.log`
- **Detected event**: Looks for lines containing the substring `"process has died"` and extracts the node name and exit code.

When a crash or unexpected shutdown occurs, `ros2 launch` typically outputs a line in `launch.log` such as:

```text
[ERROR] [node_name-1]: process has died [pid 12345, exit code 139, cmd '...']
```
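
As a rough illustration, a line in this format could be parsed with a substring check followed by a regular expression. This is a minimal sketch, not the package's actual implementation: the `ProcessDeath` struct, the `parse_death_line` function, and the exact regex are assumptions made for this example, and it relies only on the C++17 standard library. Calling `parse_death_line` on the example line above should yield the node name `node_name-1` and the exit code `139`.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for illustration; not taken from the package source.
struct ProcessDeath
{
  std::string node_name;  // text inside the brackets, e.g. "node_name-1"
  int exit_code;
};

// Returns the node name and exit code if `line` reports a dead process,
// or std::nullopt for any other line.
std::optional<ProcessDeath> parse_death_line(const std::string & line)
{
  // Cheap substring check first, so the regex only runs on candidate lines.
  if (line.find("process has died") == std::string::npos) {
    return std::nullopt;
  }
  // Assumed layout: "[<name>]: process has died [pid <n>, exit code <n>, ..."
  static const std::regex pattern(
    R"(\[([^\]]+)\]:\s*process has died\s*\[pid\s+\d+,\s*exit code\s+(-?\d+))");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  return ProcessDeath{match[1].str(), std::stoi(match[2].str())};
}
```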

The `autoware_node_death_monitor` node continuously reads the latest `launch.log` file, detects these messages, and logs a warning or marks the node as "dead."

---

## How it Works

1. **Find `launch.log`**:
   - First, checks the `ROS_LOG_DIR` environment variable.
   - If not set, falls back to `~/.ros/log`.
   - Identifies the latest log directory based on modification time.
2. **Monitor `launch.log`**:
   - Reads the file from the last known position to detect new log entries.
   - Looks for lines containing `"process has died"`.
   - Extracts the node name and exit code.
3. **Filtering**:
   - **Ignored node names**: Nodes matching patterns in `ignore_node_names` are skipped.
   - **Ignored exit codes**: Logs with ignored exit codes are not flagged as errors.
4. **Regular Updates**:
   - A timer periodically reads new entries from `launch.log`.
   - Dead nodes are currently reported in the logs; this is planned to change to publishing diagnostics.
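
The file discovery and incremental read described in steps 1 and 2 might look roughly like the following. This is a sketch under stated assumptions rather than the package's code: `resolve_log_root`, `find_latest_launch_log`, and `read_new_lines` are hypothetical helper names, only directories that already contain a `launch.log` are considered, and a file that shrinks is treated as replaced and re-read from the start. A caller would keep the offset between timer ticks and pass each returned line to the parser.

```cpp
#include <cstdint>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Step 1a: ROS_LOG_DIR if set and non-empty, otherwise ~/.ros/log.
fs::path resolve_log_root()
{
  const char * log_dir = std::getenv("ROS_LOG_DIR");
  if (log_dir != nullptr && *log_dir != '\0') {
    return fs::path(log_dir);
  }
  const char * home = std::getenv("HOME");
  return fs::path(home != nullptr ? home : "") / ".ros" / "log";
}

// Step 1b: launch.log inside the most recently modified subdirectory of
// log_root, or an empty path if no subdirectory contains one.
fs::path find_latest_launch_log(const fs::path & log_root)
{
  fs::path latest_log;
  fs::file_time_type latest_time{};
  std::error_code ec;
  for (const auto & entry : fs::directory_iterator(log_root, ec)) {
    const fs::path candidate = entry.path() / "launch.log";
    if (!entry.is_directory(ec) || !fs::exists(candidate, ec)) {
      continue;
    }
    const auto modified = fs::last_write_time(entry.path(), ec);
    if (!ec && (latest_log.empty() || modified > latest_time)) {
      latest_log = candidate;
      latest_time = modified;
    }
  }
  return latest_log;
}

// Step 2: returns the complete lines appended since last_pos and advances
// last_pos past them. A trailing line without '\n' is left for the next call.
std::vector<std::string> read_new_lines(const fs::path & log_path, std::uintmax_t & last_pos)
{
  std::vector<std::string> lines;
  std::error_code ec;
  const std::uintmax_t size = fs::file_size(log_path, ec);
  if (ec) {
    return lines;
  }
  if (size < last_pos) {
    last_pos = 0;  // assumption: a shrunken file was truncated or replaced
  }
  std::ifstream file(log_path);
  if (!file) {
    return lines;
  }
  file.seekg(static_cast<std::streamoff>(last_pos));
  std::string line;
  while (std::getline(file, line)) {
    if (file.eof()) {
      break;  // partial line: do not advance, re-read it once it is complete
    }
    lines.push_back(line);
    last_pos = static_cast<std::uintmax_t>(file.tellg());
  }
  return lines;
}
```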

---

## Parameters

| Parameter Name      | Type       | Default           | Description                                                            |
| ------------------- | ---------- | ----------------- | ---------------------------------------------------------------------- |
| `ignore_node_names` | `string[]` | `[]` (empty list) | Node name patterns to ignore. E.g., `['rviz2']`.                       |
| `ignore_exit_codes` | `int[]`    | `[]` (empty list) | Exit codes to ignore (e.g., `0` for a normal exit, `130` for Ctrl+C).  |
| `check_interval`    | `double`   | `1.0`             | Timer interval (seconds) for scanning the log file.                    |
| `enable_debug`      | `bool`     | `false`           | Enables debug logging for detailed output.                             |

Example **`autoware_node_death_monitor.param.yaml`**:

```yaml
autoware_node_death_monitor:
  ros__parameters:
    ignore_node_names:
      - rviz2
      - teleop_twist_joy
    ignore_exit_codes:
      - 0
      - 130
    check_interval: 1.0
    enable_debug: false
```
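
To make the filtering semantics concrete, the check against the two ignore lists could be sketched as below. The `should_ignore` name is hypothetical, and treating each entry of `ignore_node_names` as a plain substring (so `rviz2` also matches `rviz2-7`) is an assumption made for this example; the actual matching rule may differ. With the example parameters, a death of `rviz2-7` with exit code 139 or of any node with exit code 130 would be skipped, while `planner-3` (a made-up node name) with exit code 139 would be reported.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Returns true if a death report should be skipped under the two ignore lists.
bool should_ignore(
  const std::string & node_name, std::int64_t exit_code,
  const std::vector<std::string> & ignore_node_names,
  const std::vector<std::int64_t> & ignore_exit_codes)
{
  // Exit codes are compared exactly.
  const bool code_ignored =
    std::find(ignore_exit_codes.begin(), ignore_exit_codes.end(), exit_code) !=
    ignore_exit_codes.end();

  // Assumption: a "pattern" is a plain substring of the launch process name.
  const bool name_ignored = std::any_of(
    ignore_node_names.begin(), ignore_node_names.end(),
    [&node_name](const std::string & pattern) {
      return !pattern.empty() && node_name.find(pattern) != std::string::npos;
    });

  return code_ignored || name_ignored;
}
```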
---

## Limitations

- **To be written later**: TBD.
- **Robust Monitoring**: Works alongside systemd, supervisord, or other process supervisors for enhanced fault detection.

---
system/autoware_node_death_monitor/config/autoware_node_death_monitor.param.yaml
18 changes: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
/**:
  ros__parameters:
    # Node names to exclude from monitoring (Note: be careful with the "[node_name-#]" format)
    # Example: Do not issue a warning if rviz2 crashes.
    ignore_node_names:
      - rviz2

    # Exit codes to exclude from monitoring (e.g., Ctrl+C)
    # Example: 0, 130 are considered normal exits and not treated as errors.
    ignore_exit_codes:
      - 0
      - 130

    # Check interval (seconds)
    check_interval: 1.0

    # Enable/disable debug output
    enable_debug: false
...re_node_death_monitor/include/autoware_node_death_monitor/autoware_node_death_monitor.hpp
72 changes: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
// Copyright 2025 Tier IV, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
#define AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_

#include "rclcpp/rclcpp.hpp"

#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>

namespace autoware::node_death_monitor
{

class NodeDeathMonitor : public rclcpp::Node
{
public:
  /**
   * @brief Constructor for NodeDeathMonitor
   * @param options Node options for configuration
   */
  explicit NodeDeathMonitor(const rclcpp::NodeOptions & options);

private:
  /**
   * @brief Read and process new content appended to launch.log
   */
  void read_launch_log_diff();

  /**
   * @brief Parse a single line from the log for process death information
   * @param line The log line to parse
   */
  void parse_log_line(const std::string & line);

  /**
   * @brief Timer callback to report and manage dead node list
   */
  void on_timer();

  // Map to track dead nodes: [node_name-#] -> true
  std::unordered_map<std::string, bool> dead_nodes_;

  rclcpp::TimerBase::SharedPtr timer_;

  // Launch log file path and read position
  std::filesystem::path launch_log_path_;
  size_t last_file_pos_{static_cast<size_t>(-1)};

  // Parameters
  std::vector<std::string> ignore_node_names_;  // Node names to exclude from monitoring
  std::vector<int64_t> ignore_exit_codes_;      // Exit codes to ignore (e.g., normal termination)
  double check_interval_{1.0};                  // Check interval in seconds
  bool enable_debug_{false};                    // Enable debug output
};

}  // namespace autoware::node_death_monitor

#endif  // AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
system/autoware_node_death_monitor/launch/autoware_node_death_monitor.launch.xml
12 changes: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
<launch>
  <!-- Parameter -->
  <arg name="config_file" default="$(find-pkg-share autoware_node_death_monitor)/config/autoware_node_death_monitor.param.yaml"/>

  <!-- Set log level -->
  <arg name="log_level" default="info"/>

  <node pkg="autoware_node_death_monitor" exec="autoware_node_death_monitor_node" name="node_death_monitor" output="screen" args="--ros-args --log-level $(var log_level)">
    <!-- Parameter -->
    <param from="$(var config_file)"/>
  </node>
</launch>
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<package format="3">
  <name>autoware_node_death_monitor</name>
  <version>0.0.1</version>
  <description>The node_death_monitor package</description>

  <maintainer email="kyoichi.sugahara@tier4.jp">Kyoichi Sugahara</maintainer>
  <license>Apache License 2.0</license>

  <buildtool_depend>ament_cmake_auto</buildtool_depend>
  <buildtool_depend>autoware_cmake</buildtool_depend>

  <depend>rcl_interfaces</depend>
  <depend>rclcpp</depend>
  <depend>rclcpp_components</depend>

  <test_depend>ament_cmake_gtest</test_depend>
  <test_depend>ament_lint_auto</test_depend>

  <export>
    <build_type>ament_cmake</build_type>
  </export>
</package>