feat: add autoware_node_death_monitor package for monitoring node crashes

Signed-off-by: Kyoichi Sugahara <kyoichi.sugahara@tier4.jp>
kyoichi-sugahara committed Feb 7, 2025
1 parent c5f0a24 commit 8ef11c7
Showing 7 changed files with 526 additions and 0 deletions.
18 changes: 18 additions & 0 deletions system/autoware_node_death_monitor/CMakeLists.txt
@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.14)
project(autoware_node_death_monitor)

find_package(autoware_cmake REQUIRED)
autoware_package()

ament_auto_add_library(${PROJECT_NAME} SHARED
  src/autoware_node_death_monitor.cpp
)

rclcpp_components_register_node(${PROJECT_NAME}
  PLUGIN "autoware::node_death_monitor::NodeDeathMonitor"
  EXECUTABLE ${PROJECT_NAME}_node)

ament_auto_package(INSTALL_TO_SHARE
  config
  launch
)
73 changes: 73 additions & 0 deletions system/autoware_node_death_monitor/README.md
@@ -0,0 +1,73 @@
# autoware_node_death_monitor

This package provides a monitoring node that detects ROS 2 node crashes by analyzing `launch.log` files, rather than subscribing to `/rosout` logs.

---

## Overview

- **Node name**: `autoware_node_death_monitor`
- **Monitored file**: `launch.log`
- **Detected event**: Looks for lines containing the substring `"process has died"` and extracts the node name and exit code.

When a crash or unexpected shutdown occurs, `ros2 launch` typically outputs a line in `launch.log` such as:

```bash
[ERROR] [node_name-1]: process has died [pid 12345, exit code 139, cmd '...']
```

The `autoware_node_death_monitor` node continuously reads the latest `launch.log` file, detects these messages, and logs a warning or marks the node as "dead."
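
As an illustration of the extraction step, the following is a minimal sketch of how the process name and exit code could be pulled out of such a line with `std::regex`. The `ProcessDeath` struct and `parse_death_line` function are hypothetical names introduced here for illustration, and the pattern assumes the line layout shown above; the package's actual parser may differ.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for illustration; not taken from the package source.
struct ProcessDeath
{
  std::string process_name;  // e.g. "node_name-1"
  int exit_code;
};

// Returns the process name and exit code if the line reports a died process,
// or std::nullopt for any other log line.
std::optional<ProcessDeath> parse_death_line(const std::string & line)
{
  // Assumed layout: [node_name-1]: process has died [pid 12345, exit code 139, ...
  static const std::regex pattern(
    R"(\[([^\]]+)\]: process has died \[pid \d+, exit code (-?\d+))");

  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  return ProcessDeath{match[1].str(), std::stoi(match[2].str())};
}
```

The pattern accepts an optional minus sign on the assumption that an exit code may be reported as a negative number when a process is terminated by a signal.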

---

## How it Works

1. **Find `launch.log`**:
   - First, checks the `ROS_LOG_DIR` environment variable.
   - If not set, falls back to `~/.ros/log`.
   - Identifies the latest log directory based on modification time.
2. **Monitor `launch.log`**:
   - Reads the file from the last known position to detect new log entries.
   - Looks for lines containing `"process has died"`.
   - Extracts the node name and exit code.
3. **Filtering**:
   - **Ignored node names**: Nodes matching patterns in `ignore_node_names` are skipped.
   - **Ignored exit codes**: Logs with ignored exit codes are not flagged as errors.
4. **Regular Updates**:
   - A timer periodically reads new entries from `launch.log`.
   - Dead nodes are currently reported in the node's log output. This is planned to change to publishing diagnostics.
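
The discovery and incremental-read steps above can be sketched as follows. This is a minimal illustration using only the C++ standard library, not the package's implementation: the function names `resolve_log_root`, `find_latest_launch_log`, and `read_new_lines` are hypothetical, and the sketch assumes each launch run writes `launch.log` into its own subdirectory of the log root.

```cpp
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Resolve the log root: ROS_LOG_DIR if set, otherwise ~/.ros/log.
fs::path resolve_log_root()
{
  const char * dir = std::getenv("ROS_LOG_DIR");
  if (dir != nullptr && *dir != '\0') {
    return fs::path(dir);
  }
  const char * home = std::getenv("HOME");
  return fs::path(home != nullptr ? home : "") / ".ros" / "log";
}

// Pick launch.log from the most recently modified subdirectory that has one.
std::optional<fs::path> find_latest_launch_log(const fs::path & log_root)
{
  std::optional<fs::path> latest;
  fs::file_time_type latest_time{};
  std::error_code ec;
  for (const auto & entry : fs::directory_iterator(log_root, ec)) {
    const fs::path candidate = entry.path() / "launch.log";
    if (!entry.is_directory(ec) || !fs::exists(candidate, ec)) {
      continue;
    }
    const auto modified = fs::last_write_time(entry.path(), ec);
    if (ec) {
      continue;
    }
    if (!latest || modified > latest_time) {
      latest = candidate;
      latest_time = modified;
    }
  }
  return latest;
}

// Return the complete lines appended since last_pos and advance last_pos.
// A trailing line without a newline is left for the next call.
std::vector<std::string> read_new_lines(const fs::path & file, std::streamoff & last_pos)
{
  std::vector<std::string> lines;
  std::ifstream in(file, std::ios::binary);
  if (!in) {
    return lines;
  }
  in.seekg(0, std::ios::end);
  const std::streamoff size = in.tellg();
  if (size < last_pos) {
    last_pos = 0;  // the file was truncated or replaced; start over
  }
  in.seekg(last_pos, std::ios::beg);

  std::string line;
  while (std::getline(in, line)) {
    if (in.eof()) {
      break;  // partial last line: do not consume it yet
    }
    lines.push_back(line);
    last_pos = in.tellg();
  }
  return lines;
}
```

A timer callback could call `read_new_lines` on each tick and pass every returned line to the parser. Tracking the offset as `std::streamoff` is a choice made for this sketch only.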

---

## Parameters

| Parameter Name | Type | Default | Description |
| ------------------- | ---------- | ----------------- | ---------------------------------------------------------- |
| `ignore_node_names` | `string[]` | `[]` (empty list) | Node name patterns to ignore. E.g., `['rviz2']`. |
| `ignore_exit_codes` | `int[]` | `[]` (empty list) | Exit codes to ignore (e.g., `0` or `130` for normal exit). |
| `check_interval` | `double` | `1.0` | Timer interval (seconds) for scanning the log file. |
| `enable_debug` | `bool` | `false` | Enables debug logging for detailed output. |

Example **`autoware_node_death_monitor.param.yaml`**:

```yaml
autoware_node_death_monitor:
  ros__parameters:
    ignore_node_names:
      - rviz2
      - teleop_twist_joy
    ignore_exit_codes:
      - 0
      - 130
    check_interval: 1.0
    enable_debug: false
```
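
To make the effect of the two ignore lists concrete, here is a minimal sketch of the filtering decision. It assumes that a "pattern" in `ignore_node_names` is matched as a plain substring of the process name (for example, `rviz2` matching `rviz2-3`); the source does not specify the matching rule, and the function names below are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// True if the process name contains any configured pattern as a substring.
// Substring matching is an assumption of this sketch.
bool is_ignored_name(
  const std::string & process_name, const std::vector<std::string> & ignore_node_names)
{
  return std::any_of(
    ignore_node_names.begin(), ignore_node_names.end(), [&](const std::string & pattern) {
      return !pattern.empty() && process_name.find(pattern) != std::string::npos;
    });
}

// True if the exit code is listed in ignore_exit_codes.
bool is_ignored_exit_code(
  std::int64_t exit_code, const std::vector<std::int64_t> & ignore_exit_codes)
{
  return std::find(ignore_exit_codes.begin(), ignore_exit_codes.end(), exit_code) !=
         ignore_exit_codes.end();
}

// A death is reported only if neither filter applies.
bool should_report(
  const std::string & process_name, std::int64_t exit_code,
  const std::vector<std::string> & ignore_node_names,
  const std::vector<std::int64_t> & ignore_exit_codes)
{
  return !is_ignored_name(process_name, ignore_node_names) &&
         !is_ignored_exit_code(exit_code, ignore_exit_codes);
}
```

With the example configuration above, a crash of `rviz2-3` or any process exiting with code `130` would be skipped under these assumptions, while another process exiting with code `139` would still be reported.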
---
## Limitations
- **To be written**: TBD.
- **Robust Monitoring**: Works alongside systemd, supervisord, or other process supervisors for enhanced fault detection.

---
@@ -0,0 +1,18 @@
/**:
  ros__parameters:
    # Node names to exclude from monitoring (Note: be careful with the "[node_name-#]" format)
    # Example: Do not issue a warning if rviz2 crashes.
    ignore_node_names:
      - rviz2

    # Exit codes to exclude from monitoring (e.g., Ctrl+C)
    # Example: 0, 130 are considered normal exits and not treated as errors.
    ignore_exit_codes:
      - 0
      - 130

    # Check interval (seconds)
    check_interval: 1.0

    # Enable/disable debug output
    enable_debug: false
@@ -0,0 +1,72 @@
// Copyright 2025 Tier IV, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
#define AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_

#include "rclcpp/rclcpp.hpp"

#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>

namespace autoware::node_death_monitor
{

class NodeDeathMonitor : public rclcpp::Node
{
public:
  /**
   * @brief Constructor for NodeDeathMonitor
   * @param options Node options for configuration
   */
  explicit NodeDeathMonitor(const rclcpp::NodeOptions & options);

private:
  /**
   * @brief Read and process new content appended to launch.log
   */
  void read_launch_log_diff();

  /**
   * @brief Parse a single line from the log for process death information
   * @param line The log line to parse
   */
  void parse_log_line(const std::string & line);

  /**
   * @brief Timer callback to report and manage dead node list
   */
  void on_timer();

  // Map to track dead nodes: [node_name-#] -> true
  std::unordered_map<std::string, bool> dead_nodes_;

  rclcpp::TimerBase::SharedPtr timer_;

  // Launch log file path and read position
  std::filesystem::path launch_log_path_;
  size_t last_file_pos_{static_cast<size_t>(-1)};

  // Parameters
  std::vector<std::string> ignore_node_names_;  // Node names to exclude from monitoring
  std::vector<int64_t> ignore_exit_codes_;      // Exit codes to ignore (e.g., normal termination)
  double check_interval_{1.0};                  // Check interval in seconds
  bool enable_debug_{false};                    // Enable debug output
};

} // namespace autoware::node_death_monitor

#endif // AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
@@ -0,0 +1,12 @@
<launch>
  <!-- Parameter -->
  <arg name="config_file" default="$(find-pkg-share autoware_node_death_monitor)/config/autoware_node_death_monitor.param.yaml"/>

  <!-- Set log level -->
  <arg name="log_level" default="info"/>

  <node pkg="autoware_node_death_monitor" exec="autoware_node_death_monitor_node" name="node_death_monitor" output="screen" args="--ros-args --log-level $(var log_level)">
    <!-- Parameter -->
    <param from="$(var config_file)"/>
  </node>
</launch>
23 changes: 23 additions & 0 deletions system/autoware_node_death_monitor/package.xml
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<package format="3">
  <name>autoware_node_death_monitor</name>
  <version>0.0.1</version>
  <description>The node_death_monitor package</description>

  <maintainer email="kyoichi.sugahara@tier4.jp">Kyoichi Sugahara</maintainer>
  <license>Apache License 2.0</license>

  <buildtool_depend>ament_cmake_auto</buildtool_depend>
  <buildtool_depend>autoware_cmake</buildtool_depend>

  <depend>rcl_interfaces</depend>
  <depend>rclcpp</depend>
  <depend>rclcpp_components</depend>

  <test_depend>ament_cmake_gtest</test_depend>
  <test_depend>ament_lint_auto</test_depend>

  <export>
    <build_type>ament_cmake</build_type>
  </export>
</package>