Safety Manager Design

Joshua Williams edited this page Oct 29, 2021 · 3 revisions

Safety Manager

The safety manager is, admittedly, going to be a lot of tedious work that we hope is never needed. However, if it ever is needed, it could save lives.

Responsibilities

The safety manager is responsible for handling technical failures of the system and cases where a human needs to take over operation, not for ordinary operation! This means that many things that could be classified as "safety" are not in fact the job of the safety manager. For example, although they are safety related, the following are NOT responsibilities of this node:

  • Keeping our speed under the speed limit
  • Stopping at stop signs
  • Steering around an obstacle (exception: if the software fails to find a path and a human driver needs to take over, it is the safety manager that steps in)

In general, anything that occurs on a regular basis or is already built into the rest of the software does not need to involve the safety manager. The responsibilities of this node, then, are as follows:

  • Monitor system vitals
  • Monitor heartbeats
  • Monitor safety events
  • Assign recovery strategies
  • Trigger alarm
  • Manage Echo node

Monitor System Vitals

The safety manager will monitor certain system-wide elements for anomalous operation. Some examples that we need to monitor are:

  • CPU, GPU, RAM, and disk usage - These should fall within some acceptable range
  • CAN bus - This acts as a "heartbeat" of sorts for the entire system
  • Temperature - An overheated processor counts as a safety issue for sure!
  • Power - This is not the battery level (which should be monitored by whichever node decodes it), but rather voltage to the onboard computer. If this drops, the safety manager should handle it in whatever way is appropriate
  • Issues with ROS itself - this will take some research to properly implement
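
The range checks described above could be sketched as follows. This is a minimal illustration, not a settled design: the vital names and every numeric limit are placeholders invented for the example, and the readings are assumed to arrive as a plain dictionary from whatever code samples the hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VitalRange:
    """Acceptable range for one system vital (bounds are inclusive)."""
    name: str
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Placeholder limits for illustration only; real limits need to be measured.
VITAL_RANGES = [
    VitalRange("cpu_percent", 0.0, 90.0),
    VitalRange("ram_percent", 0.0, 85.0),
    VitalRange("cpu_temp_c", 0.0, 80.0),
    VitalRange("supply_voltage_v", 11.0, 15.0),
]


def find_anomalies(readings: dict) -> list:
    """Return the names of vitals that are missing or out of range."""
    anomalies = []
    for vital in VITAL_RANGES:
        value = readings.get(vital.name)
        # A missing reading is treated as anomalous rather than ignored.
        if value is None or not vital.contains(value):
            anomalies.append(vital.name)
    return anomalies
```

Treating a missing reading as an anomaly is a deliberate choice in this sketch: if a sensor stops reporting, that is itself something the safety manager should know about.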

Monitor Heartbeats

If a node can fail in such a way that the failure will not be noticed by other nodes and raised as a safety issue, it may need a heartbeat. This may not be necessary for any nodes, or it may be necessary for several, depending on how we design the rest of the system. For any such nodes, the safety manager should listen for the heartbeat.
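
A minimal sketch of the bookkeeping this implies, assuming each monitored node gets its own timeout. The class name, node names, and timeout values are hypothetical; in the real node, `beat` would be called from a ROS subscriber callback and `overdue` from a timer.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports overdue nodes."""

    def __init__(self, timeouts, clock=time.monotonic):
        # timeouts maps node name -> max allowed seconds between heartbeats
        self._timeouts = dict(timeouts)
        self._clock = clock
        start = clock()
        # Start timing at construction so a node that never beats is caught.
        self._last_seen = {name: start for name in self._timeouts}

    def beat(self, node):
        """Record a heartbeat from a monitored node."""
        if node in self._timeouts:
            self._last_seen[node] = self._clock()

    def overdue(self):
        """Return the sorted names of nodes whose heartbeat is overdue."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._timeouts[name]
        )
```

The clock is injectable so that the timeout logic can be tested without waiting in real time.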

Monitor Safety Events

This node is responsible for monitoring a "safety event" topic on which all nodes can raise alarms when things go wrong. Communication must be reliable here! (See "Reliable Communication" under Implementation Notes.)

Each safety event message will have the following properties:

  • Event id - used to easily look up the event type
  • Description - a human-readable message to accompany the event
  • Timestamp - when this message was sent
  • Status - one of the following
    • Resolved - the event has been resolved
    • Attempting local resolution - the node raising the event has resources at its disposal to resolve the event
      • Example - the CAN bus has failed, but we are going to use the next 100ms to try restarting it at the OS level
    • Unresolved - the node can do no more on its own to resolve the event
  • Threat level - one of the following
    • MINOR - If the event is not corrected, we can simply keep driving indefinitely, possibly with some limitations
      • Example - Lost GPS signal (approximate position using odometry)
      • Example - Lost odometry (increase GPS polling frequency, use lidars to approximate)
    • MAJOR - If the event is not corrected, we cannot keep driving indefinitely, but we can pull off the road
      • Example - Front camera has died (remember locations of objects to pull over)
      • Example - Loss of a lidar (threatens localization, but not enough that we can't pull over)
    • SEVERE - If the event is not corrected, we cannot continue
      • Example - Loss of localization for more than a few seconds
      • Example - EPAS is disconnected from the CAN bus
  • Additional Data - a JSON string that contains additional helpful data about the event. This is for event-specific data. If a single field is common across multiple event types, consider adding it directly to the message itself.

Theoretically, the threat level and description could be omitted entirely, but they allow the safety manager to display and take action for events that don't have explicit strategies yet.
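
To make the message layout concrete, here is a sketch of the fields listed above as plain Python types. In practice this would probably be a ROS message definition; the field types (for example, a string event id), the enum values, and the `fallback_action` policy are assumptions made for the example. The fallback simply maps each threat level to the behavior described in its bullet.

```python
import json
from dataclasses import dataclass
from enum import Enum, IntEnum


class Status(Enum):
    RESOLVED = "resolved"
    ATTEMPTING_LOCAL_RESOLUTION = "attempting_local_resolution"
    UNRESOLVED = "unresolved"


class ThreatLevel(IntEnum):
    # Ordered so that levels can be compared (MINOR < MAJOR < SEVERE).
    MINOR = 1
    MAJOR = 2
    SEVERE = 3


@dataclass(frozen=True)
class SafetyEvent:
    event_id: str                 # assumed to be a string; could be an integer
    description: str              # human-readable message
    timestamp: float              # when the message was sent (seconds)
    status: Status
    threat_level: ThreatLevel
    additional_data: str = "{}"   # JSON string with event-specific data

    def data(self) -> dict:
        """Decode the event-specific JSON payload."""
        return json.loads(self.additional_data)


def fallback_action(event: SafetyEvent) -> str:
    """Hypothetical default for events that have no explicit strategy yet."""
    if event.status is not Status.UNRESOLVED:
        # Resolved, or the raising node is still trying: nothing to do yet.
        return "none"
    return {
        ThreatLevel.MINOR: "keep_driving",
        ThreatLevel.MAJOR: "pull_over",
        ThreatLevel.SEVERE: "cannot_continue",
    }[event.threat_level]
```

This is where the threat level earns its place in the message: an event the safety manager has never seen before can still be handled by severity alone.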

Assign Recovery Strategies

Each node will need to implement a set of "recovery strategies" that will allow operation under a particular anomalous circumstance. Each node will have a "control topic", which it monitors for commands. The safety manager is responsible for sending these commands (and also for releasing the recovery mode when operation returns to normal).

Communication on these control channels needs to be reliable. (See "Reliable Communication" under Implementation Notes.)

Trigger Alarm

The safety manager should control the "human intervention" alarm that will alert a human driver when they need to take the wheel. I'm not sure how to implement this yet, but it needs to be a sort of dead man's switch in the hardware (if, say, the Jetson is unplugged, the alarm needs to trigger). The safety manager needs to keep sending an "all clear" signal to keep the alarm off, so that the alarm will trigger if the safety manager itself is disrupted.
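
The real switch has to live in hardware, but the logic it needs can be modeled in a few lines. This sketch is only a model of that behavior: the class name and the timeout are hypothetical, and starting with the alarm active until the first "all clear" arrives is an assumption, not a decision made in this design.

```python
import time


class AlarmWatchdog:
    """Models the alarm side of a dead man's switch.

    The alarm stays silent only while a recent "all clear" has been received.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        # No all-clear yet, so the alarm starts active (assumed fail-safe default).
        self._last_all_clear = None

    def all_clear(self):
        """Called whenever the safety manager's all-clear signal arrives."""
        self._last_all_clear = self._clock()

    def alarm_active(self):
        """True if no all-clear has arrived within the timeout."""
        if self._last_all_clear is None:
            return True
        return self._clock() - self._last_all_clear > self._timeout_s
```

The key property is that silence triggers the alarm: the safety manager never sends a "raise alarm" command that could be lost, it only stops sending "all clear".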

Manage Echo Node

The echo node is a second node that monitors a heartbeat from the safety manager, and also sends a heartbeat to the safety manager. This provides a way to ensure that ROS messages are being passed correctly (and to measure any delays in message delivery), and also provides a backup should the safety manager fail (the echo node can restart it).

Implementation Notes

The following are a few notes that may be useful for the implementation of this node:

Splitting

This node has many responsibilities, and is something of a monolith. However, splitting it up into multiple nodes adds more points of failure (in their communication). So, this part of the code is probably best left as a single node.

Reliable Communication

We should look into what guarantees ROS makes regarding delivery of messages. If there is no mechanism to ensure delivery of critical messages (in finite time and in the correct order), we may need to implement one ourselves.

One way of doing this is as follows. Each stream of reliable messages has a unique id (generated randomly at the time of sending) which is attached to every message. Additionally, each message has a sequence number and an "acknowledgement" channel name. On receiving such a message, the receiver should send an acknowledgement on the named channel indicating that all messages up to that sequence number have been received. The sender resends any message that has not been acknowledged, and the receiver uses the sequence numbers to discard duplicates. This is modeled after the way TCP works, and while it can't guarantee that messages will arrive quickly, it can at least guarantee that every message will be resent until received, that no message will be processed twice, and that the order of messages will be preserved.
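
A minimal sketch of this scheme, with the transport left out entirely. The class names, the dictionary message layout, and the channel name in the usage note are all hypothetical. To stay short, the receiver drops out-of-order messages instead of buffering them and relies on the sender's resend timer to fill gaps in order; a real implementation might buffer.

```python
import uuid


class ReliableSender:
    def __init__(self, ack_channel):
        self.stream_id = uuid.uuid4().hex   # random id for this stream
        self.ack_channel = ack_channel
        self._next_seq = 0
        self._unacked = {}                  # seq -> message awaiting acknowledgement

    def make_message(self, payload):
        """Wrap a payload with the stream id, sequence number, and ack channel."""
        msg = {
            "stream_id": self.stream_id,
            "seq": self._next_seq,
            "ack_channel": self.ack_channel,
            "payload": payload,
        }
        self._unacked[self._next_seq] = msg
        self._next_seq += 1
        return msg

    def on_ack(self, acked_seq):
        """Cumulative ack: everything up to and including acked_seq arrived."""
        for seq in [s for s in self._unacked if s <= acked_seq]:
            del self._unacked[seq]

    def to_resend(self):
        """Unacknowledged messages, oldest first; call this on a timer."""
        return [self._unacked[s] for s in sorted(self._unacked)]


class ReliableReceiver:
    def __init__(self):
        self._expected = {}   # stream_id -> next sequence number to deliver
        self.delivered = []   # payloads handed to the application, in order

    def on_message(self, msg):
        """Process one message; return the cumulative ack to send, or None."""
        expected = self._expected.get(msg["stream_id"], 0)
        if msg["seq"] == expected:
            self.delivered.append(msg["payload"])
            expected += 1
            self._expected[msg["stream_id"]] = expected
        # Duplicates (seq < expected) and gaps (seq > expected) are dropped.
        return expected - 1 if expected > 0 else None
```

For example, if a sender created with `ReliableSender("/acks/planner")` produces messages 0, 1, and 2 and message 1 is lost, the receiver acknowledges 0 twice (once for message 0, once when it drops message 2), and the sender's `to_resend()` returns messages 1 and 2 until they are acknowledged.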

Testing

Testing this may be tedious, but it is obviously important. We need to trigger each type of event artificially and ensure that the car responds appropriately.

Event Strategies

The majority of the work (and maintenance) on this node will be building up a huge table of all the possible events and what to do in each case. This should NOT be a nested if statement or a switch statement - such constructs will quickly grow to be unusable. Rather, we need to develop a way to store this in some easy-to-read format.

One possibility would be to create a "strategy" interface that we subclass once for each event, allowing us to separate code for each event into its own file. We can then store these objects in a lookup table by event id, allowing them to be quickly loaded when needed.
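
One way this could look, as a sketch only. The `Strategy` base class, the `register` decorator, the `gps_lost` event id, and the control topic and command strings are all hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class Strategy(ABC):
    """One subclass per event type; each could live in its own file."""
    event_id = None  # set by each subclass

    @abstractmethod
    def commands(self, event):
        """Return (control_topic, command) pairs to send for this event."""


STRATEGIES = {}  # lookup table: event id -> strategy instance


def register(cls):
    """Class decorator that adds a strategy to the lookup table."""
    if cls.event_id in STRATEGIES:
        raise ValueError(f"duplicate strategy for {cls.event_id!r}")
    STRATEGIES[cls.event_id] = cls()
    return cls


@register
class GpsLost(Strategy):
    event_id = "gps_lost"  # hypothetical event id

    def commands(self, event):
        # Hypothetical control topic and command for the localization node.
        return [("/localization/control", "use_odometry_only")]


def commands_for(event_id, event=None):
    """Look up the strategy for an event; None means no explicit strategy."""
    strategy = STRATEGIES.get(event_id)
    if strategy is None:
        return None  # caller can fall back to the event's threat level
    return strategy.commands(event)
```

Adding a new event then means adding one small class rather than another branch in a growing conditional, and the duplicate check catches two strategies accidentally claiming the same event id.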
