Adding support for persistent storage and retrieval of DPU reboot-cause #169

Open
wants to merge 55 commits into
base: master
be55f8e
Adding support for persistent storage and retrieval of DPU reboot-cause
rameshraghupathy Oct 11, 2024
3cd7b67
Added support for persisting dpu reboot-cause on smartswitch host
rameshraghupathy Oct 23, 2024
0e47f97
Working on coverage
rameshraghupathy Oct 23, 2024
1feeb9f
Working on ut coverage
rameshraghupathy Oct 23, 2024
210dd14
working on coverage
rameshraghupathy Oct 23, 2024
4c0fa72
working on coverage
rameshraghupathy Oct 23, 2024
807a267
working on coverage
rameshraghupathy Oct 23, 2024
766a677
working on coverage
rameshraghupathy Oct 23, 2024
00496b5
working on coverage
rameshraghupathy Oct 23, 2024
b0b89c4
Fixed a typo
rameshraghupathy Oct 23, 2024
667ec45
Working on coverage
rameshraghupathy Oct 23, 2024
97ff55c
Fixing test failure
rameshraghupathy Oct 23, 2024
0cca074
improving coverage
rameshraghupathy Oct 23, 2024
d97d228
Improving coverage
rameshraghupathy Oct 23, 2024
1d0650f
working on coverage
rameshraghupathy Oct 23, 2024
17345aa
Modifying reboot-cause workflow to meet multiple smartswitch vendor
rameshraghupathy Oct 29, 2024
093bf00
Fixig the assertions to meet the new change
rameshraghupathy Oct 30, 2024
c28e29c
Fixed the DB
rameshraghupathy Nov 25, 2024
887897d
Using the common API device_info.get_dpu_list()
rameshraghupathy Nov 26, 2024
ede90c0
Addressed review comments
rameshraghupathy Dec 6, 2024
6b6b6b9
Added new test file tests/process-reboot-cause_test.py
rameshraghupathy Dec 12, 2024
e44cd1f
Added the scripts_path
rameshraghupathy Dec 12, 2024
a98c06d
Moved setup outside the test class
rameshraghupathy Dec 12, 2024
6d21142
Fixed the file name
rameshraghupathy Dec 12, 2024
bcba133
Fixing test isssues
rameshraghupathy Dec 12, 2024
dffedc8
Working on UT
rameshraghupathy Dec 12, 2024
fefff1a
Fixed the numbeer of arguments to load_module_from_source
rameshraghupathy Dec 12, 2024
3293749
addressed review comments
rameshraghupathy Dec 12, 2024
b61c2d9
adding mock for uid
rameshraghupathy Dec 12, 2024
0e4d23a
passing uid arg
rameshraghupathy Dec 12, 2024
decce6c
Fixing test failure
rameshraghupathy Dec 12, 2024
499d551
Fixing test failure
rameshraghupathy Dec 12, 2024
e9187a0
Fixing test failure
rameshraghupathy Dec 12, 2024
82fc7fd
Fixing test failure
rameshraghupathy Dec 12, 2024
15a70fa
Fixing test failure
rameshraghupathy Dec 13, 2024
7a0d3d8
Fixing test failure
rameshraghupathy Dec 13, 2024
1a55a59
Iproving coverage
rameshraghupathy Dec 13, 2024
74c38ae
Iproving coverage
rameshraghupathy Dec 13, 2024
2dfbc33
Iproving coverage
rameshraghupathy Dec 13, 2024
0702dee
Iproving coverage
rameshraghupathy Dec 13, 2024
ddf4541
Iproving coverage
rameshraghupathy Dec 13, 2024
3e2114f
Iproving coverage
rameshraghupathy Dec 13, 2024
0be0072
Iproving coverage
rameshraghupathy Dec 13, 2024
93a1dff
Iproving coverage
rameshraghupathy Dec 13, 2024
79ed240
Iproving coverage
rameshraghupathy Dec 13, 2024
6d41638
Iproving coverage
rameshraghupathy Dec 13, 2024
d6c4f94
Addressed review comments
rameshraghupathy Jan 2, 2025
3bae343
Addressed review comments
rameshraghupathy Jan 6, 2025
c48efd9
Addressed review comments
rameshraghupathy Jan 6, 2025
31aba69
Addressed review comments
rameshraghupathy Jan 6, 2025
a19e2ff
Addressed review comments
rameshraghupathy Jan 6, 2025
6ad5432
Addressed review comments
rameshraghupathy Jan 6, 2025
95dff75
Addressed review comments
rameshraghupathy Jan 6, 2025
5f9859a
Addressed review comments
rameshraghupathy Jan 6, 2025
67a1f07
Addressed review comments
rameshraghupathy Jan 8, 2025
32 changes: 29 additions & 3 deletions scripts/determine-reboot-cause
@@ -24,6 +24,7 @@ VERSION = "1.0"
SYSLOG_IDENTIFIER = "determine-reboot-cause"

REBOOT_CAUSE_DIR = "/host/reboot-cause/"
REBOOT_CAUSE_MODULE_DIR = "/host/reboot-cause/module"
REBOOT_CAUSE_HISTORY_DIR = "/host/reboot-cause/history/"
REBOOT_CAUSE_FILE = os.path.join(REBOOT_CAUSE_DIR, "reboot-cause.txt")
PREVIOUS_REBOOT_CAUSE_FILE = os.path.join(REBOOT_CAUSE_DIR, "previous-reboot-cause.json")
@@ -132,10 +133,10 @@ def find_hardware_reboot_cause():


def get_reboot_cause_dict(previous_reboot_cause, comment, gen_time):
"""Store the key infomation of device reboot into a dictionary by parsing the string in
"""Store the key information of device reboot into a dictionary by parsing the string in
previous_reboot_cause.

If user issused a command to reboot device, then user, command and time will be
If user issued a command to reboot device, then user, command and time will be
stored into a dictionary.

If device was rebooted due to the kernel panic, then the string `Kernel Panic`
@@ -181,7 +182,7 @@ def determine_reboot_cause():

# The main decision logic of the reboot cause:
# If there is a valid hardware reboot cause indicated by platform API,
# check the software reboot cause to add additional rebot cause.
# check the software reboot cause to add additional reboot cause.
# If there is a reboot cause indicated by /proc/cmdline, and/or warmreboot/fastreboot/softreboot
# the software_reboot_cause which is the content of /host/reboot-cause/reboot-cause.txt
# will be treated as the additional reboot cause
@@ -207,6 +208,27 @@ def determine_reboot_cause():

return previous_reboot_cause, additional_reboot_info

def check_and_create_dpu_dirs():
# Get the list of DPUs
dpus = device_info.get_dpu_list()

# Create directories for each DPU and its history
for dpu in dpus:
dpu_dir = os.path.join(REBOOT_CAUSE_MODULE_DIR, dpu)
history_dir = os.path.join(dpu_dir, "history")

# Create the DPU directory if it doesn't exist
if not os.path.exists(dpu_dir):
os.makedirs(dpu_dir)

# Create reboot-cause.txt and write 'First boot' to it
reboot_file = os.path.join(dpu_dir, 'reboot-cause.txt')
with open(reboot_file, 'w') as f:
f.write('First boot\n')

# Create the history directory if it doesn't exist
if not os.path.exists(history_dir):
os.makedirs(history_dir)
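
A minimal sketch of the same directory bootstrap, written so it can run against a throwaway directory. It assumes that `reboot-cause.txt` should be seeded with `First boot` only when the file is missing; the flattened diff above does not make that nesting explicit. `ensure_dpu_dirs` and its `base_dir` and `dpus` parameters are hypothetical names, whereas the PR's function reads the DPU list from `device_info.get_dpu_list()`.

```python
import os
import tempfile


def ensure_dpu_dirs(base_dir, dpus):
    """Create <base_dir>/<dpu>/ and <base_dir>/<dpu>/history/ for each DPU.

    Hypothetical helper: seeds reboot-cause.txt with 'First boot' only when the
    file is missing, so repeated runs do not overwrite an existing cause.
    """
    for dpu in dpus:
        dpu_dir = os.path.join(base_dir, dpu)
        history_dir = os.path.join(dpu_dir, "history")

        # exist_ok=True makes the call idempotent and also creates dpu_dir
        os.makedirs(history_dir, exist_ok=True)

        reboot_file = os.path.join(dpu_dir, "reboot-cause.txt")
        if not os.path.exists(reboot_file):
            with open(reboot_file, "w") as f:
                f.write("First boot\n")


# Exercise the helper against a temporary directory instead of /host
demo_root = tempfile.mkdtemp()
ensure_dpu_dirs(demo_root, ["dpu0", "dpu1"])
ensure_dpu_dirs(demo_root, ["dpu0", "dpu1"])  # second run changes nothing
```

Using `os.makedirs(..., exist_ok=True)` avoids the separate `os.path.exists` checks, at the cost of no longer being observable through the `os.path.exists` mock used in the unit test.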

def main():
# Configure logger to log all messages INFO level and higher
@@ -257,6 +279,10 @@ def main():
with open(REBOOT_CAUSE_FILE, "w") as cause_file:
cause_file.write(REBOOT_CAUSE_UNKNOWN)

# Create directories for DPUs in SmartSwitch platforms
if device_info.is_smartswitch():
check_and_create_dpu_dirs()


if __name__ == "__main__":
main()
62 changes: 59 additions & 3 deletions scripts/process-reboot-cause
@@ -14,10 +14,12 @@ try:

from swsscommon import swsscommon
from sonic_py_common import syslogger
from sonic_py_common import device_info
except ImportError as err:
raise ImportError("%s - required module not found" % str(err))

VERSION = "1.0"
CHASSIS_SERVER_PORT = 6380

SYSLOG_IDENTIFIER = "process-reboot-cause"

@@ -28,6 +30,7 @@ USER_ISSUED_REBOOT_CAUSE_REGEX ="User issued \'{}\' command [User: {}, Time: {}]

REBOOT_CAUSE_UNKNOWN = "Unknown"
REBOOT_CAUSE_TABLE_NAME = "REBOOT_CAUSE"
MAX_HISTORY_FILES = 10

REDIS_HOSTIP = "127.0.0.1"
state_db = None
@@ -48,7 +51,7 @@ def read_reboot_cause_files_and_save_state_db():

data = []
# Read each sorted previous reboot cause file and update the state db with previous reboot cause information
for i in range(min(10, len(TIME_SORTED_FULL_REBOOT_FILE_LIST))):
for i in range(min(MAX_HISTORY_FILES, len(TIME_SORTED_FULL_REBOOT_FILE_LIST))):
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
if os.path.isfile(x):
with open(x, "r") as cause_file:
@@ -63,12 +66,61 @@ def read_reboot_cause_files_and_save_state_db():
sonic_logger.log_info("Unable to process reload cause file {}: {}".format(x, je))
pass

if len(TIME_SORTED_FULL_REBOOT_FILE_LIST) > 10:
if len(TIME_SORTED_FULL_REBOOT_FILE_LIST) > MAX_HISTORY_FILES:
for i in range(len(TIME_SORTED_FULL_REBOOT_FILE_LIST)):
if i >= 10:
if i >= MAX_HISTORY_FILES:
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
os.remove(x)
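
The retention rule in this hunk (keep the newest `MAX_HISTORY_FILES` entries, delete the rest) can also be expressed with list slicing. The following is only a sketch under that reading; `prune_history` is a hypothetical name, and the input list is assumed to be sorted newest-first already, as `TIME_SORTED_FULL_REBOOT_FILE_LIST` is in the script.

```python
import os
import tempfile

MAX_HISTORY_FILES = 10


def prune_history(sorted_files, keep=MAX_HISTORY_FILES):
    """Delete every file past the first `keep` entries of a newest-first list.

    Returns the list of files that were kept.
    """
    for stale in sorted_files[keep:]:
        os.remove(stale)
    return sorted_files[:keep]


# Demo: create 12 empty files and treat the list as already sorted newest-first
demo_dir = tempfile.mkdtemp()
demo_files = []
for i in range(12):
    path = os.path.join(demo_dir, "cause_{:02d}.json".format(i))
    open(path, "w").close()
    demo_files.append(path)

kept = prune_history(demo_files)
```

Slicing removes the need for the outer length check and the `i >= MAX_HISTORY_FILES` test, since `sorted_files[keep:]` is simply empty when there are `keep` or fewer files.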

def get_sorted_reboot_cause_files(dpu_history_path):
"""Retrieve and sort the reboot cause files for a specific DPU."""
try:
files = os.listdir(dpu_history_path)
sorted_files = sorted(
[os.path.join(dpu_history_path, f) for f in files if f.endswith('.txt')],
key=os.path.getmtime,
reverse=True # Most recent first
)
return sorted_files
except Exception as e:
sonic_logger.log_error(f"Error retrieving reboot cause files for {dpu_history_path}: {e}")
return []
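
To see the newest-first ordering on real files rather than mocks, the sketch below sets modification times explicitly with `os.utime`. `newest_first` is a hypothetical stand-in for `get_sorted_reboot_cause_files` without the logging and exception handling; the file names and timestamps are made up for the demo.

```python
import os
import tempfile


def newest_first(history_dir, suffix=".txt"):
    """Return full paths of files ending in `suffix`, most recently modified first."""
    paths = [
        os.path.join(history_dir, name)
        for name in os.listdir(history_dir)
        if name.endswith(suffix)
    ]
    return sorted(paths, key=os.path.getmtime, reverse=True)


demo_dir = tempfile.mkdtemp()
for name, mtime in [("a.txt", 100), ("b.txt", 200), ("c.txt", 50), ("skip.json", 300)]:
    path = os.path.join(demo_dir, name)
    open(path, "w").close()
    os.utime(path, (mtime, mtime))  # set (atime, mtime) explicitly

ordered = [os.path.basename(p) for p in newest_first(demo_dir)]
```

Note that only names ending in `.txt` are considered, so `skip.json` is ignored even though it has the latest timestamp.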


def update_dpu_reboot_cause_to_chassis_state_db():
"""Retrieve reboot cause from history files and save them to chassisStateDB."""
chassis_state_db = swsscommon.SonicV2Connector(host="redis_chassis.server", port=CHASSIS_SERVER_PORT)
chassis_state_db.connect(chassis_state_db.CHASSIS_STATE_DB)

try:
dpus = device_info.get_dpu_list()

for dpu in dpus:
# Get sorted reboot cause files for the DPU
dpu_history_dir = os.path.join('/host/reboot-cause/module', dpu, 'history')
reboot_files = get_sorted_reboot_cause_files(dpu_history_dir)

for reboot_file in reboot_files:
Review comment (Contributor):

@rameshraghupathy How do we handle the case where the NPU comes up late, so that the DPU-to-NPU midplane is not up by the time this process starts?

Reply (Author):

@prgeor As shown in the HLD, the NPU-chassisd will fetch the reboot-cause from the DPU and persist it.

if os.path.isfile(reboot_file):
with open(reboot_file, "r") as cause_file:
try:
data = json.load(cause_file)
# Ensure keys exist
if 'name' not in data:
sonic_logger.log_warning(f"Missing 'name' in data from {reboot_file}")
continue # Skip this file

_hash = f"{REBOOT_CAUSE_TABLE_NAME}|{dpu.upper()}|{data['name']}"
chassis_state_db.set(chassis_state_db.CHASSIS_STATE_DB, _hash, 'cause', data.get('cause', ''))
chassis_state_db.set(chassis_state_db.CHASSIS_STATE_DB, _hash, 'time', data.get('time', ''))
chassis_state_db.set(chassis_state_db.CHASSIS_STATE_DB, _hash, 'user', data.get('user', ''))
chassis_state_db.set(chassis_state_db.CHASSIS_STATE_DB, _hash, 'comment', data.get('comment', ''))

except json.decoder.JSONDecodeError as je:
sonic_logger.log_info(f"Unable to process reboot-cause file {reboot_file}: {je}")
continue # Skip this file
except Exception as e:
sonic_logger.log_error(f"Error reading DPU reboot causes: {e}")
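
The function above writes one hash per history file, keyed as `REBOOT_CAUSE|<DPU>|<name>`, with the fields `cause`, `time`, `user` and `comment`. The sketch below isolates that mapping as a pure function so the key layout and the skip conditions are easy to check. `to_db_entry` and `FIELDS` are hypothetical names and no swsscommon call is involved.

```python
import json

REBOOT_CAUSE_TABLE_NAME = "REBOOT_CAUSE"
FIELDS = ("cause", "time", "user", "comment")


def to_db_entry(dpu, raw_json):
    """Map one reboot-cause JSON document to (key, field dict).

    Returns None when the JSON is malformed or lacks 'name', mirroring the
    skip-this-file behaviour in the diff.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if "name" not in data:
        return None
    key = "{}|{}|{}".format(REBOOT_CAUSE_TABLE_NAME, dpu.upper(), data["name"])
    # Missing fields default to an empty string, as with data.get(field, '')
    return key, {field: data.get(field, "") for field in FIELDS}


sample = '{"cause": "Non-Hardware", "comment": "Switch rebooted DPU", "name": "2024_12_13_01_12_36"}'
entry = to_db_entry("dpu0", sample)
```

Separating the mapping from the database writes would let a unit test assert on keys and fields without mocking the connector.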

def main():
# Configure logger to log all messages INFO level and higher
@@ -99,6 +151,10 @@ def main():
# Read the previous reboot cause from saved reboot-cause files and save up to 10 previous reboot cause entries to the state db
read_reboot_cause_files_and_save_state_db()

# For smartswitch platform store the DPU reboot-cause to CHASSIS_STATE_DB
if device_info.is_smartswitch():
update_dpu_reboot_cause_to_chassis_state_db()


if __name__ == "__main__":
main()
37 changes: 37 additions & 0 deletions tests/determine-reboot-cause_test.py
@@ -2,6 +2,7 @@
import os
import shutil
import pytest
import json

from swsscommon import swsscommon
from sonic_py_common.general import load_module_from_source
@@ -33,6 +34,8 @@
determine_reboot_cause_path = os.path.join(scripts_path, 'determine-reboot-cause')
determine_reboot_cause = load_module_from_source('determine_reboot_cause', determine_reboot_cause_path)

# Get the function to create dpu dir
check_and_create_dpu_dirs = determine_reboot_cause.check_and_create_dpu_dirs

PROC_CMDLINE_CONTENTS = """\
BOOT_IMAGE=/image-20191130.52/boot/vmlinuz-4.9.0-11-2-amd64 root=/dev/sda4 rw console=tty0 console=ttyS1,9600n8 quiet net.ifnames=0 biosdevname=0 loop=image-20191130.52/fs.squashfs loopfstype=squashfs apparmor=1 security=apparmor varlog_size=4096 usbcore.autosuspend=-1 module_blacklist=gpio_ich SONIC_BOOT_TYPE=warm"""
@@ -71,6 +74,8 @@
EXPECTED_KERNEL_PANIC_REBOOT_CAUSE_DICT = {'comment': '', 'gen_time': '2021_3_28_13_48_49', 'cause': 'Kernel Panic', 'user': 'N/A', 'time': 'Sun Mar 28 13:45:12 UTC 2021'}

REBOOT_CAUSE_DIR="host/reboot-cause/"
PLATFORM_JSON_PATH = "/usr/share/sonic/device/test_platform/platform.json"
REBOOT_CAUSE_MODULE_DIR = "/host/reboot-cause/module"

class TestDetermineRebootCause(object):
def test_parse_warmfast_reboot_from_proc_cmdline(self):
@@ -199,3 +204,35 @@ def test_determine_reboot_cause_main_with_reboot_cause_dir(self):
determine_reboot_cause.main()
assert os.path.exists("host/reboot-cause/reboot-cause.txt") == True
assert os.path.exists("host/reboot-cause/previous-reboot-cause.json") == True

def create_mock_platform_json(self, dpus):
"""Helper function to create a mock platform.json file."""
os.makedirs(os.path.dirname(PLATFORM_JSON_PATH), exist_ok=True)
with open(PLATFORM_JSON_PATH, "w") as f:
json.dump({"DPUS": dpus}, f)

@mock.patch('os.makedirs')
@mock.patch('builtins.open', new_callable=mock.mock_open)
@mock.patch('os.path.exists', side_effect=lambda path: False)
@mock.patch('sonic_py_common.device_info.is_smartswitch', return_value=True)
@mock.patch('sonic_py_common.device_info.get_dpu_list', return_value=["dpu0", "dpu1"])
def test_check_and_create_dpu_dirs(
self,
mock_get_dpu_list,
mock_is_smartswitch,
mock_exists,
mock_open,
mock_makedirs
):
# Call the function under test
check_and_create_dpu_dirs()

# Assert that directories were created for each DPU
mock_makedirs.assert_any_call(os.path.join(REBOOT_CAUSE_MODULE_DIR, "dpu0"))
mock_makedirs.assert_any_call(os.path.join(REBOOT_CAUSE_MODULE_DIR, "dpu1"))
mock_makedirs.assert_any_call(os.path.join(REBOOT_CAUSE_MODULE_DIR, "dpu0", "history"))
mock_makedirs.assert_any_call(os.path.join(REBOOT_CAUSE_MODULE_DIR, "dpu1", "history"))

# Assert that reboot-cause.txt was created for each DPU
mock_open.assert_any_call(os.path.join(REBOOT_CAUSE_MODULE_DIR, "dpu0", "reboot-cause.txt"), 'w')
mock_open.assert_any_call(os.path.join(REBOOT_CAUSE_MODULE_DIR, "dpu1", "reboot-cause.txt"), 'w')
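
The argument order in the test above follows from how stacked `mock.patch` decorators are applied: bottom-up, so the lowest decorator supplies the first mock argument. A small self-contained illustration, with a hypothetical `run_with_mocks` function and a made-up path:

```python
import os
from unittest import mock


@mock.patch("os.makedirs")                          # outermost -> last argument
@mock.patch("os.path.exists", return_value=False)  # innermost -> first argument
def run_with_mocks(mock_exists, mock_makedirs):
    """Decorators are applied bottom-up, so the lowest patch is the first mock argument."""
    if not os.path.exists("/nonexistent/demo"):
        os.makedirs("/nonexistent/demo")  # mocked, nothing is created on disk
    return mock_exists, mock_makedirs


exists_mock, makedirs_mock = run_with_mocks()
makedirs_mock.assert_called_once_with("/nonexistent/demo")
```

Swapping the two parameter names would still run, but the assertions would then be made against the wrong mock, which is a common source of confusing test failures.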
115 changes: 115 additions & 0 deletions tests/process-reboot-cause_test.py
@@ -0,0 +1,115 @@
import sys
import os
from unittest import TestCase
from unittest.mock import patch, MagicMock, mock_open
from io import StringIO
from sonic_py_common.general import load_module_from_source

# Mock the connector
from .mock_connector import MockConnector
import swsscommon

# Mock the SonicV2Connector
swsscommon.SonicV2Connector = MockConnector

# Define the path to the script and load it using the helper function
test_path = os.path.dirname(os.path.abspath(__file__))
modules_path = os.path.dirname(test_path)
scripts_path = os.path.join(modules_path, "scripts")
sys.path.insert(0, modules_path)

# Load the process-reboot-cause module using the helper function
process_reboot_cause_path = os.path.join(scripts_path, "process-reboot-cause")
process_reboot_cause = load_module_from_source('process_reboot_cause', process_reboot_cause_path)

# Test class for process-reboot-cause
class TestProcessRebootCause(TestCase):
@patch("builtins.open", new_callable=mock_open, read_data='{"cause": "PowerLoss", "user": "admin", "time": "2024-12-10", "comment": "test"}')
@patch("os.listdir", return_value=["file1.json", "file2.json"])
@patch("os.path.isfile", return_value=True)
@patch("os.path.exists", side_effect=lambda path: path.endswith('file1.json') or path.endswith('file2.json'))
@patch("os.remove")
@patch("process_reboot_cause.swsscommon.SonicV2Connector")
@patch("process_reboot_cause.device_info.is_smartswitch", return_value=True)
@patch("sys.stdout", new_callable=StringIO)
@patch("os.geteuid", return_value=0)
def test_process_reboot_cause(self, mock_geteuid, mock_stdout, mock_is_smartswitch, mock_connector, mock_remove, mock_exists, mock_isfile, mock_listdir, mock_open):
# Mock DB
mock_db = MagicMock()
mock_connector.return_value = mock_db

# Simulate running the script
with patch.object(sys, "argv", ["process-reboot-cause"]):
process_reboot_cause.main()

# Validate syslog and stdout logging
output = mock_stdout.getvalue()

# Verify DB interactions
mock_db.connect.assert_called()

@patch("builtins.open", new_callable=mock_open, read_data='{"invalid_json}')
@patch("os.listdir", return_value=["file1.json"])
@patch("os.path.isfile", return_value=True)
@patch("os.path.exists", side_effect=lambda path: path.endswith('file1.json'))
@patch("process_reboot_cause.swsscommon.SonicV2Connector")
@patch("process_reboot_cause.device_info.is_smartswitch", return_value=True)
@patch("sys.stdout", new_callable=StringIO)
@patch("os.geteuid", return_value=0)
def test_invalid_json(self, mock_geteuid, mock_stdout, mock_is_smartswitch, mock_connector, mock_exists, mock_isfile, mock_listdir, mock_open):
# Mock DB
mock_db = MagicMock()
mock_connector.return_value = mock_db

# Simulate running the script
with patch.object(sys, "argv", ["process-reboot-cause"]):
process_reboot_cause.main()

# Check invalid JSON handling
output = mock_stdout.getvalue()
self.assertTrue(mock_connector.called)

# Test get_sorted_reboot_cause_files
@patch("process_reboot_cause.os.listdir")
@patch("process_reboot_cause.os.path.getmtime")
def test_get_sorted_reboot_cause_files_success(self, mock_getmtime, mock_listdir):
# Setup mock data
mock_listdir.return_value = ["file1.txt", "file2.txt", "file3.txt"]
mock_getmtime.side_effect = [100, 200, 50] # Mock modification times

# Call the function
result = process_reboot_cause.get_sorted_reboot_cause_files("/mock/dpu_history")

# Assert the files are sorted by modification time in descending order
self.assertEqual(result, [
"/mock/dpu_history/file2.txt",
"/mock/dpu_history/file1.txt",
"/mock/dpu_history/file3.txt"
])
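
The list-valued `side_effect` in the test above relies on `getmtime` being called once per file in listing order. One possible alternative, shown here only as a sketch, is to key the fake modification times by path so the result does not depend on call order. `FAKE_MTIMES` is a hypothetical name.

```python
import os
from unittest import mock

# Fake modification times keyed by full path (hypothetical test data)
FAKE_MTIMES = {
    "/mock/dpu_history/file1.txt": 100,
    "/mock/dpu_history/file2.txt": 200,
    "/mock/dpu_history/file3.txt": 50,
}

# A callable side_effect receives the call arguments and its return value is used
with mock.patch("os.path.getmtime", side_effect=FAKE_MTIMES.__getitem__):
    ordered = sorted(FAKE_MTIMES, key=os.path.getmtime, reverse=True)
```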

@patch("process_reboot_cause.os.listdir")
def test_get_sorted_reboot_cause_files_error(self, mock_listdir):
# Simulate an exception during file listing
mock_listdir.side_effect = Exception("Mocked error")

# Call the function and check the result
result = process_reboot_cause.get_sorted_reboot_cause_files("/mock/dpu_history")
self.assertEqual(result, [])

# Test update_dpu_reboot_cause_to_chassis_state_db
@patch("builtins.open", new_callable=mock_open, read_data='{"cause": "Non-Hardware", "comment": "Switch rebooted DPU", "device": "DPU0", "time": "Fri Dec 13 01:12:36 AM UTC 2024", "name": "2024_12_13_01_12_36"}')
@patch("process_reboot_cause.device_info.get_dpu_list", return_value=["dpu1", "dpu2"])
@patch("os.path.isfile", return_value=True)
@patch("process_reboot_cause.get_sorted_reboot_cause_files")
@patch("process_reboot_cause.os.listdir", return_value=["2024_12_13_01_12_36_reboot_cause.txt", "2024_12_14_01_11_46_reboot_cause.txt"])
@patch("process_reboot_cause.swsscommon.SonicV2Connector")
def test_update_dpu_reboot_cause_to_chassis_state_db_update(self, mock_connector, mock_listdir, mock_get_sorted_files, mock_isfile, mock_get_dpu_list, mock_open):
# Setup mocks
mock_get_sorted_files.return_value = ["/mock/dpu_history/2024_12_13_01_12_36_reboot_cause.txt"]

# Mock the database connection
mock_db = MagicMock()
mock_connector.return_value = mock_db

# Call the function that reads the file and updates the DB
process_reboot_cause.update_dpu_reboot_cause_to_chassis_state_db()
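
The test above calls the function without asserting on the resulting database writes. The sketch below shows the kind of checks that could follow, using a `MagicMock` in place of the connector. `write_cause` is a hypothetical stand-in for the four `chassis_state_db.set(...)` calls in the script, and the literal `"CHASSIS_STATE_DB"` is an assumption for the demo; in the real test the first argument would be the mock's own `CHASSIS_STATE_DB` attribute.

```python
from unittest.mock import MagicMock


def write_cause(db, db_name, key, data):
    """Hypothetical stand-in for the four chassis_state_db.set(...) calls in the diff."""
    for field in ("cause", "time", "user", "comment"):
        db.set(db_name, key, field, data.get(field, ""))


mock_db = MagicMock()
record = {
    "cause": "Non-Hardware",
    "comment": "Switch rebooted DPU",
    "time": "Fri Dec 13 01:12:36 AM UTC 2024",
    "name": "2024_12_13_01_12_36",
}
key = "REBOOT_CAUSE|DPU1|" + record["name"]
write_cause(mock_db, "CHASSIS_STATE_DB", key, record)

# Checks of this shape could be added after calling the function under test
mock_db.set.assert_any_call("CHASSIS_STATE_DB", key, "cause", "Non-Hardware")
mock_db.set.assert_any_call("CHASSIS_STATE_DB", key, "user", "")
```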