Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for persistent storage and retrieval of DPU reboot-cause #169

Open
wants to merge 55 commits into
base: master
Choose a base branch
from

Conversation

rameshraghupathy
Copy link

Adding support for persistent storage and retrieval of DPU reboot-cause

@@ -69,6 +70,45 @@ def read_reboot_cause_files_and_save_state_db():
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
os.remove(x)

def read_dpu_reboot_cause_files_and_save_chassis_state_db():
# Get platform using device_info.get_platform()
platform_info = device_info.get_platform_info()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code to query DPUs from the platform.json should be part of src/sonic-py-common/sonic_py_common/device_info.py. There is an implementation for getting the get_num_dpus, so we can put something like get_dpus_info. This will prevent a code duplication in all places that need to access DPUs information

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.


# Assert that makedirs was called for the DPU directories
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu0'))
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu1'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_dpu_reboot_cause_files_and_save_chassis_state_d is not covered by the tests. Please new test for it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv will do but can we defer it as the entire files doe not have a test yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I think comment is addressed and test is added, right? please confirm.

Copy link
Author

@rameshraghupathy rameshraghupathy Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv Done. Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.

@@ -69,6 +71,84 @@ def read_reboot_cause_files_and_save_state_db():
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
os.remove(x)

def get_dpus():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can we make use of existing APIs? @vvolam ?

Copy link
Author

@rameshraghupathy rameshraghupathy Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor sure, will do, there is a dependency with @vvolam's commit and will do it once that is committed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor done

try:
# Assuming you have a way to list the files in the directory
files = os.listdir(dpu_history_path)
# Filter and sort the files based on your criteria (e.g., by modification time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy why eg? Are we not sure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor cleaned

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I mean why eg? You are indeed sorting by modification time so just put "sorted by modification time"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor cleaned

Comment on lines 141 to 145
_hash = f"{REBOOT_CAUSE_TABLE_NAME}|{data['gen_time']}"
state_db.set(state_db.STATE_DB, _hash, 'cause', data.get('cause', ''))
state_db.set(state_db.STATE_DB, _hash, 'time', data.get('time', ''))
state_db.set(state_db.STATE_DB, _hash, 'user', data.get('user', ''))
state_db.set(state_db.STATE_DB, _hash, 'comment', data.get('comment', ''))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy function says chassis STATE_DB but here its STATE_DB?

Copy link
Author

@rameshraghupathy rameshraghupathy Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Fixing the DB

# Get sorted reboot cause files for the DPU
reboot_files = get_sorted_reboot_cause_files(os.path.join(dpu_history_path, "history"))

for reboot_file in reboot_files:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy How do we handle a case where NPU comes late so that DPU to NPU mid plane is not UP by the time this process starts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor As shown in the HLD the NPU-chassisd will fetch the reboot-cause from the DPU and persist it.

@kperumalbfn
Copy link

@oleksandrivantsiv could you please review and approve

return []


def read_dpu_reboot_cause_files_and_save_chassis_state_db():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can we break this into two functions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor It is already done in two functions get_sorted_reboot_cause_files and saving to chassisStateDB using this function. The name of the function is misleading, so renamed it to save_dpu_reboot_cause_files_to_chassis_state_db()

@@ -23,7 +23,9 @@ VERSION = "1.0"

SYSLOG_IDENTIFIER = "determine-reboot-cause"

MAX_HISTORY_FILES = 10

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Removed

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

scripts/determine-reboot-cause Outdated Show resolved Hide resolved
@@ -69,6 +71,84 @@ def read_reboot_cause_files_and_save_state_db():
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i]
os.remove(x)

def get_dpus():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/determine-reboot-cause_test.py Outdated Show resolved Hide resolved

# Assert that makedirs was called for the DPU directories
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu0'))
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu1'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I think comment is addressed and test is added, right? please confirm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants