-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for persistent storage and retrieval of DPU reboot-cause #169
base: master
Are you sure you want to change the base?
Conversation
scripts/process-reboot-cause
Outdated
@@ -69,6 +70,45 @@ def read_reboot_cause_files_and_save_state_db(): | |||
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i] | |||
os.remove(x) | |||
|
|||
def read_dpu_reboot_cause_files_and_save_chassis_state_db(): | |||
# Get platform using device_info.get_platform() | |||
platform_info = device_info.get_platform_info() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code to query DPUs from the platform.json should be part of src/sonic-py-common/sonic_py_common/device_info.py
. There is an implementation for getting the get_num_dpus, so we can put something like get_dpus_info
. This will prevent a code duplication in all places that need to access DPUs information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.
tests/determine-reboot-cause_test.py
Outdated
|
||
# Assert that makedirs was called for the DPU directories | ||
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu0')) | ||
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu1')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_dpu_reboot_cause_files_and_save_chassis_state_d
is not covered by the tests. Please new test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv will do but can we defer it as the entire files doe not have a test yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I think comment is addressed and test is added, right? please confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Done. Added "process-reboot-cause_test.py" and test_update_dpu_reboot_cause_to_chassis_state_db_update to cover this case.
hardware implementation requirements
466b70d
to
17345aa
Compare
scripts/process-reboot-cause
Outdated
@@ -69,6 +71,84 @@ def read_reboot_cause_files_and_save_state_db(): | |||
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i] | |||
os.remove(x) | |||
|
|||
def get_dpus(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can we make use of existing APIs? @vvolam ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, get_dpu_info() defined in https://github.com/sonic-net/sonic-buildimage/pull/20724/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor done
scripts/process-reboot-cause
Outdated
try: | ||
# Assuming you have a way to list the files in the directory | ||
files = os.listdir(dpu_history_path) | ||
# Filter and sort the files based on your criteria (e.g., by modification time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy why eg? Are we not sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor cleaned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I mean why eg? You are indeed sorting by modification time so just put "sorted by modification time"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor cleaned
scripts/process-reboot-cause
Outdated
_hash = f"{REBOOT_CAUSE_TABLE_NAME}|{data['gen_time']}" | ||
state_db.set(state_db.STATE_DB, _hash, 'cause', data.get('cause', '')) | ||
state_db.set(state_db.STATE_DB, _hash, 'time', data.get('time', '')) | ||
state_db.set(state_db.STATE_DB, _hash, 'user', data.get('user', '')) | ||
state_db.set(state_db.STATE_DB, _hash, 'comment', data.get('comment', '')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy function says chassis STATE_DB but here its STATE_DB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Fixing the DB
# Get sorted reboot cause files for the DPU | ||
reboot_files = get_sorted_reboot_cause_files(os.path.join(dpu_history_path, "history")) | ||
|
||
for reboot_file in reboot_files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy How do we handle a case where NPU comes late so that DPU to NPU mid plane is not UP by the time this process starts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor As shown in the HLD the NPU-chassisd will fetch the reboot-cause from the DPU and persist it.
1dbf35e
to
c28e29c
Compare
@oleksandrivantsiv could you please review and approve |
scripts/process-reboot-cause
Outdated
return [] | ||
|
||
|
||
def read_dpu_reboot_cause_files_and_save_chassis_state_db(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can we break this into two functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor It is already done in two functions get_sorted_reboot_cause_files and saving to chassisStateDB using this function. The name of the function is misleading, so renamed it to save_dpu_reboot_cause_files_to_chassis_state_db()
scripts/determine-reboot-cause
Outdated
@@ -23,7 +23,9 @@ VERSION = "1.0" | |||
|
|||
SYSLOG_IDENTIFIER = "determine-reboot-cause" | |||
|
|||
MAX_HISTORY_FILES = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpunathilell Removed
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
scripts/process-reboot-cause
Outdated
@@ -69,6 +71,84 @@ def read_reboot_cause_files_and_save_state_db(): | |||
x = TIME_SORTED_FULL_REBOOT_FILE_LIST[i] | |||
os.remove(x) | |||
|
|||
def get_dpus(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, get_dpu_info() defined in https://github.com/sonic-net/sonic-buildimage/pull/20724/files
tests/determine-reboot-cause_test.py
Outdated
|
||
# Assert that makedirs was called for the DPU directories | ||
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu0')) | ||
mock_makedirs.assert_any_call(os.path.join('/host/reboot-cause/module', 'dpu1')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I think comment is addressed and test is added, right? please confirm.
Adding support for persistent storage and retrieval of DPU reboot-cause