Bmc watchdog #176

Merged (2 commits) on Nov 15, 2023

@agaoglu agaoglu commented Nov 7, 2023

Our Supermicro BMCs provide watchdog functionality, i.e. they take a specified action if a timer is not reset within a specified time. The FreeIPMI tools include a bmc-watchdog command to control the watchdog and report its current status. This collector reports that information.

An example output for bmc-watchdog from freeipmi:

$ sudo bmc-watchdog --get
Timer Use:                   BIOS FRB2
Timer:                       Running
Logging:                     Enabled
Timeout Action:              Power Cycle
Pre-Timeout Interrupt:       None
Pre-Timeout Interval:        1 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Clear
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           600 seconds
Current Countdown:           541 seconds
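
The output above is a plain list of `Key: Value` lines, so a collector could read it roughly as follows. This is only a minimal sketch under that assumption, not the code from this PR: `parseWatchdogOutput` and `parseSeconds` are hypothetical names, and the sample string is a subset of the output shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseWatchdogOutput splits "Key: Value" lines into a map.
// Hypothetical helper for illustration; not taken from the PR.
func parseWatchdogOutput(out string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(out))
	for scanner.Scan() {
		// Split on the first colon only; lines without a colon are skipped.
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue
		}
		fields[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return fields
}

// parseSeconds converts a value such as "600 seconds" into 600.
// Hypothetical helper; assumes the unit suffix is always "seconds".
func parseSeconds(value string) (float64, error) {
	number := strings.TrimSpace(strings.TrimSuffix(value, "seconds"))
	return strconv.ParseFloat(number, 64)
}

func main() {
	sample := `Timer Use:                   BIOS FRB2
Timer:                       Running
Timeout Action:              Power Cycle
Initial Countdown:           600 seconds
Current Countdown:           541 seconds`

	fields := parseWatchdogOutput(sample)
	fmt.Println("timer use:", fields["Timer Use"])
	fmt.Println("timer:", fields["Timer"])

	if secs, err := parseSeconds(fields["Current Countdown"]); err == nil {
		fmt.Println("current countdown:", secs)
	}
}
```

Splitting on the first colon keeps the parser independent of the column padding that bmc-watchdog uses to align values.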

I've ignored the "Timer Use ... Flag" lines for now. The other fields seem to work, and are exported as follows:

# HELP ipmi_bmc_watchdog_current_countdown_seconds Watchdog initial countdown in seconds
# TYPE ipmi_bmc_watchdog_current_countdown_seconds gauge
ipmi_bmc_watchdog_current_countdown_seconds 584
# HELP ipmi_bmc_watchdog_initial_countdown_seconds Watchdog initial countdown in seconds
# TYPE ipmi_bmc_watchdog_initial_countdown_seconds gauge
ipmi_bmc_watchdog_initial_countdown_seconds 600
# HELP ipmi_bmc_watchdog_logging_state Watchdog log flag (1: Enabled, 0: Disabled / note: reverse of freeipmi)
# TYPE ipmi_bmc_watchdog_logging_state gauge
ipmi_bmc_watchdog_logging_state 1
# HELP ipmi_bmc_watchdog_pretimeout_interrupt_state Watchdog pre-timeout interrupt (1: active, 0: inactive)
# TYPE ipmi_bmc_watchdog_pretimeout_interrupt_state gauge
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="Messaging Interrupt"} 0
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="NMI / Diagnostic Interrupt"} 0
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="None"} 1
ipmi_bmc_watchdog_pretimeout_interrupt_state{interrupt="SMI"} 0
# HELP ipmi_bmc_watchdog_pretimeout_interval_seconds Watchdog pre-timeout interval in seconds
# TYPE ipmi_bmc_watchdog_pretimeout_interval_seconds gauge
ipmi_bmc_watchdog_pretimeout_interval_seconds 1
# HELP ipmi_bmc_watchdog_timeout_action_state Watchdog timeout action (1: active, 0: inactive)
# TYPE ipmi_bmc_watchdog_timeout_action_state gauge
ipmi_bmc_watchdog_timeout_action_state{action="Hard Reset"} 0
ipmi_bmc_watchdog_timeout_action_state{action="None"} 0
ipmi_bmc_watchdog_timeout_action_state{action="Power Cycle"} 1
ipmi_bmc_watchdog_timeout_action_state{action="Power Down"} 0
# HELP ipmi_bmc_watchdog_timer_state Watchdog timer running (1: running, 0: stopped)
# TYPE ipmi_bmc_watchdog_timer_state gauge
ipmi_bmc_watchdog_timer_state 1
# HELP ipmi_bmc_watchdog_timer_use_state Watchdog timer use (1: active, 0: inactive)
# TYPE ipmi_bmc_watchdog_timer_use_state gauge
ipmi_bmc_watchdog_timer_use_state{name="BIOS FRB2"} 1
ipmi_bmc_watchdog_timer_use_state{name="BIOS POST"} 0
ipmi_bmc_watchdog_timer_use_state{name="OEM"} 0
ipmi_bmc_watchdog_timer_use_state{name="OS LOAD"} 0
ipmi_bmc_watchdog_timer_use_state{name="SMS/OS"} 0
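
The labelled `*_state` metrics above emit one series per known value, with 1 on the current value and 0 on the rest. A minimal sketch of that encoding follows. It is not the PR's implementation: `oneHot` is a hypothetical helper, and the `timeoutActions` list is simply copied from the `action` labels in the sample output above.

```go
package main

import "fmt"

// timeoutActions is an assumed list, copied from the action labels
// in the sample metrics; it is not taken from the PR's source.
var timeoutActions = []string{"Hard Reset", "None", "Power Cycle", "Power Down"}

// oneHot returns one value per known state: 1 for the current state,
// 0 for every other state. Hypothetical helper for illustration.
func oneHot(states []string, current string) map[string]float64 {
	values := make(map[string]float64, len(states))
	for _, state := range states {
		if state == current {
			values[state] = 1
		} else {
			values[state] = 0
		}
	}
	return values
}

func main() {
	values := oneHot(timeoutActions, "Power Cycle")
	// Iterate over the slice, not the map, to keep a stable order.
	for _, action := range timeoutActions {
		fmt.Printf("ipmi_bmc_watchdog_timeout_action_state{action=%q} %g\n",
			action, values[action])
	}
}
```

Emitting every known value, rather than only the active one, keeps the set of series stable across scrapes, which makes alerting on a change of state simpler.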

@bitfehler
Contributor

Hi there,

first of all, thanks a lot. This certainly looks interesting and very comprehensive (docs! 🙌). Code also looks pretty good at first glance, but please give me a bit more time to review the details.

For now, I'd be curious: how did you determine the possible fixed values (e.g. watchdogTimerUses, watchdogTimeoutActions, etc.)? Is this defined in the IPMI spec? Or documented somewhere?

@agaoglu
Contributor Author

agaoglu commented Nov 10, 2023 via email

@bitfehler
Contributor

Ok, thanks for your patience, I think this looks pretty good. Could you kindly sign off your commits (git rebase --signoff) so that the tests are happy?

Thanks a lot!

Some BMC's provide a watchdog functionality, i.e. taking some specified
action if a timer is not reset within a specified time. freeipmi tools
have a bmc-watchdog command to control and also report the current status
of such function. This collector reports that information.

Signed-off-by: Erdem Agaoglu <erdem.agaoglu@gmail.com>
@agaoglu
Contributor Author

agaoglu commented Nov 14, 2023

I guess that clears it :)

Thank you.

@bitfehler bitfehler merged commit b302e65 into prometheus-community:master Nov 15, 2023
2 checks passed
@bitfehler
Contributor

Thanks a lot! 🎉
