-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add collector for Linux EDAC #324
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there be a change in the end-to-end output too?
func NewEdacCollector() (Collector, error) { | ||
return &edacCollector{ | ||
ceCount: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, edacSubsystem, "ce_count"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_count is the suffix for Summaries/Histograms. This is probably a counter, so should be _total
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was translating the names directly without thinking about it too hard.
What about correctable_errors_total
, and similar for the other names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
[]string{"controller"}, nil, | ||
), | ||
ueCount: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, edacSubsystem, "ue_count"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spell out "uncorrectable"
Yes, it should be in the end-to-end output, not sure why. |
[]string{"controller"}, nil, | ||
), | ||
ueNoinfoCount: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, edacSubsystem, "no_csrow_uncorrectable_errors_total"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might we want to lump these into csrow_uncorrectable_errors_total with a label like "unknown"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. That depends on how well the kernel implements this data. In theory, unknown row + the csrow numbers should be possible to aggregate. Then we only need two metrics, one for correctable and one for uncorrectable.
cc3eb99
to
dd3dc45
Compare
@SuperQ What is the state of this? Is it ready to get reviewed/merged? |
19efeed
to
e8b92d3
Compare
@discordianfish Ok, finally fixed up the end-to-end test. This is ready to go. |
2b36015
to
374e060
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good beside the regex question. Also needs rebasing.
) | ||
|
||
var ( | ||
edacMemControllerRE = regexp.MustCompile(`.*devices/system/edac/mc/mc([0-9]*)`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the regexes needed? The globbing should already limit the files to the same, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regexp is used to extract the controller number from the directory name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, makes sense. In general I slightly prefer doing such basic parsing manually.. then again, possibly just a personal preference. So fine with me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
), | ||
csrowUeCount: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, edacSubsystem, "csrow_uncorrectable_errors_total"), | ||
"Total correctable memory errors for this csrow.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correctable -> uncorrectable
type edacCollector struct { | ||
ceCount *prometheus.Desc | ||
ueCount *prometheus.Desc | ||
csrowCeCount *prometheus.Desc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
csrowCeCount -> csRowCECount
csrowUeCount -> csRowUECount
👍 otherwise |
d793aed
to
87fd930
Compare
Collect "Error detection and correction" metrics from memory controllers. * Supported on Linux only. * Add basic fixtures. * Enabled by default.
87fd930
to
e4566d0
Compare
e4566d0
to
38a4a36
Compare
Signed-off-by: prombot <prometheus-team@googlegroups.com>
Collect "Error detection and correction" metrics from memory
controllers.