feat(inputs.smartctl): Introduce smartctl JSON input plugin #15066

Merged
merged 8 commits into from
Apr 16, 2024
5 changes: 5 additions & 0 deletions plugins/inputs/all/smartctl.go
@@ -0,0 +1,5 @@
//go:build !custom || inputs || inputs.smartctl

package all

import _ "github.com/influxdata/telegraf/plugins/inputs/smartctl" // register plugin
110 changes: 110 additions & 0 deletions plugins/inputs/smartctl/README.md
@@ -0,0 +1,110 @@
# smartctl JSON Input Plugin

Get metrics using the command line utility `smartctl` for S.M.A.R.T.
(Self-Monitoring, Analysis and Reporting Technology) storage devices. SMART is a
monitoring system included in computer hard disk drives (HDDs), solid-state
drives (SSDs), and NVMe drives that detects and reports on various indicators of
drive reliability, with the intent of enabling the anticipation of hardware
failures.

This version of the plugin requires support of the JSON flag from the `smartctl`
command. This flag was added in 7.0 (2019) and further enhanced in subsequent
releases.

See smartmontools (<https://www.smartmontools.org/>) for more information.
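
The version requirement above can be checked from the tool's own output. The following is a minimal sketch, not part of the plugin: it assumes the `smartctl.version` array (major, minor) that smartctl includes in its JSON output, and `supportsJSON` and `versionInfo` are hypothetical names. Releases older than 7.0 do not emit JSON at all, so a parse failure is treated as "unsupported".

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// versionInfo mirrors only the "smartctl.version" part of the JSON output.
type versionInfo struct {
	Smartctl struct {
		Version []int `json:"version"`
	} `json:"smartctl"`
}

// supportsJSON reports whether the version found in raw is at least 7.0.
// It returns an error if raw is not JSON or carries no version.
func supportsJSON(raw []byte) (bool, error) {
	var v versionInfo
	if err := json.Unmarshal(raw, &v); err != nil {
		return false, fmt.Errorf("parsing smartctl output: %w", err)
	}
	if len(v.Smartctl.Version) == 0 {
		return false, errors.New("no version found in smartctl output")
	}
	return v.Smartctl.Version[0] >= 7, nil
}

func main() {
	// Illustrative input only, not captured from a real device.
	sample := []byte(`{"smartctl": {"version": [7, 2]}}`)
	ok, err := supportsJSON(sample)
	fmt.Println(ok, err)
}
```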

## smart vs smartctl

The smartctl plugin is an alternative to the smart plugin. The biggest
difference is that the smart plugin can also call `nvmectl` to collect
additional details about NVMe devices as well as some vendor specific device
information.

This plugin also requires a version of the `smartctl` command that supports
JSON output, whereas the smart plugin parses the raw text output.

## Global configuration options <!-- @/docs/includes/plugin_config.md -->

In addition to the plugin-specific configuration settings, plugins support
additional global and plugin configuration settings. These settings are used to
modify metrics, tags, and fields, or to create aliases and configure ordering.
See [CONFIGURATION.md][CONFIGURATION.md] for more details.

[CONFIGURATION.md]: ../../../docs/CONFIGURATION.md#plugins

## Configuration

```toml @sample.conf
# Read metrics from SMART storage devices using smartctl's JSON output
[[inputs.smartctl]]
  ## Optionally specify the path to the smartctl executable
  # path = "/usr/sbin/smartctl"

  ## Use sudo
  ## On most platforms, smartctl requires root access. Setting 'use_sudo'
  ## to true will make use of sudo to run smartctl. Sudo must be configured to
  ## allow the telegraf user to run smartctl without a password.
  # use_sudo = false

  ## Devices to include or exclude
  ## By default, the plugin will use all devices found in the output of
  ## `smartctl --scan`. Only one option is allowed at a time. If set, include
  ## sets the specific devices to scan, while exclude omits specific devices.
  # devices_include = []
  # devices_exclude = []

  ## Skip checking disks in specified power mode
  ## Defaults to "standby" to not wake up disks that have stopped rotating.
  ## For full details on the options here, see the --nocheck section in the
  ## smartctl man page. Choose from:
  ##   * never: always check the device
  ##   * sleep: check the device unless it is in sleep mode
  ##   * standby: check the device unless it is in sleep or standby mode
  ##   * idle: check the device unless it is in sleep, standby, or idle mode
  # no_check = "standby"

  ## Timeout for the cli command to complete
  # timeout = "30s"
```
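
The include/exclude behaviour described in the sample configuration can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the plugin uses Telegraf's `filter` package, whereas this sketch substitutes the standard library's `path.Match` for glob matching, and `selectDevices` and `matchAny` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"path"
)

// matchAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching.
func matchAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// selectDevices keeps the devices allowed by include or exclude.
// Setting both lists is rejected, mirroring the check in Init.
func selectDevices(all, include, exclude []string) ([]string, error) {
	if len(include) != 0 && len(exclude) != 0 {
		return nil, errors.New("cannot specify both devices_include and devices_exclude")
	}

	var selected []string
	for _, dev := range all {
		switch {
		case len(include) != 0 && !matchAny(include, dev):
			continue // not explicitly included
		case len(exclude) != 0 && matchAny(exclude, dev):
			continue // explicitly excluded
		}
		selected = append(selected, dev)
	}
	return selected, nil
}

func main() {
	// Device names are examples only.
	found := []string{"/dev/sda", "/dev/sdb", "/dev/nvme0"}
	selected, err := selectDevices(found, []string{"/dev/sd*"}, nil)
	fmt.Println(selected, err)
}
```

With both lists empty, every scanned device is kept, which matches the documented default.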

## Permissions

This plugin runs `smartctl`, which may require additional permissions to
execute successfully. Depending on the user and group permissions of the
account running Telegraf, you may need to run it through sudo.

To do so, enable sudo in the Telegraf configuration:

```toml
[[inputs.smartctl]]
  use_sudo = true
```

Then update the `/etc/sudoers` file to allow the telegraf user to run smartctl:

```bash
$ visudo
# Add the following lines:
Cmnd_Alias SMARTCTL = /usr/sbin/smartctl
telegraf ALL=(ALL) NOPASSWD: SMARTCTL
Defaults!SMARTCTL !logfile, !syslog, !pam_session
```

## Debugging Issues

This plugin uses the following commands to determine devices and collect
metrics:

* `smartctl --json --scan`
* `smartctl --json --all $DEVICE --device $TYPE --nocheck=$NOCHECK`

Please include the output of the above two commands for all devices that are
having issues.
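
To reproduce exactly what the plugin runs, the two command lines above can be assembled as in the sketch below. It follows the argument order used in `scanDevice`, including the `sudo -n` prefix when `use_sudo` is set. The helper names `buildCommand`, `scanArgs`, and `deviceArgs` are hypothetical, the device values are examples, and applying the sudo prefix to the scan command as well is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// scanArgs returns the arguments used to list devices.
func scanArgs() []string {
	return []string{"--json", "--scan"}
}

// deviceArgs returns the arguments used to query a single device.
func deviceArgs(name, devType, noCheck string) []string {
	return []string{"--json", "--all", name, "--device", devType, "--nocheck=" + noCheck}
}

// buildCommand returns the program and argument list to execute.
// With useSudo, smartctl is run through non-interactive sudo.
func buildCommand(smartctlPath string, useSudo bool, args ...string) (string, []string) {
	if useSudo {
		return "sudo", append([]string{"-n", smartctlPath}, args...)
	}
	return smartctlPath, args
}

func main() {
	prog, args := buildCommand("/usr/sbin/smartctl", true, deviceArgs("/dev/sda", "sat", "standby")...)
	fmt.Println(prog, strings.Join(args, " "))
}
```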

## Metrics

## Example Output

```text
```
30 changes: 30 additions & 0 deletions plugins/inputs/smartctl/sample.conf
@@ -0,0 +1,30 @@
# Read metrics from SMART storage devices using smartctl's JSON output
[[inputs.smartctl]]
  ## Optionally specify the path to the smartctl executable
  # path = "/usr/sbin/smartctl"

  ## Use sudo
  ## On most platforms, smartctl requires root access. Setting 'use_sudo'
  ## to true will make use of sudo to run smartctl. Sudo must be configured to
  ## allow the telegraf user to run smartctl without a password.
  # use_sudo = false

  ## Devices to include or exclude
  ## By default, the plugin will use all devices found in the output of
  ## `smartctl --scan`. Only one option is allowed at a time. If set, include
  ## sets the specific devices to scan, while exclude omits specific devices.
  # devices_include = []
  # devices_exclude = []

  ## Skip checking disks in specified power mode
  ## Defaults to "standby" to not wake up disks that have stopped rotating.
  ## For full details on the options here, see the --nocheck section in the
  ## smartctl man page. Choose from:
  ##   * never: always check the device
  ##   * sleep: check the device unless it is in sleep mode
  ##   * standby: check the device unless it is in sleep or standby mode
  ##   * idle: check the device unless it is in sleep, standby, or idle mode
  # no_check = "standby"

  ## Timeout for the cli command to complete
  # timeout = "30s"
93 changes: 93 additions & 0 deletions plugins/inputs/smartctl/smartctl.go
@@ -0,0 +1,93 @@
//go:generate ../../../tools/readme_config_includer/generator
package smartctl

import (
	_ "embed"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"

	"github.com/influxdata/telegraf"
	"github.com/influxdata/telegraf/config"
	"github.com/influxdata/telegraf/filter"
	"github.com/influxdata/telegraf/plugins/inputs"
)

//go:embed sample.conf
var sampleConfig string

// execCommand is used to mock commands in tests.
var execCommand = exec.Command

type Smartctl struct {
	Path           string          `toml:"path"`
	NoCheck        string          `toml:"no_check"`
	UseSudo        bool            `toml:"use_sudo"`
	Timeout        config.Duration `toml:"timeout"`
	DevicesInclude []string        `toml:"devices_include"`
	DevicesExclude []string        `toml:"devices_exclude"`
	Log            telegraf.Logger `toml:"-"`

	deviceFilter filter.Filter
}

func (*Smartctl) SampleConfig() string {
	return sampleConfig
}

func (s *Smartctl) Init() error {
	if s.Path == "" {
		s.Path = "/usr/sbin/smartctl"
	}

	switch s.NoCheck {
	case "never", "sleep", "standby", "idle":
	case "":
		s.NoCheck = "standby"
	default:
		return fmt.Errorf("invalid no_check value: %s", s.NoCheck)
	}

	if s.Timeout == 0 {
		s.Timeout = config.Duration(time.Second * 30)
	}

	if len(s.DevicesInclude) != 0 && len(s.DevicesExclude) != 0 {
		return errors.New("cannot specify both devices_include and devices_exclude")
	}

	var err error
	s.deviceFilter, err = filter.NewIncludeExcludeFilter(s.DevicesInclude, s.DevicesExclude)
	if err != nil {
		return err
	}

	return nil
}

func (s *Smartctl) Gather(acc telegraf.Accumulator) error {
	devices, err := s.scan()
	if err != nil {
		return fmt.Errorf("error scanning system: %w", err)
	}

	for _, device := range devices {
		if err := s.scanDevice(acc, device.Name, device.Type); err != nil {
			return fmt.Errorf("error getting device %s: %w", device.Name, err)
		}
	}

	return nil
}

func init() {
	// Set LC_NUMERIC to uniform numeric output from cli tools
	_ = os.Setenv("LC_NUMERIC", "en_US.UTF-8")
	inputs.Add("smartctl", func() telegraf.Input {
		return &Smartctl{
			Timeout: config.Duration(time.Second * 30),
		}
	})
}
151 changes: 151 additions & 0 deletions plugins/inputs/smartctl/smartctl_device.go
@@ -0,0 +1,151 @@
package smartctl

import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/influxdata/telegraf"
	"github.com/influxdata/telegraf/internal"
)

func (s *Smartctl) scanDevice(acc telegraf.Accumulator, deviceName string, deviceType string) error {
	args := []string{"--json", "--all", deviceName, "--device", deviceType, "--nocheck=" + s.NoCheck}
	cmd := execCommand(s.Path, args...)
	if s.UseSudo {
		cmd = execCommand("sudo", append([]string{"-n", s.Path}, args...)...)
	}

	var device smartctlDeviceJSON
	out, err := internal.CombinedOutputTimeout(cmd, time.Duration(s.Timeout))
	if err != nil {
		// If the command failed and its output is not valid JSON, bail out.
		if jsonErr := json.Unmarshal(out, &device); jsonErr != nil {
			return fmt.Errorf("error running smartctl with %s: %w", args, err)
		}

		// If the output parsed, only bail out when smartctl reported an error;
		// warnings can accompany output that still contains usable data.
		if len(device.Smartctl.Messages) > 0 &&
			device.Smartctl.Messages[0].Severity == "error" &&
			device.Smartctl.Messages[0].String != "" {
			return fmt.Errorf("error running smartctl with %s got smartctl error message: %s", args, device.Smartctl.Messages[0].String)
		}
	}

	if err := json.Unmarshal(out, &device); err != nil {
		return fmt.Errorf("unable to unmarshal response %s: %w", args, err)
	}

	t := time.Now()

	tags := map[string]string{
		"name":   device.Device.Name,
		"type":   device.Device.Type,
		"model":  device.ModelName,
		"serial": device.SerialNumber,
	}

	if device.Vendor != "" {
		tags["vendor"] = device.Vendor
	}

	// The JSON WWN is in decimal and needs to be converted to hex
	if device.Wwn.ID != 0 && device.Wwn.Naa != 0 && device.Wwn.Oui != 0 {
		tags["wwn"] = fmt.Sprintf("%01x%06x%09x", device.Wwn.Naa, device.Wwn.Oui, device.Wwn.ID)
	}

	fields := map[string]interface{}{
		"capacity":    device.UserCapacity.Bytes,
		"health_ok":   device.SmartStatus.Passed,
		"temperature": device.Temperature.Current,
		"firmware":    device.FirmwareVersion,
	}

	// Add NVMe specific fields
	if device.Device.Type == "nvme" {
		fields["critical_warning"] = device.NvmeSmartHealthInformationLog.CriticalWarning
		fields["temperature"] = device.NvmeSmartHealthInformationLog.Temperature
		fields["available_spare"] = device.NvmeSmartHealthInformationLog.AvailableSpare
		fields["available_spare_threshold"] = device.NvmeSmartHealthInformationLog.AvailableSpareThreshold
		fields["percentage_used"] = device.NvmeSmartHealthInformationLog.PercentageUsed
		fields["data_units_read"] = device.NvmeSmartHealthInformationLog.DataUnitsRead
		fields["data_units_written"] = device.NvmeSmartHealthInformationLog.DataUnitsWritten
		fields["host_reads"] = device.NvmeSmartHealthInformationLog.HostReads
		fields["host_writes"] = device.NvmeSmartHealthInformationLog.HostWrites
		fields["controller_busy_time"] = device.NvmeSmartHealthInformationLog.ControllerBusyTime
		fields["power_cycles"] = device.NvmeSmartHealthInformationLog.PowerCycles
		fields["power_on_hours"] = device.NvmeSmartHealthInformationLog.PowerOnHours
		fields["unsafe_shutdowns"] = device.NvmeSmartHealthInformationLog.UnsafeShutdowns
		fields["media_errors"] = device.NvmeSmartHealthInformationLog.MediaErrors
		fields["num_err_log_entries"] = device.NvmeSmartHealthInformationLog.NumErrLogEntries
		fields["warning_temp_time"] = device.NvmeSmartHealthInformationLog.WarningTempTime
		fields["critical_comp_time"] = device.NvmeSmartHealthInformationLog.CriticalCompTime
	}

	acc.AddFields("smartctl", fields, tags, t)

	// Check for ATA specific attribute fields
	for _, attribute := range device.AtaSmartAttributes.Table {
		attributeTags := make(map[string]string, len(tags)+1)
		for k, v := range tags {
			attributeTags[k] = v
		}
		// Note: this replaces the device "name" tag copied above with the
		// attribute name.
		attributeTags["name"] = attribute.Name

		fields := map[string]interface{}{
			"raw_value": attribute.Raw.Value,
			"worst":     attribute.Worst,
			"threshold": attribute.Thresh,
			"value":     attribute.Value,
		}

		acc.AddFields("smartctl_attributes", fields, attributeTags, t)
	}

	// Check for SCSI error counter entries
	if device.Device.Type == "scsi" {
		counterTags := make(map[string]string, len(tags)+1)
		for k, v := range tags {
			counterTags[k] = v
		}

		counterTags["page"] = "read"
		fields := map[string]interface{}{
			"errors_corrected_by_eccfast":          device.ScsiErrorCounterLog.Read.ErrorsCorrectedByEccfast,
			"errors_corrected_by_eccdelayed":       device.ScsiErrorCounterLog.Read.ErrorsCorrectedByEccdelayed,
			"errors_corrected_by_rereads_rewrites": device.ScsiErrorCounterLog.Read.ErrorsCorrectedByRereadsRewrites,
			"total_errors_corrected":               device.ScsiErrorCounterLog.Read.TotalErrorsCorrected,
			"correction_algorithm_invocations":     device.ScsiErrorCounterLog.Read.CorrectionAlgorithmInvocations,
			"gigabytes_processed":                  device.ScsiErrorCounterLog.Read.GigabytesProcessed,
			"total_uncorrected_errors":             device.ScsiErrorCounterLog.Read.TotalUncorrectedErrors,
		}
		acc.AddFields("smartctl_scsi_error_counter_log", fields, counterTags, t)

		counterTags["page"] = "write"
		fields = map[string]interface{}{
			"errors_corrected_by_eccfast":          device.ScsiErrorCounterLog.Write.ErrorsCorrectedByEccfast,
			"errors_corrected_by_eccdelayed":       device.ScsiErrorCounterLog.Write.ErrorsCorrectedByEccdelayed,
			"errors_corrected_by_rereads_rewrites": device.ScsiErrorCounterLog.Write.ErrorsCorrectedByRereadsRewrites,
			"total_errors_corrected":               device.ScsiErrorCounterLog.Write.TotalErrorsCorrected,
			"correction_algorithm_invocations":     device.ScsiErrorCounterLog.Write.CorrectionAlgorithmInvocations,
			"gigabytes_processed":                  device.ScsiErrorCounterLog.Write.GigabytesProcessed,
			"total_uncorrected_errors":             device.ScsiErrorCounterLog.Write.TotalUncorrectedErrors,
		}
		acc.AddFields("smartctl_scsi_error_counter_log", fields, counterTags, t)

		counterTags["page"] = "verify"
		fields = map[string]interface{}{
			"errors_corrected_by_eccfast":          device.ScsiErrorCounterLog.Verify.ErrorsCorrectedByEccfast,
			"errors_corrected_by_eccdelayed":       device.ScsiErrorCounterLog.Verify.ErrorsCorrectedByEccdelayed,
			"errors_corrected_by_rereads_rewrites": device.ScsiErrorCounterLog.Verify.ErrorsCorrectedByRereadsRewrites,
			"total_errors_corrected":               device.ScsiErrorCounterLog.Verify.TotalErrorsCorrected,
			"correction_algorithm_invocations":     device.ScsiErrorCounterLog.Verify.CorrectionAlgorithmInvocations,
			"gigabytes_processed":                  device.ScsiErrorCounterLog.Verify.GigabytesProcessed,
			"total_uncorrected_errors":             device.ScsiErrorCounterLog.Verify.TotalUncorrectedErrors,
		}
		acc.AddFields("smartctl_scsi_error_counter_log", fields, counterTags, t)
	}

	return nil
}
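
The decimal-to-hex WWN conversion in `scanDevice` can be checked in isolation. The sketch below reuses the same format string, which pads the NAA to 1 hex digit, the OUI to 6, and the ID to 9, for 16 digits in total. The `wwn` type, the `formatWWN` helper, and the sample values are assumptions made for illustration and are not taken from the plugin or from real hardware.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wwn holds the three decimal components of a World Wide Name.
type wwn struct {
	Naa uint64 `json:"naa"`
	Oui uint64 `json:"oui"`
	ID  uint64 `json:"id"`
}

// formatWWN renders the components as one zero-padded hex string.
func formatWWN(w wwn) string {
	return fmt.Sprintf("%01x%06x%09x", w.Naa, w.Oui, w.ID)
}

func main() {
	// Illustrative JSON fragment, not captured from a real device.
	raw := []byte(`{"wwn": {"naa": 5, "oui": 5358, "id": 1}}`)

	var doc struct {
		Wwn wwn `json:"wwn"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(formatWWN(doc.Wwn))
}
```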