feat(inputs.smartctl): Introduce smartctl JSON input plugin #15066

Merged: 8 commits, Apr 16, 2024

Conversation

powersj
Contributor

@powersj powersj commented Mar 26, 2024

Summary

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #

@telegraf-tiger telegraf-tiger bot added the feat and plugin/input labels Mar 26, 2024
@powersj powersj self-assigned this Mar 26, 2024
@nlgranger

Hi! To answer your question in the issue, here is the command that returns exit code 4:
sudo smartctl --json --all /dev/bus/6 --device megaraid,14 --nocheck=standby
smartctl_megaraid.txt
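
For context on that exit code: the smartctl man page documents the exit status as a bitmask, so 4 means bit 2 is set (a SMART or other ATA command to the disk failed, or a SMART data structure had a checksum error). A minimal Go sketch of decoding it follows; `decodeExitStatus` and the shortened descriptions are illustrative names and paraphrases, not code from this PR.

```go
package main

import "fmt"

// exitBits paraphrases the exit-status bits from the smartctl man page.
// Index i describes bit i of the status.
var exitBits = []string{
	"command line did not parse",
	"device open failed or device is in a low-power mode",
	"a SMART or other ATA command failed, or a checksum error occurred",
	"SMART status check returned DISK FAILING",
	"prefail attributes at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains error records",
	"device self-test log contains error records",
}

// decodeExitStatus (hypothetical helper) returns one description per set bit.
func decodeExitStatus(status int) []string {
	var reasons []string
	for bit, desc := range exitBits {
		if status&(1<<bit) != 0 {
			reasons = append(reasons, desc)
		}
	}
	return reasons
}

func main() {
	for _, reason := range decodeExitStatus(4) {
		fmt.Println(reason)
	}
}
```

A non-zero status therefore does not by itself mean that no usable JSON was produced, which is why inspecting individual bits can be more useful than treating any non-zero exit as fatal.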

@nlgranger

nlgranger commented Mar 29, 2024

I think the plugin only collects data from the last disk of the raid controller.

$ sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d sat # /dev/sdc [SAT], ATA device
/dev/bus/6 -d megaraid,8 # /dev/bus/6 [megaraid_disk_08], SCSI device
/dev/bus/6 -d megaraid,9 # /dev/bus/6 [megaraid_disk_09], SCSI device
/dev/bus/6 -d megaraid,10 # /dev/bus/6 [megaraid_disk_10], SCSI device
/dev/bus/6 -d megaraid,11 # /dev/bus/6 [megaraid_disk_11], SCSI device
/dev/bus/6 -d megaraid,12 # /dev/bus/6 [megaraid_disk_12], SCSI device
/dev/bus/6 -d megaraid,13 # /dev/bus/6 [megaraid_disk_13], SCSI device
/dev/bus/6 -d megaraid,14 # /dev/bus/6 [megaraid_disk_14], SCSI device

Only the sd[x] devices and megaraid disk 14 appear in the telegraf logs (and are visible in InfluxDB):

smartctl,host=**********,model=LSI\ MR9361-8i,name=/dev/sda,serial=00dfc2930dda29b821305dcc0cb00506,type=scsi firmware="",capacity=255516999680i,health_ok=false,temperature=0i 1711745550000000000
smartctl,host=**********,model=ST6000NM0115-1YZ110,name=/dev/bus/6,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d capacity=6001175126016i,health_ok=true,temperature=25i,firmware="SN04" 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Raw_Read_Error_Rate,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d value=83i,raw_value=185123688i,worst=64i,threshold=44i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Spin_Up_Time,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=91i,threshold=0i,value=91i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Start_Stop_Count,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=62i,worst=100i,threshold=20i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Reallocated_Sector_Ct,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=100i,threshold=10i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Seek_Error_Rate,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=2836411902i,worst=60i,threshold=45i,value=95i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Power_On_Hours,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=44345i,worst=50i,threshold=0i,value=50i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Spin_Retry_Count,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d worst=100i,threshold=97i,value=100i,raw_value=0i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Power_Cycle_Count,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=62i,worst=100i,threshold=20i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=End-to-End_Error,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=100i,threshold=99i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Reported_Uncorrect,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d worst=100i,threshold=0i,value=100i,raw_value=0i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Command_Timeout,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=100i,threshold=0i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=High_Fly_Writes,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d worst=100i,threshold=0i,value=100i,raw_value=0i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Airflow_Temperature_Cel,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=504627225i,worst=64i,threshold=40i,value=75i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=G-Sense_Error_Rate,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=3296i,worst=99i,threshold=0i,value=99i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Power-Off_Retract_Count,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=1866i,worst=100i,threshold=0i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Load_Cycle_Count,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d worst=100i,threshold=0i,value=100i,raw_value=1895i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Temperature_Celsius,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d value=25i,raw_value=30064771097i,worst=40i,threshold=0i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Hardware_ECC_Recovered,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d value=83i,raw_value=185123688i,worst=64i,threshold=0i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Current_Pending_Sector,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=100i,threshold=0i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Offline_Uncorrectable,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=100i,threshold=0i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=UDMA_CRC_Error_Count,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=0i,worst=200i,threshold=0i,value=200i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Head_Flying_Hours,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=8331506409844008i,worst=253i,threshold=0i,value=100i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Total_LBAs_Written,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d worst=253i,threshold=0i,value=100i,raw_value=28115177774i 1711745550000000000
smartctl_attributes,host=**********,model=ST6000NM0115-1YZ110,name=Total_LBAs_Read,serial=ZAD2C11G,type=sat+megaraid\,14,wwn=5000c500a496983d raw_value=3197497259616i,worst=253i,threshold=0i,value=100i 1711745550000000000
smartctl,host=**********,model=LSI\ MR9361-8i,name=/dev/sdb,serial=00aa72f30f022ab821305dcc0cb00506,type=scsi capacity=18001818550272i,health_ok=false,temperature=0i,firmware="" 1711745550000000000
smartctl,host=**********,name=/dev/sdc,type=sat temperature=0i,firmware="",capacity=0i,health_ok=false 1711745550000000000

EDIT: sudo strace -e trace=process -f -p $(pidof ./telegraf) also confirms smartctl is only called for these devices.

@nlgranger

Adding scan result with megaraid controller in case you want to make a test out of it.
smartctl_scan_megaraid.txt

@powersj
Contributor Author

powersj commented Mar 29, 2024

Adding scan result with megaraid controller in case you want to make a test out of it. smartctl_scan_megaraid.txt

Yes, thank you very much!

@powersj
Contributor Author

powersj commented Apr 1, 2024

@nlgranger,

I've pushed an update that does not assume unique names during scan. The next set of artifacts after this message will contain a build with those changes. If you could once again give it a shot and let me know if all the data is collected, I would really appreciate it!

Thanks again!
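
To illustrate the kind of issue involved (a sketch, not the actual change in this PR): all megaraid disks in the scan share the name /dev/bus/6 and differ only in type, so a collection keyed by name alone keeps one entry per controller. The Go sketch below keys on the name/type pair instead. The `devices` array layout is an assumption about what `smartctl --json --scan` prints, and `scanDevice`, `parseScan`, and `sampleScan` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scanDevice mirrors one entry of the "devices" array that
// `smartctl --json --scan` is assumed to print.
type scanDevice struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

type scanResult struct {
	Devices []scanDevice `json:"devices"`
}

// sampleScan is a shortened, hand-written stand-in for real scan output.
const sampleScan = `{"devices": [
	{"name": "/dev/sda", "type": "scsi"},
	{"name": "/dev/bus/6", "type": "megaraid,8"},
	{"name": "/dev/bus/6", "type": "megaraid,9"}
]}`

// parseScan (hypothetical helper) returns every distinct name/type pair.
// Deduplicating on the name alone would collapse both megaraid disks.
func parseScan(data []byte) ([]scanDevice, error) {
	var res scanResult
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, fmt.Errorf("parsing scan output: %w", err)
	}
	seen := make(map[scanDevice]bool)
	var devices []scanDevice
	for _, d := range res.Devices {
		if seen[d] {
			continue
		}
		seen[d] = true
		devices = append(devices, d)
	}
	return devices, nil
}

func main() {
	devices, err := parseScan([]byte(sampleScan))
	if err != nil {
		panic(err)
	}
	for _, d := range devices {
		fmt.Printf("smartctl --json --all %s --device %s\n", d.Name, d.Type)
	}
}
```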

@nlgranger

It works perfectly now!

@powersj
Contributor Author

powersj commented Apr 2, 2024

Thanks for confirming!

@srebhan
Member

srebhan commented Apr 3, 2024

@powersj FYI: there is a pure-go library for accessing SMART information: https://github.com/anatol/smart.go

@DEvil0000
Contributor

DEvil0000 commented Apr 4, 2024

I am coming from #15095. I gave the build above from the bot a spin. The result, however, is that it works for normal disks but does not show any SMART data for the RAID disks. Same config and machine as in #15095, so attribute reading is not enabled in the config.

Edit: adding potentially relevant output:

root@g11:~# sudo -u telegraf sudo -n smartctl --json --all /dev/bus/0 --device megaraid,2 --nocheck standby
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      1
    ],
    "svn_revision": "5022",
    "platform_info": "x86_64-linux-5.4.0-173-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
      "--all",
      "/dev/bus/0",
      "--device",
      "megaraid,2",
      "--nocheck",
      "standby"
    ],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/bus/0",
    "info_name": "/dev/bus/0 [megaraid_disk_02]",
    "type": "megaraid,2",
    "protocol": "SCSI"
  },
  "vendor": "HPE",
  "product": "EG001200JWJNK",
  "model_name": "HPE EG001200JWJNK",
  "revision": "HPD5",
  "scsi_version": "SPC-4",
  "user_capacity": {
    "blocks": 2344225968,
    "bytes": 1200243695616
  },
  "logical_block_size": 512,
  "rotation_rate": 10000,
  "form_factor": {
    "scsi_value": 3,
    "name": "2.5 inches"
  },
  "serial_number": "9370A0C1FF4F",
  "device_type": {
    "scsi_value": 0,
    "name": "disk"
  },
  "local_time": {
    "time_t": 1712244165,
    "asctime": "Thu Apr  4 17:22:45 2024 CEST"
  },
  "smart_status": {
    "passed": true
  },
  "temperature": {
    "current": 52,
    "drive_trip": 60
  },
  "scsi_grown_defect_list": 0,
  "scsi_error_counter_log": {
    "read": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 0,
      "gigabytes_processed": "7345.274",
      "total_uncorrected_errors": 0
    },
    "write": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 0,
      "gigabytes_processed": "333.228",
      "total_uncorrected_errors": 0
    },
    "verify": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 0,
      "gigabytes_processed": "8402.033",
      "total_uncorrected_errors": 0
    }
  }
}
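
As a rough illustration of how output like this can be consumed, here is a minimal Go sketch that decodes only keys visible in the JSON above. The `deviceReport` struct, `parseReport`, and `sampleReport` are hypothetical names, and the plugin's real structs may look different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceReport covers only keys that appear in the output above.
// encoding/json ignores every key that has no matching field.
type deviceReport struct {
	Device struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"device"`
	ModelName    string `json:"model_name"`
	SerialNumber string `json:"serial_number"`
	UserCapacity struct {
		Bytes int64 `json:"bytes"`
	} `json:"user_capacity"`
	SmartStatus struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
}

// sampleReport is a trimmed copy of the output above.
const sampleReport = `{
	"device": {"name": "/dev/bus/0", "type": "megaraid,2"},
	"model_name": "HPE EG001200JWJNK",
	"serial_number": "9370A0C1FF4F",
	"user_capacity": {"blocks": 2344225968, "bytes": 1200243695616},
	"smart_status": {"passed": true},
	"temperature": {"current": 52, "drive_trip": 60}
}`

// parseReport (hypothetical helper) decodes one smartctl device report.
func parseReport(data []byte) (deviceReport, error) {
	var r deviceReport
	if err := json.Unmarshal(data, &r); err != nil {
		return r, fmt.Errorf("parsing device report: %w", err)
	}
	return r, nil
}

func main() {
	r, err := parseReport([]byte(sampleReport))
	if err != nil {
		panic(err)
	}
	fmt.Printf("name=%s type=%s model=%q serial=%s\n",
		r.Device.Name, r.Device.Type, r.ModelName, r.SerialNumber)
	fmt.Printf("capacity=%d health_ok=%t temperature=%d\n",
		r.UserCapacity.Bytes, r.SmartStatus.Passed, r.Temperature.Current)
}
```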

@powersj
Contributor Author

powersj commented Apr 4, 2024

however, is that it works for normal disks but does not show any SMART data for the RAID disks. Same config and machine as in #15095

Hopefully you are not using the same config, as this PR introduces a new plugin called inputs.smartctl to parse the JSON output.

If you still have issues after updating the input name, please provide the results of:

  • smartctl --json --scan

And the device scan from the device you are not seeing:

  • smartctl --json --all $DEVICE --device $TYPE

@DEvil0000
Contributor

Thanks for the config change hint, I had just missed that. In the meantime I added the output above.
Works fine for me then.

The path for me, however, is /usr/sbin/smartctl and not /usr/bin/smartctl; you may want to consider a change there.
Can I get an enable/disable option for the attributes? I am normally only interested in health (-H parameter).

@powersj
Contributor Author

powersj commented Apr 4, 2024

works fine for me then.

Thanks for confirming!

The path for me, however, is /usr/sbin/smartctl and not /usr/bin/smartctl; you may want to consider a change there.

There is a config option for this ;)

    ## Optionally specify the path to the smartctl executable
    # path = "/usr/bin/smartctl"

Can I get an enable/disable option for the attributes?

You can use metric filtering to remove unwanted metrics. For example, the namedrop config parameter drops metrics with a matching name before they leave the plugin: namedrop = ["smartctl_attributes"]

@DEvil0000
Contributor

btw: thanks for the work you put in this!

Another thing I realized is that newer versions of the smart plugin also call the nvme CLI tool for some details. Not sure if that actually matters or if you want to include this as well.

There is a config option for this ;)

That is what I used. Pointing to /usr/bin, however, just looks like an odd default to me since I do not know any distro packaging it to this path.

Can I get an enable/disable option for the attributes?

Sure, I can filter, but such an option would have been an easier config and might reduce computational load.

@nlgranger

I do not know any distro packaging it to this path.

Archlinux! (BTW™)

@powersj
Contributor Author

powersj commented Apr 4, 2024

Not sure if that actually matters or you want to include this as well.

Yeah this is part of the reason why this is a new plugin. I specifically did not want to continue doing this. The nvme calls should have been a second plugin.

I do not know any distro packaging it to this path.
Archlinux! (BTW™)

hahah exactly :) I copied that path over from the previous plugin, but if the bulk of distros use /usr/sbin then I can update that to the default. I know debian/ubuntu use /usr/sbin. If Fedora does the same, then I think that would be ok to switch.
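
For readers wondering what a distro-agnostic lookup could look like, here is a hedged Go sketch that tries several candidate locations in order. This is not what the plugin does (it uses the single `path` option quoted earlier); `findSmartctl` and the candidate list are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findSmartctl (hypothetical helper) returns the first candidate that
// resolves to an executable. exec.LookPath checks a candidate containing a
// slash directly and searches PATH only for bare names.
func findSmartctl(candidates []string) (string, error) {
	for _, c := range candidates {
		if p, err := exec.LookPath(c); err == nil {
			return p, nil
		}
	}
	return "", fmt.Errorf("smartctl not found in %v", candidates)
}

func main() {
	// Candidate order is an assumption: Debian/Ubuntu and RHEL style first,
	// then the /usr/bin location, then whatever PATH provides.
	path, err := findSmartctl([]string{
		"/usr/sbin/smartctl",
		"/usr/bin/smartctl",
		"smartctl",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using", path)
}
```

A single explicit default plus a config option is simpler to reason about, which is the trade-off discussed in this thread.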

@nlgranger

If Fedora does the same, then I think that would be ok to switch.

RHEL does.

@powersj
Contributor Author

powersj commented Apr 4, 2024

@srebhan,

@powersj FYI: there is a pure-go library for accessing SMART information: https://github.com/anatol/smart.go

Thanks for finding this. I took a look at this and have the following comments:

  1. The library does not scan or find devices for us. The examples use github.com/jaypipes/ghw to discover disks. I would rather we continue to use the scan JSON output from the command itself, as it is a bit easier to ask users for that output if issues are found.
  2. smartmontools does exist for other platforms, while that library is Linux only. They do have some darwin files, but those require cgo. While I have not tried this, if the JSON output works on those platforms, then this plugin should work on them as well.
  3. As with the nvidia_smi plugin, I prefer to parse the actual data from the tool itself, which allows us to easily add new metrics or fields that may get requested. I do not want to depend on another library (i.e. gopsutil) in another plugin.
  4. The library's commands to get SMART data depend on the device type and may not cover all types. For example, the megaraid type in the above user's comments may not actually be covered. We may be able to switch this to SCSI, but it would be logic we have to maintain.

Edit: the repo mentioned https://github.com/dswarbrick/smart/ as a source, and it looks like they even needed megaraid-specific handling: https://github.com/dswarbrick/smart/blob/master/megaraid/megaraid.go

@powersj powersj added the ready for final review label Apr 4, 2024
@powersj powersj assigned srebhan and DStrand1 and unassigned powersj Apr 4, 2024
@powersj powersj marked this pull request as ready for review April 4, 2024 20:38
Member

@srebhan srebhan left a comment

Thanks @powersj! The code looks good, my only question is how this relates to inputs.smart? Is this a replacement? If not, what are the differences?

@powersj
Contributor Author

powersj commented Apr 10, 2024

how this relates to inputs.smart? Is this a replacement? If not, what are the differences?

Should I add the following to the readme?

The smartctl plugin focuses solely on collecting data from smartctl via the JSON output of its scan and collection commands.

The smart plugin parses the raw output of smartctl using a variety of regular expressions. It can also optionally collect additional details via the nvme command, including some vendor-specific details.

I don't fully consider smartctl a replacement, given it lacks the additional Intel NVMe collection/details, but for users who report issues using smart in the future, I would absolutely point them at smartctl first.

@srebhan
Member

srebhan commented Apr 11, 2024

@powersj after sleeping on it, it would be nice to add this statement to the readme. I guess this will also clarify things for the user.

@powersj
Contributor Author

powersj commented Apr 11, 2024

@powersj after sleeping on it, it would be nice to add this statement to the readme. I guess this will also clarify things for the user.

done, give it a read and let me know if I should add something to the smart plugin as well.

@telegraf-tiger
Contributor

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

🥳 This pull request decreases the Telegraf binary size by -2.19 % for linux amd64 (new size: 231.5 MB, nightly size 236.7 MB)

Artifact URLs

DEB: amd64.deb, arm64.deb, armel.deb, armhf.deb, i386.deb, mips.deb, mipsel.deb, ppc64el.deb, riscv64.deb, s390x.deb
RPM: aarch64.rpm, armel.rpm, armv6hl.rpm, i386.rpm, ppc64le.rpm, riscv64.rpm, s390x.rpm, x86_64.rpm
TAR GZ: darwin_amd64.tar.gz, darwin_arm64.tar.gz, freebsd_amd64.tar.gz, freebsd_armv7.tar.gz, freebsd_i386.tar.gz, linux_amd64.tar.gz, linux_arm64.tar.gz, linux_armel.tar.gz, linux_armhf.tar.gz, linux_i386.tar.gz, linux_mips.tar.gz, linux_mipsel.tar.gz, linux_ppc64le.tar.gz, linux_riscv64.tar.gz, linux_s390x.tar.gz
ZIP: windows_amd64.zip, windows_arm64.zip, windows_i386.zip

Member

@srebhan srebhan left a comment

Awesome! Thanks @powersj!

@srebhan srebhan removed their assignment Apr 12, 2024
@DStrand1 DStrand1 merged commit 1214de6 into influxdata:master Apr 16, 2024
26 checks passed
@github-actions github-actions bot added this to the v1.31.0 milestone Apr 16, 2024
5 participants