
Harvest service doesn't work as expected #122

Closed
faguayot opened this issue Jun 7, 2021 · 13 comments
faguayot commented Jun 7, 2021

Describe the bug

When I start the harvest service, it only runs two pollers out of the 11 we have defined in harvest.yml. If I run the harvest process directly with **/opt/harvest/bin/harvest start** or **/opt/harvest/bin/harvest restart**, all the pollers run correctly. The same happens with the different subcommands of the service, that is: start, restart, status and stop.

I attached an image of what we see when the service starts.

image

Environment

  • Harvest version: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: RHEL 8.2
  • Install method: yum
  • ONTAP Version: 9.5 and 9.7
  • Other:

To Reproduce
Running the service: systemctl start harvest.service

Expected behavior
It should run a process for every poller in my harvest.yml.

Actual behavior
It only runs two pollers, sometimes none of them.

Possible solution, workaround, fix
Starting the gathering using the executable: "/opt/harvest/bin/harvest" instead of the service


cgrinds commented Jun 7, 2021

I have a theory on what's going on, but first a few questions:

  • can you share the logs in /var/log/harvest/ or check if there was anything logged for some of the pollers that didn't start?
  • did you upgrade from a previous version of Harvest?
  • What does which harvest return? You may see some pollers not starting because you have multiple harvest.yml files and the incorrect one is being used. We can find those with find / -name 'harvest.yml'.

One issue you hit is running /opt/harvest/bin/harvest as root.
When you do that, the pollers will write a pidfile in /var/run/harvest/ owned by root.

Later when you try to start|stop|restart using systemctl the Harvest service unit file tells systemctl to use the harvest user. Since the pidfiles were created by root, the harvest user won't have permission to read or change those files.

To fix this (as root):

  • Stop the pollers
  • rm /var/run/harvest/*
  • systemctl restart harvest
  • verify with ps aux | grep poller that the poller processes are owned by the harvest user
  • verify that ls -la /var/run/harvest shows the pidfiles are owned by harvest

As root, you could also su to the harvest user and then safely use /opt/harvest/bin/harvest directly.
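The pidfile logic described above can be sketched in shell: a pidfile is only trustworthy if the PID it names still maps to a running process. This is a generic illustration, not Harvest's actual code; kill -0 probes a PID without sending a signal.

```shell
# Sketch: decide whether a pidfile is live or stale (generic illustration;
# the real pidfiles live in /var/run/harvest/).
pidfile=$(mktemp)

echo $$ > "$pidfile"                 # our own shell's PID: definitely running
if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
  status1=live
else
  status1=stale
fi

sh -c ':' & dead=$!                  # start and immediately reap a child
wait "$dead"                         # after wait, that PID no longer exists
echo "$dead" > "$pidfile"
if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
  status2=live
else
  status2=stale
fi

echo "$status1 $status2"             # prints "live stale"
rm -f "$pidfile"
```

This also shows why root-owned pidfiles break the service: the check itself is cheap, but the harvest user must be able to read and remove the files to manage the pollers.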


faguayot commented Jun 8, 2021

Hello Chris,

I'll try to answer your questions:

  • No, I didn't find anything in /var/log/harvest related to my start attempt. I did find information about the service in /var/log/messages.

image

  • Yes, I've upgraded the Harvest version. Could that be a problem?
  • Here is the output of which harvest, and the harvest.yml files we have on our VM.
    image

I was aware of the harvest user used by the service before I opened the issue. I had already tried what you suggested: stopping Harvest manually and checking that there were no poller processes and no PID files in /var/run/harvest.

Here are the results of following your steps:

The permissions for /var/run/harvest:

image

No poller processes after stopping the pollers with harvest stop:

image

The messages that appear after starting the service:
image

The last test I did was logging into the VM as the harvest user and running harvest start, without success.

image


cgrinds commented Jun 8, 2021

Hi faguayot,

Thanks for the details and screenshots.

The fact that you have a harvest binary in bin/harvest implies you have multiple versions of harvest installed since the RPM does not install there. As you mentioned, I'm assuming this is from an earlier install.

When using systemctl, the version of Harvest that will be used is the one in /opt/harvest/. That means /etc/harvest/harvest.yml will not be used. It would be less confusing to remove the old version of Harvest. Make sure there aren't any changes you want to keep and then remove /etc/harvest/.

Interesting that in your last example none of the pollers started. We're going in the wrong direction :) Let's try starting just one poller in the foreground. Maybe there are errors that are being missed.

Make a copy of your harvest.yml like so:
cp /opt/harvest/harvest.yml /opt/harvest/one.yml

Edit /opt/harvest/one.yml and remove or comment out all pollers but one.

Login or su as the harvest user

cd /opt/harvest/
bin/harvest --config one.yml --foreground

Hopefully you'll get some logging to the terminal that helps us figure out what's wrong.


faguayot commented Jun 9, 2021

Hi Chris,

Sorry, I think I confused you: I forgot to mention that /bin/harvest is only a symbolic link, which I created so that the harvest command is recognized from any path.

image

I removed the file /etc/harvest/harvest.yml and then followed the steps to start one poller in the foreground, but it seems my harvest doesn't recognize the --foreground flag.

image

I also tried removing the --foreground flag; the result is the same, just without the error about the flag.

image

Thanks.
Best regards.


cgrinds commented Jun 9, 2021

Yes, I left out the start command.
Can you try the following bin/harvest start --config one.yml --foreground


faguayot commented Jun 9, 2021

Good point!
I hadn't realized; I copied the command you gave me as-is. With the correct command it runs. Below I attached a screenshot of the output.

image


cgrinds commented Jun 9, 2021

That's a good start! SCES1P000 is running fine and that's one of the pollers that wasn't running earlier. What if you try running that same poller, but in the background like this: bin/harvest start --config one.yml

If you do that and then run the following, do you see the poller running?
ps aux | grep poller

What does the current log file for this poller show (see /var/log/harvest/) after trying to run in the background?

@faguayot

With the poller SCES1P000 in the background it works.

image

Yes, I see the poller I launched:

image

These are the new traces in the log file for this poller:
image


cgrinds commented Jun 10, 2021

More progress - options at this point:
A. add a few more clusters to one.yml and confirm they work
B. switch back to getting systemctl to work with your original harvest.yml. If you want to try this, change the harvest.yml that systemctl is going to use (the one in /opt/harvest/harvest.yml) to include only cluster SCES1P000, since we know it works.

With option B, make sure all pollers are stopped first, just so we're at a known state, then do the systemctl dance:
systemctl restart harvest
ps aux | grep poller

@faguayot

I went straight to starting the service and it ran correctly. Evidence below.

image


cgrinds commented Jun 11, 2021

Excellent! Looks like everything is working now.

@rahulguptajss

@faguayot Please close the issue if it's resolved.

@faguayot

Thanks Chris.
The issue was resolved.

vgratian pushed a commit that referenced this issue Jun 21, 2021
deb/rpm harvest.example changes

Handle special characters in passwords

This change only addresses passwords in Pollers and Defaults. The bigger
refactor is to use HarvestConfig through out the codebase, but that was too
big a change at the moment. That change touches a lot more code.

When that change is made, the code in conf.LoadConfig can be removed.

fix remaining merge

Enable GitHub code scanning

Remove extra fmt workflow action

Remove redundant Slack section and polish

Add Dev team to clabot

Add license check and GitHub action

add zerolog pretty print for console

InsecureSkipVerify with basicauth

Correct httpd logging pattern

Replace snake case with camel

Fix mistyped package

Shelf purges instances too soon

Fixes #75

update clabot

allow user-defined URL for the influxDB server

update conf tests, move allow_addrs_regex: not influxdb parameter

auth test cases

Change triage label

Replace CCLA.pdf with online link to CCLA

Remove CONTRIBUTING_CCLA.pdf

uniform structure of collector doc, add explanation about metric collection/calculation

add known issue on WSL

update toc

add rename example, remove tabs disliked by markdown

removed allow_addrs_regex, not a parameter

tab to space

tab to space

remove redundant TOC; spelling

typos in docs

support/hacks for workload objects

templates for 4 workload objects

re-add earlier removed disk counters

chrishenzie has signed the CCLA

Make vendored copy of dependencies

handle panic in collector

Allow insecure Grafana TLS connections

`harvest/grafana` should not rewrite https connections into http

Fixes #111

enable caller for zerolog

Remove buildmode=plugin

Add support for cluster simulator
WIP Implement Caddy style plugins for collectors
Fix go vet warnings in node.go

enable stacktrace during errors

InfluxDB exporter should pass url unchanged

Thanks to @steverweber for the suggestion
Fixes #63

Add unique prom ports and export type

checks to doctor

Prometheus dashboards don't load when exemplar = true

Fixes #96

Don't run harvest as root on RHEL/Deb

See also #122

Improve harvest start behavior

Two cases are improved here:
1) Harvest detects when there is a stale pidfile and correctly restarts the poller process. A stale pidfile is when the pidfile exists in `/var/run/harvest` but there is no running process associated with that pid.

1) Harvest no longer suggests killing an already running poller when you try to start it. This is a no-op.

Fixes #123

stop renamed pollers

resolved comments for stop pollers in case of rename

Addressed review comments Fixes #20

Restore Zapiperf support workload changes

add missing tag for labels pseudometric

cache ZAPI counters to distinguish from own metrics

Update needs triage label

rpm deb bugs Fixes #50 Fixes #129

Auth_style should not be redacted

Run workflows on release branch

Remove unused graphite_leaves

PrometheusPort should be int

Trim absolute file system paths

Add -trimpath to go build so errors and stacktraces print
with module path@version instead of this

{"level":"info","Poller":"infinity","collector":"ZapiPerf:WAFLAggr","caller":"/var/jenkins_home/workspace/BuildHarvestArtifacts/harvest/cmd/poller/collector/collector.go:318","time":"2021-06-11T13:40:03-04:00","message":"recovered from standby mode, back to normal schedule"}

correct ghost poll kill

Sridevi has signed CCLA

Update README.md

Added Upgrade steps to README file
Removed specific links in the Installation steps
Overall updated format

Polish README.md

Reduce redundant information
Make tar gz example copy pasteable

Fix panic in unix.go

When a poller in harvest.yml is changed while a unix collector is running it panics

Fixes #160

Remove pidfiles

- Improve poller detection by injecting IS_HARVEST into exec-ed process's
environment.
- Simplify management code and improve accuracy
- Remove /var/run logic from RPM and Deb
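The pidfile replacement above (detecting pollers via an IS_HARVEST marker in the process environment) can be sketched on Linux by scanning /proc/<pid>/environ. The variable name comes from the commit message; its value here and the /proc scan are illustrative assumptions, not Harvest's actual implementation.

```shell
# Sketch (Linux-only, illustrative): detect a process by an environment
# marker instead of a pidfile. The value IS_HARVEST=1 is an assumption.
IS_HARVEST=1 sleep 5 &
pid=$!

# /proc/<pid>/environ is NUL-separated; translate to lines and search.
if tr '\0' '\n' < "/proc/$pid/environ" | grep -q '^IS_HARVEST='; then
  found=yes
else
  found=no
fi

kill "$pid" 2>/dev/null
echo "$found"
```

Unlike a pidfile, the environment marker cannot go stale: when the process exits, its /proc entry disappears with it.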

script to validate metrics at runtime

typo

update changelog

update support md

update readme

run ghost kill poller during harvest start

Store reason as a label for disk.yaml so

that disk status is correctly reported

Fixes #182

check trailing newline needs to be done before splitlines

make sure stream trails with newline

label value can be empty

fix mistake in label regex

include empty keys, to make sure label set is consistent

fix export options, to avoid duplicate labels

properly parse boolean parameters

avoid metric name conflict

fix return value when nothing is scraped

drop using lib alias

typo in plugin params

Correcting Grafana Cluster Dashboard Typo plus other same typos

port range changes

resolved merge commits

port range review comments

Encapsulate port mapping

port range changes

Reduce the amount of time and attempts spinning

for status checks

Makes a big difference on Mac when process is not found
Goes from 19.5 seconds to (not) start 27 pollers to
1.9 seconds

Add README on how to setup per poller systemd

services.

Add generate systemd subcommand

check for duplicate metatags, since telegraf complains about this as well

ugly temporary solution against duplicate metatags

temporary fix to duplicate node labels, until fixed in Aggregator plugin

resolve conflicting names with system_node.yaml, to prevent label inconsistency

shelf dashboard: adding override option for shelf field

Node Dashboard Bugs
vgratian pushed a commit that referenced this issue Jun 22, 2021
* script to validate metrics at runtime

* typo

* check trailing newline needs to be done before splitlines

* make sure stream trails with newline

* label value can be empty

* fix mistake in label regex

* include empty keys, to make sure label set is consistent

* fix export options, to avoid duplicate labels

* properly parse boolean parameters

* avoid metric name conflict

* fix return value when nothing is scraped

* drop using lib alias

* typo in plugin params

* check for duplicate metatags, since telegraf complains about this as well

* ugly temporary solution against duplicate metatags

* temporary fix to duplicate node labels, until fixed in Aggregator plugin

* resolve conflicting names with system_node.yaml, to prevent label inconsistency

* harvest yml changes


Co-authored-by: rahulg2 <rahul.gupta@netapp.com>
4 participants