
Harvest service doesn't work as expected #122

Closed
faguayot opened this issue Jun 7, 2021 · 13 comments
faguayot commented Jun 7, 2021

Describe the bug

When I start the harvest service, it only runs two pollers out of the 11 we have defined in harvest.yml. If I run the harvest process directly with **/opt/harvest/bin/harvest start** or **/opt/harvest/bin/harvest restart**, all the pollers run correctly. The same happens with the different subcommands of the service, that is: start, restart, status and stop.

I attached an image of what we see when the service starts.

image

Environment

  • Harvest version: harvest version 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: RHEL 8.2
  • Install method: yum
  • ONTAP Version: 9.5 and 9.7
  • Other:

To Reproduce
Running the service: systemctl start harvest.service

Expected behavior
It should run a process for every poller in my harvest.yml.

Actual behavior
It only runs two pollers, sometimes none of them.

Possible solution, workaround, fix
Starting the gathering using the executable: "/opt/harvest/bin/harvest" instead of the service


cgrinds commented Jun 7, 2021

I have a theory on what's going on, but first a few questions:

  • can you share the logs in /var/log/harvest/ or check if there was anything logged for some of the pollers that didn't start?
  • did you upgrade from a previous version of Harvest?
  • What does which harvest return? You may see some pollers not starting because you have multiple harvest.yml files and the incorrect one is being used. We can find those with find / -name 'harvest.yml'.

One issue you hit is running /opt/harvest/bin/harvest as root.
When you do that, the pollers will write a pidfile in /var/run/harvest/ owned by root.

Later when you try to start|stop|restart using systemctl the Harvest service unit file tells systemctl to use the harvest user. Since the pidfiles were created by root, the harvest user won't have permission to read or change those files.

To fix this (as root):

  • Stop the pollers
  • rm /var/run/harvest/*
  • systemctl restart harvest
  • verify with ps aux | grep poller that the poller processes are owned by the harvest user
  • verify that ls -la /var/run/harvest shows the pidfiles are owned by harvest

As root, you could also su to the harvest user and then safely use /opt/harvest/bin/harvest directly.
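The pidfile logic described above can be sketched in shell: a pidfile is only trustworthy if the PID it names still maps to a running process. This is a generic illustration, not Harvest's actual code; kill -0 probes a PID without sending a signal.

```shell
# Sketch: decide whether a pidfile is live or stale (generic illustration;
# the real pidfiles live in /var/run/harvest/).
pidfile=$(mktemp)

echo $$ > "$pidfile"                 # our own shell's PID: definitely running
if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
  status1=live
else
  status1=stale
fi

sh -c ':' & dead=$!                  # start and immediately reap a child
wait "$dead"                         # after wait, that PID no longer exists
echo "$dead" > "$pidfile"
if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
  status2=live
else
  status2=stale
fi

echo "$status1 $status2"             # prints "live stale"
rm -f "$pidfile"
```

This also shows why root-owned pidfiles break the service: the check itself is cheap, but the harvest user must be able to read and remove the files to manage the pollers.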


faguayot commented Jun 8, 2021

Hello Chris,

I'll try to answer your questions:

  • No, I didn't find anything in /var/log/harvest related to my start attempt. I did find information about the service in /var/log/messages.

image

  • Yes, I've upgraded the Harvest version. Could that be a problem?
  • Here is the output of which harvest, and the harvest.yml files we have on our VM.
    image

I was aware of the harvest user used by the service before I opened the issue. I had already tried what you suggested: stopping Harvest manually and checking that there were no poller processes and no PID files in /var/run/harvest.

Here are the results of following your steps:

The permissions for /var/run/harvest:

image

No poller processes after stopping the pollers with harvest stop:

image

The messages that appear after starting the service:
image

The last test I did was logging into the VM as the harvest user and running harvest start, without success.

image


cgrinds commented Jun 8, 2021

Hi faguayot,

Thanks for the details and screenshots.

The fact that you have a harvest binary in bin/harvest implies you have multiple versions of harvest installed since the RPM does not install there. As you mentioned, I'm assuming this is from an earlier install.

When using systemctl, the version of Harvest that will be used is the one in /opt/harvest/. That means /etc/harvest/harvest.yml will not be used. It would be less confusing to remove the old version of Harvest. Make sure there aren't any changes you want to keep and then remove /etc/harvest/.

Interesting that in your last example none of the pollers started. We're going in the wrong direction :) Let's try starting just one poller in the foreground. Maybe there are errors that are being missed.

Make a copy of your harvest.yml like so:
cp /opt/harvest/harvest.yml /opt/harvest/one.yml

Edit /opt/harvest/one.yml and remove or comment out all pollers but one.

Login or su as the harvest user

cd /opt/harvest/
bin/harvest --config one.yml --foreground

Hopefully you'll get some logging to the terminal that helps us figure out what's wrong.


faguayot commented Jun 9, 2021

Hi Chris,

Sorry, I think I confused you: I forgot to mention that /bin/harvest is only a symbolic link, which I created so that the harvest command is recognized from any path.

image

I removed the file /etc/harvest/harvest.yml and then followed the steps to start one poller in the foreground, but it seems my harvest doesn't recognize the --foreground flag.

image

I also tried removing the --foreground flag; the result is the same, just without the error about the flag.

image

Thanks.
Best regards.


cgrinds commented Jun 9, 2021

Yes, I left out the start command.
Can you try the following bin/harvest start --config one.yml --foreground


faguayot commented Jun 9, 2021

Good point!
I hadn't realized; I copied the command you gave me as-is. With the correct command it runs. Below I attached a screenshot of the output.

image


cgrinds commented Jun 9, 2021

That's a good start! SCES1P000 is running fine and that's one of the pollers that wasn't running earlier. What if you try running that same poller, but in the background like this: bin/harvest start --config one.yml

If you do that and then run the following, do you see the poller running?
ps aux | grep poller

What does the current log file for this poller show (see /var/log/harvest/) after trying to run in the background?

@faguayot

With the poller SCES1P000 in the background it works.

image

Yes, I see the poller I launched:

image

These are the new traces in the log file for this poller:
image


cgrinds commented Jun 10, 2021

More progress - options at this point:
A. add a few more clusters to one.yml and confirm they work
B. switch back to getting systemctl to work with your original harvest.yml. If you want to try this, change the harvest.yml that systemctl is going to use (the one in /opt/harvest/harvest.yml) to include only cluster SCES1P000, since we know it works.

With option B, make sure all pollers are stopped first, just so we're at a known state, then do the systemctl dance:
systemctl restart harvest
ps aux | grep poller

@faguayot

I went straight to starting the service and it ran correctly. Evidence below.

image


cgrinds commented Jun 11, 2021

Excellent! Looks like everything is working now.

@rahulguptajss

@faguayot Please close the issue if it's resolved.

@faguayot

Thanks Chris.
The issue was resolved.

vgratian pushed a commit that referenced this issue Jun 21, 2021
deb/rpm harvest.example changes

Handle special characters in passwords

This change only addresses passwords in Pollers and Defaults. The bigger
refactor is to use HarvestConfig through out the codebase, but that was too
big a change at the moment. That change touches a lot more code.

When that change is made, the code in conf.LoadConfig can be removed.

fix remaining merge

Enable GitHub code scanning

Remove extra fmt workflow action

Remove redundant Slack section and polish

Add Dev team to clabot

Add license check and GitHub action

add zerolog pretty print for console

InsecureSkipVerify with basicauth

Correct httpd logging pattern

Replace snake case with camel

Fix mistyped package

Shelf purges instances too soon

Fixes #75

update clabot

allow user-defined URL for the influxDB server

update conf tests, move allow_addrs_regex: not influxdb parameter

auth test cases

Change triage label

Replace CCLA.pdf with online link to CCLA

Remove CONTRIBUTING_CCLA.pdf

uniform structure of collector doc, add explanation about metric collection/calculation

add known issue on WSL

update toc

add rename example, remove tabs disliked by markdown

removed allow_addrs_regex, not a parameter

tab to space

tab to space

remove redundant TOC; spelling

typos in docs

support/hacks for workload objects

templates for 4 workload objects

re-add earlier removed disk counters

chrishenzie has signed the CCLA

Make vendored copy of dependencies

handle panic in collector

Allow insecure Grafana TLS connections

`harvest/grafana` should not rewrite https connections into http

Fixes #111

enable caller for zerolog

Remove buildmode=plugin

Add support for cluster simulator
WIP Implement Caddy style plugins for collectors
Fix go vet warnings in node.go

enable stacktrace during errors

InfluxDB exporter should pass url unchanged

Thanks to @steverweber for the suggestion
Fixes #63

Add unique prom ports and export type

checks to doctor

Prometheus dashboards don't load when exemplar = true

Fixes #96

Don't run harvest as root on RHEL/Deb

See also #122

Improve harvest start behavior

Two cases are improved here:
1) Harvest detects when there is a stale pidfile and correctly restarts the poller process. A stale pidfile is when the pidfile exists in `/var/run/harvest` but there is no running process associated with that pid.

1) Harvest no longer suggests killing an already running poller when you try to start it. This is a no-op.

Fixes #123

stop renamed pollers

resolved comments for stop pollers in case of rename

Addressed review comments Fixes #20

Restore Zapiperf support workload changes

add missing tag for labels pseudometric

cache ZAPI counters to distinguish from own metrics

Update needs triage label

rpm deb bugs Fixes #50 Fixes #129

Auth_style should not be redacted

Run workflows on release branch

Remove unused graphite_leaves

PrometheusPort should be int

Trim absolute file system paths

Add -trimpath to go build so errors and stacktraces print
with module path@version instead of this

{"level":"info","Poller":"infinity","collector":"ZapiPerf:WAFLAggr","caller":"/var/jenkins_home/workspace/BuildHarvestArtifacts/harvest/cmd/poller/collector/collector.go:318","time":"2021-06-11T13:40:03-04:00","message":"recovered from standby mode, back to normal schedule"}

correct ghost poll kill

Sridevi has signed CCLA

Update README.md

Added Upgrade steps to README file
Removed specific links in the Installation steps
Overall updated format

Polish README.md

Reduce redundant information
Make tar gz example copy pasteable

Fix panic in unix.go

When a poller in harvest.yml is changed while a unix collector is running it panics

Fixes #160

Remove pidfiles

- Improve poller detection by injecting IS_HARVEST into exec-ed process's
environment.
- Simplify management code and improve accuracy
- Remove /var/run logic from RPM and Deb
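The pidfile replacement above (detecting pollers via an IS_HARVEST marker in the process environment) can be sketched on Linux by scanning /proc/<pid>/environ. The variable name comes from the commit message; its value here and the /proc scan are illustrative assumptions, not Harvest's actual implementation.

```shell
# Sketch (Linux-only, illustrative): detect a process by an environment
# marker instead of a pidfile. The value IS_HARVEST=1 is an assumption.
IS_HARVEST=1 sleep 5 &
pid=$!

# /proc/<pid>/environ is NUL-separated; translate to lines and search.
if tr '\0' '\n' < "/proc/$pid/environ" | grep -q '^IS_HARVEST='; then
  found=yes
else
  found=no
fi

kill "$pid" 2>/dev/null
echo "$found"
```

Unlike a pidfile, the environment marker cannot go stale: when the process exits, its /proc entry disappears with it.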

script to validate metrics at runtime

typo

update changelog

update support md

update readme

run ghost kill poller during harvest start

Store reason as a label for disk.yaml so

that disk status is correctly reported

Fixes #182

check trailing newline needs to be done before splitlines

make sure stream trails with newline

label value can be empty

fix mistake in label regex

include empty keys, to make sure label set is consistent

fix export options, to avoid duplicate labels

properly parse boolean parameters

avoid metric name conflict

fix return value when nothing is scraped

drop using lib alias

typo in plugin params

Correcting Grafana Cluster Dashboard Typo plus other same typos

port range changes

resolved merge commits

port range review comments

Encapsulate port mapping

port range changes

Reduce the amount of time and attempts spinning

for status checks

Makes a big difference on Mac when process is not found
Goes from 19.5 seconds to (not) start 27 pollers to
1.9 seconds

Add README on how to setup per poller systemd

services.

Add generate systemd subcommand

check for duplicate metatags, since telegraf complains about this as well

ugly temporary solution against duplicate metatags

temporary fix to duplicate node labels, until fixed in Aggregator plugin

resolve conflicting names with system_node.yaml, to prevent label inconsistency

shelf dashboard: adding override option for shelf field

Node Dashboard Bugs
vgratian pushed a commit that referenced this issue Jun 22, 2021
* script to validate metrics at runtime

* typo

* check trailing newline needs to be done before splitlines

* make sure stream trails with newline

* label value can be empty

* fix mistake in label regex

* include empty keys, to make sure label set is consistent

* fix export options, to avoid duplicate labels

* properly parse boolean parameters

* avoid metric name conflict

* fix return value when nothing is scraped

* drop using lib alias

* typo in plugin params

* check for duplicate metatags, since telegraf complains about this as well

* ugly temporary solution against duplicate metatags

* temporary fix to duplicate node labels, until fixed in Aggregator plugin

* resolve conflicting names with system_node.yaml, to prevent label inconsistency

* harvest yml changes


Co-authored-by: rahulg2 <rahul.gupta@netapp.com>
4 participants