
does nomad must honor collection_interval when collect telemetry by allocations? #6961

Closed
tantra35 opened this issue Jan 20, 2020 · 10 comments · Fixed by #7043

Comments

@tantra35
Contributor

tantra35 commented Jan 20, 2020

Nomad version

Nomad v0.10.2 (40edb4d+CHANGES)

Is it normal that Nomad doesn't honor collection_interval (https://www.nomadproject.io/docs/configuration/telemetry.html#collection_interval) when collecting statistics for allocations?

For example, on a Nomad client we have the following telemetry configuration:

telemetry {
	disable_hostname                       = true
	publish_allocation_metrics             = true
	publish_node_metrics                   = true
	collection_interval                    = "60s"
	disable_dispatched_job_summary_metrics = true
	statsd_address                         = "localhost:8125"
	prefix_filter = [
		"-nomad.runtime",
	]
}

I expected allocation telemetry to be collected every 60 seconds, but this is not the case: allocation statistics are collected every 1 second. At localhost:8125 we run a custom aggregating statsd proxy that intercepts all metrics, and its logs show the following frequency:

[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.memory.kernel_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.memory.allocated.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.cpu.total_percent.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.cpu.system.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.user.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_time.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_periods.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.total_ticks.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.rss.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.cache.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.swap.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.kernel_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.allocated.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.cpu.total_percent.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.cpu.system.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.user.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_time.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_periods.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.total_ticks.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.364] DEBUG Got metric: nomad.client.allocs.memory.rss.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.364] DEBUG Got metric: nomad.client.allocs.memory.cache.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.364] DEBUG Got metric: nomad.client.allocs.memory.swap.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.kernel_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.allocated.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.cpu.total_percent.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.366] DEBUG Got metric: nomad.client.allocs.cpu.system.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.user.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.throttled_time.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.throttled_periods.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.total_ticks.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.rss.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.cache.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.swap.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default

For example, the metric nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default is collected every 1 second.

@tantra35 tantra35 changed the title does nomad must honor collection_interval when collect data by allocations? does nomad must honor collection_interval when collect telemetry by allocations? Jan 20, 2020
@tgross
Member

tgross commented Jan 21, 2020

Hi @tantra35! It looks like client stats are collected at the specified interval, but task resource usage stats collection is a tight loop around the collection here: stats_hook.go#L87-L112. Based on the more detailed treatment of the configuration in the telemetry metrics docs, I'm not sure whether that's intentional and simply a documentation gap.

@tgross
Member

tgross commented Jan 21, 2020

My colleague @notnoop has pointed out to me that we do pass the interval to the task driver's Stats() method (see here), so that drivers can run stats collection on that interval. The exact implementation is up to the driver, though. So either we're missing a spot to thread the interval through from the config, or the specific task driver isn't implementing it correctly. Which task driver are you using here, @tantra35?

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 21, 2020
@tgross tgross moved this from Needs Triage to Triaged in Nomad - Community Issues Triage Jan 21, 2020
@tantra35
Contributor Author

@tgross on our test stand, for simplicity, we use the exec driver

@tantra35
Contributor Author

tantra35 commented Jan 22, 2020

@tgross thanks for clarifying how the code works. After some debugging I found that the value of interval passed to exec.(*Driver).TaskStats is 1s:

(dlv) break exec.(*Driver).TaskStats
(dlv) c
> github.com/hashicorp/nomad/drivers/exec.(*Driver).TaskStats() /opt/gopath/src/github.com/hashicorp/nomad/drivers/exec/driver.go:482 (hits goroutine(2011):1 total:3) (PC: 0x141bbe3)
Warning: debugging optimized function
(dlv) bt
0  0x000000000141bbe3 in github.com/hashicorp/nomad/drivers/exec.(*Driver).TaskStats
   at /opt/gopath/src/github.com/hashicorp/nomad/drivers/exec/driver.go:482
1  0x0000000001311332 in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Stats
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:49
2  0x0000000001311332 in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*LazyHandle).Stats
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/lazy_handle.go:141
3  0x0000000001316e4d in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*statsHook).callStatsWithRetry
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/stats_hook.go:126
4  0x0000000001316d65 in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*statsHook).collectResourceUsageStats
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/stats_hook.go:92
5  0x00000000004625e1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1337
(dlv) print interval
github.com/hashicorp/nomad/vendor/github.com/gorilla/websocket.writeWait (1000000000)

so the problem is in the interval-passing code, and it looks like none of the drivers honor collection_interval

@tgross
Member

tgross commented Jan 22, 2020

Thanks @tantra35! We'll look into this!

@tgross tgross added this to the unscheduled milestone Jan 22, 2020
@tantra35
Contributor Author

@tgross ah sorry, I just found that this was a misconfiguration in our provisioning, so the bug isn't actually present

Nomad - Community Issues Triage automation moved this from Triaged to Done Jan 22, 2020
@tantra35
Contributor Author

@tgross after some investigation I must reopen this issue, because the problem actually exists, at least in the exec driver. The interval value is passed correctly (we drew the wrong conclusion due to a mistake in our provisioning: the collection_interval value was not set), but the other information mentioned at the beginning of this issue is correct

@tantra35 tantra35 reopened this Jan 22, 2020
Nomad - Community Issues Triage automation moved this from Done to Triaged Jan 22, 2020
@tgross
Member

tgross commented Jan 22, 2020

No problem. 😀 We'll take a look.

@tgross tgross self-assigned this Jan 22, 2020
Nomad - Community Issues Triage automation moved this from Triaged to Done Jan 31, 2020
@tgross
Member

tgross commented Feb 5, 2020

Repro from our testing for 0.10.4's upcoming release candidate.

Run an exec job and find its allocs:

▶ nomad job run e2e/metrics/input/helloworld.nomad
▶ nomad job status helloworld
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
5ebd84ce  d442d255  hello       0        run      running  8s ago   8s ago
602b99d1  0427d6f3  hello       0        run      running  8s ago   7s ago
feaf805d  d442d255  hello       0        run      running  8s ago   7s ago

ssh into the node d442d255. Update the telemetry config at /etc/nomad.d/base.hcl to increase the collection interval and emit stats:

▶ diff e2e/terraform/shared/nomad/base.hcl base.hcl
14c14,15
<   collection_interval        = "1s"
---
>   collection_interval        = "60s"
>   statsd_address             = "localhost:8125"

Create a statsd config:

{
  graphitePort: 2003
  , port: 8125
  , backends: [ "./backends/console" ]
  , dumpMessages: true
}

Run a statsd collector on the node, using that configuration file:

$ sudo docker run -d --name statsd \
    -v "$(pwd)/config.js:/usr/src/app/config.js" \
    --network=host statsd/statsd:latest

Restart the Nomad client, wait a while, and then check that statsd is receiving metrics at the expected rate. We receive DEBUG messages every 60 seconds; these are Nomad's collection intervals. The lines between them (every 10 seconds) are statsd flushing all the metrics it knows to the console.

$ sudo systemctl restart nomad

$ sudo docker logs -f statsd | grep 'cpu\.total_ticks\.helloworld\.hello\.feaf'
5 Feb 19:52:00 - DEBUG: nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default:0.383332|g
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,
5 Feb 19:53:00 - DEBUG: nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default:0.383332|g
     'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332,

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 23, 2022