Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the officially recommended metrics #25

Merged
merged 2 commits into from
Sep 7, 2016
Merged

Add the officially recommended metrics #25

merged 2 commits into from
Sep 7, 2016

Conversation

SimenB
Copy link
Collaborator

@SimenB SimenB commented Aug 31, 2016

Hi there!

All Prometheus clients are recommended to export some basic metrics https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors.

Descriptions stolen from the official Go client, https://github.com/prometheus/client_golang/blob/master/prometheus/process_collector.go.

This PR adds all of them, although most will only work on Linux.

No tests added in case you're not interested in the addition.

Example file for testing:

var prom = require('./');

prom.defaultMetrics([], 500);

setInterval(function () {
    console.log(prom.register.metrics());
}, 500);

Example Dockerfile for testing

FROM node:6

COPY . .

CMD ["node", "testfile.js"]

Example output of running app:

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total gauge
process_cpu_seconds_total 33.73
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1472657378
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 30420000
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1174712000
# HELP process_heap_bytes Process heap size in bytes.
# TYPE process_heap_bytes gauge
process_heap_bytes 1131760000
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 198652

Could probably add some metrics regarding stuff like GC etc.

EDIT: In hindsight, I can see that I should just check process.platform for linux instead of try-catching... I'll make that change tomorrow along with GC metrics!

EDIT2: Here is the table with recommended metrics:

Metric name Help string Unit
process_cpu_seconds_total Total user and system CPU time spent in seconds. seconds
process_open_fds Number of open file descriptors. file descriptors
process_max_fds Maximum number of open file descriptors. file descriptors
process_virtual_memory_bytes Virtual memory size in bytes. bytes
process_resident_memory_bytes Resident memory size in bytes. bytes
process_heap_bytes Process heap size in bytes. bytes
process_start_time_seconds Start time of the process since unix epoch in seconds. seconds

@siimon
Copy link
Owner

siimon commented Aug 31, 2016

Interesting, thanks!

I'm away for this week and half of next so won't have a chance to dive into the code. But all of your suggestions seems reasonable and should be implemented!

A few other thoughts,

  1. Why have a defaultMetrics? Shouldn't we include them in the metrics function, with a setting if they should be included or not?
  2. Is it possible to include GC metrics without adding a 3rd party dependency? If not I'm hesitant to add it
  3. Maybe we should add a option object where you can disable any of the defaultMetrics?
  4. Adding some documentation to the readme would be nice as well!

Did my thoughts make any sense?

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

  1. That'd mean there'd be no way to disabling actually getting the metrics, which is potentially a somewhat expensive operation (as it's I/O).
  2. IDK. I remember there being added GC metrics to Node back in 2013 or something, but it seems to have been removed in later iterations... I'll dig around!
  3. That I can add
  4. Yup, I dropped this, along with tests, in case you weren't interested in this change 😄

One last things; at work we have our own wrapper around this library adding a few more metrics (event loop lag, OS load, process handles and process requests), that are all something node exposes. Would you be interested in adding those as well?

@siimon
Copy link
Owner

siimon commented Sep 1, 2016

actually, when thinking about this point 1 and 3 are pretty much the same thing. Would it make any sense to add a configuration object into the metrics function where you can disable/enable what metrics you want? Then it's just a matter of selecting sensible defaults.

I like your ideas, we do the same thing. I don't see any downside to this, as long as we have a good api where it's easy to enable and disable any of them. Just keep in mind that I'm trying to keep the number of dependencies as low as possible!

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

There is no "start measuring" right now, so disabling them can't be done until you first query for them, and that'd mean keeping some state around. So I still prefer having a "start polling default metrics", and this function can take options disabling them or not.

All of those extras are dependency free, just using node's own APIs 😄

I'm back at work now, so I'll spend some time on it.

@siimon
Copy link
Owner

siimon commented Sep 1, 2016

True, sounds great! :)

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

I opened a PR with Prometheus to document this a bit better btw. If you have some comments, that'd be appreciated!

prometheus/docs#529

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

@siimon This is ready to go now 😄

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

It's a bit hard to get the tests to pass when it depends on both node version and os. Ideas?

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

@siimon Latest commit makes it pass, but I'm not happy with it... Ideas?

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 1, 2016

I dug into GC, and it got pulled from Node here: nodejs/node#174

And open issue about adding same sort of data back is here: nodejs/node#4496

And this userland module seems to fit the bill. I can create a project exposing this along with registering it with prom-client, which can be linked in the readme here?

@siimon
Copy link
Owner

siimon commented Sep 2, 2016

A couple of ideas,

  1. Update the travis ci with more node versions, maybe one test per major version or something?
  2. Not sure if node allows it, but the platform variable is probably mutable, so you could change it in your test, then put it back when your test is done. Not that pretty either but it works.

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 2, 2016

  1. Added separate PR for it, which I'll rebase upon (Test on all supported majors of node, and OSX #28)
  2. Good idea! I've done something similar with process.version in one of my own projects

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 4, 2016

I'll try to wrap up the tests at work tomorrow, but I came over prometheus/docs#506, and it touches upon the fact that default metrics should be registered by default, but able to be turned off.

Thoughts on that? IDK if it'd be considered a breaking change, but it's just a number anyways 😄

@siimon
Copy link
Owner

siimon commented Sep 4, 2016

Yeah, the best thing would be if users of this lib doesn't have to do anything, and as you said I think we should do a major bump so we don't introduce any weird duplicate metrics for those who are doing this by themselves.

@jsha
Copy link

jsha commented Sep 4, 2016

Thanks for working on this! I am planning to make use of it as soon as it lands.

I agree these metrics should be enabled by default, and agree that that's consistent with the Prometheus docs.

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 5, 2016

@siimon Should be good to go now.

I suggest removing the lower case exports (client.counter etc.) in this major as well 😄

value: 'my-bogus-platform'
});

if(cpuUsage) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to support both node@<6.1 and node@>=6.1

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 5, 2016

I'll create a separate PR for some other useful metrics, such as event loop lag and process handles/requests. Would be nice to include that as well.

@jsha @siimon Any ideas if memoryUsage could be used better? WRT what prometheus expects. I updated OP with the table of data they suggest.

@SimenB SimenB mentioned this pull request Sep 5, 2016
@jsha
Copy link

jsha commented Sep 5, 2016

I think process_heap_bytes is intended to be the total size of the heap, which I believe better matches process.memoryUsage().heapTotal. I think it would also be useful to report .heapUsed, which I don't think corresponds to any of the standard Prometheus metrics.

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 6, 2016

I did them as they are because of the description in the original word document: https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit

@jsha
Copy link

jsha commented Sep 6, 2016

The word document says:

Heap size: process_heap_bytes. ??? DATA as reported from ps, this comes from VmData in /proc/self/status ???

But in man ps I don't see anything about a DATA section. The closest thing I can see in procps-ng version 3.3.10 is:

drs         DRS       data resident set size, the amount of physical memory devoted to other than executable code.

If I run a node REPL, and run ps ao args,drs | grep node I get:

node                        917122

Or ~9MB. From the REPL itself I get:

> process.memoryUsage().heapTotal
10619424

Not sure why Node thinks it has more heap than ps does. It could be a distinction between resident size and non-resident. At any rate, I don't have strong opinions here. I'm happy to proceed with what you've got implemented for process_heap_bytes.

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 6, 2016

Ok, thanks for the analysis 😄 Should be GTG then, shouldn't be any problem to iterate on this later if we figure out we've done it wrong

@siimon
Copy link
Owner

siimon commented Sep 6, 2016

Great :)

One thing we talked about earlier was enabling this by default, if I understood you correctly, the prom docs mentioned they should be enabled by default?

If that is the case, my suggestion is to enable them right away, and then instead have an exported function that stops collecting if you want to do that in your service.

I won't have any problems having it like that if we make a major release which I think we should

@SimenB
Copy link
Collaborator Author

SimenB commented Sep 6, 2016

It does collect automatically as soon as you require index.js. And yup, it's recommended in the docs to collect automatically.

And I agree it should be major bump.

@siimon
Copy link
Owner

siimon commented Sep 6, 2016

Bah, thanks, i'll do a test spin and stop writing dump comments then :)

@siimon
Copy link
Owner

siimon commented Sep 6, 2016

I've no further comments, @jsha do you have any input before merging?

@jsha
Copy link

jsha commented Sep 6, 2016

LGTM

@siimon siimon merged commit a74e5b3 into siimon:master Sep 7, 2016
@SimenB SimenB deleted the standard-metrics branch September 7, 2016 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants