Diagnostics "Best Practices" Guide? #211

mike-kaufman · 2018-07-06T21:39:39Z

Something that would be interesting & valuable to the community is if the WG could come up with a set of diagnostics best practices/techniques for production applications. I'm a little concerned that this is a bit too ambitious, and a little concerned that one person's "best practice" is another's "really dumb thing to do". But I'm willing to throw this out to see what comes out of it. :)

I have a few thoughts about how this could be structured, but would love to hear if anyone else has ideas/suggestions first.

joyeecheung · 2018-07-10T07:36:16Z

It would be great if we can come up with even just a guide with links to good resources and advertise it so people know what to read when they run into different types of problems in production. We always receive bug reports with obscure screenshots of resource usages in the core repo; documentation like that would be a better place to redirect people to than nodejs/help.

mhdawson · 2018-07-10T13:29:47Z

It would be great if we can put this together.

mike-kaufman · 2018-07-17T17:58:34Z

> into different types of problems in production. We always receive bug reports with obscure screenshots of resource usages in the core repo,

@joyeecheung, can you point me to some examples here?

joyeecheung · 2018-07-18T01:11:40Z

@mike-kaufman There should be a lot of hits if you search for the issues labeled memory: https://github.com/nodejs/node/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3Amemory+

gireeshpunathil · 2018-07-18T06:13:39Z

I love this idea, support it, and am willing to participate and contribute in whatever ways this group needs and I can.

While I agree that best practices can be subjective, there are elements of diagnostic steps that are impersonal and to the point. An example would be collecting and analysing heapdumps for a memory leak.

Regarding the structure, I see we can have different flows such as:

- organized based on symptoms [ memory, exception, performance, hang, crash ... ]
- organized based on tools [ tracer, profiler, dumper, inspector, debugger, reporter ... ]
- organized based on subsystems [ installer, net, fs, child_process, natives, uv, ... ]

I suggest the symptom based categorization as it leads to faster discovery for consumers.

Regarding the best practice content, again I see a few models:

- dump of complete diagnostic steps on a failing case, with example code
- screen-shot assisted illustration of diagnostic methodology
- plain text elaboration of tools and their usage

I suggest the first model as it leads to improved education for consumers.

mike-kaufman · 2018-07-18T15:42:36Z

@joyeecheung, @gireeshpunathil thanks. Let me strawman an outline and see if this matches what's in your head - feel free to tear it down. :)

Production Configuration Best Practices
- application logging & log management
  - breakouts for different app types
    - server apps
    - serverless
    - desktop
    - IOT
- APMs
- Node Internals Tracing
  - configuration
  - interpretation
Troubleshooting
- Profiling Perf Problems
- Analyzing Memory Leaks
- Analyzing Core Dumps

gireeshpunathil · 2018-07-19T11:51:35Z

Thanks @mike-kaufman. While the troubleshooting part is straightforward for me to relate to, the first part (production configuration best practices) looks much wider in scope to me:

- is it possible to suggest configurations for such wide deployment scenarios?
- even with, say, server apps, is it possible to generalize configurations?
- is it possible to propose configurations independent of the execution environment of such app types?

mmarchini · 2018-07-25T17:23:07Z

I gave a talk at NodeSummit about this topic, where I showed 6 tools suited for production environments. Even though the topic is subjective, I don’t think there’s much disagreement on which tools or techniques should be used (our current pool of production tools is not that large).

The tools I showed alongside the examples are available here if anyone is interested. I would like to help to write these guides :)

gireeshpunathil · 2018-07-26T14:32:24Z

I had a discussion with @mike-kaufman and the consensus was to start with a draft, iterate over PRs, and refine through collective intelligence.

So let us start with, say, Profiling Perf Problems, and then the others can follow the same structure. I will start looking at dump debugging.

mike-kaufman · 2018-07-26T17:11:37Z

@mmarchini - that's a great start :)

Thanks Gireesh. I'd like to ultimately get the content written to leverage github's auto-html-site feature, i.e., we'd be able to submit markdown updates to this repo, and it will be automatically rendered to html available at http://nodejs.github.io/diagnostics/bestPractices/.

mmarchini · 2018-07-26T17:26:25Z

@mike-kaufman auto-generating a website is an awesome idea! But maybe we should try to coordinate with @nodejs/website to have this content available in https://nodejs.org/ as well?

mike-kaufman · 2018-07-26T17:33:29Z

> @nodejs/website to have this content available in https://nodejs.org/ as well?

Yes, this would be good. I think as the first step though, we can start getting the content organized, and the github.io "auto-magic-web-site" is a really simple & cheap way for us to get that content rendered & reviewable, w/out the distraction of how we plug into their process.

@bnb - is there any thinking on how we can plug content into the new website? Ideally, we'd have a bunch of markdown & images here (in the diag repo), and this would just get "sucked up" into the website.

fhemberger · 2018-07-27T04:33:56Z

Please get in touch with @nodejs/website-redesign, as we are in the middle of planning the content structure for the website relaunch:

https://github.com/nodejs/website-redesign/issues/

misterdjules · 2018-08-15T16:33:11Z

For what it's worth, a long time ago I had written a set of guides to investigate various types of production issues with Node.js. @cjihrig kindly made those guides available publicly at https://github.com/joyent/node-debugging-methodologies. The content is mostly specific to SmartOS but the methodologies/concepts can almost always easily be transferred to other OSes.

mike-kaufman · 2018-08-15T16:54:07Z

@misterdjules - thanks, this is great!

keywordnew · 2018-08-15T16:54:25Z

There's a biweekly meeting for the website-redesign initiative, and the next one is on Thursday Aug 16th, 15:00 UTC.

Anyone here is welcome to attend. You're welcome to propose adding items to the agenda at the start of the meeting 👍🏽 If you wanted to know where this Diagnostics best practices guide could fit in, that'd be a good place to get direct comments.

Otherwise, please make an issue to get the discussion going!

The initiative is also working on creating guides for many things in Node.js. I don't have context for this diagnostics discussion, but it's possible a diagnostics guide could be part of the "analytics" or "ops" categories. Feel free to correct me here or on that issue :)

mhdawson · 2018-09-21T14:52:08Z

We should incorporate https://github.com/naugtur/node-diagnostics-howtos as well. @naugtur has been working to get some of these on the website. See nodejs/nodejs.org#1444

naugtur · 2018-09-22T19:19:55Z

I'm interested in contributing to this guide. It's likely the flame graph one will finally come through and I'll choose something simpler for the next one :)

I like the idea of organizing by symptoms.
Also, basic best practices should say what to switch on in production to even get the chance to diagnose anything (like profiling command-line switches or enabling core dumps at the OS level).
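
A minimal sketch of one such "switch on in production" hook, assuming a Node.js version that ships `v8.writeHeapSnapshot()` (11.13 or later). The names `dumpHeap` and `installHeapDumpOnSignal`, the file-name pattern, and the choice of `SIGUSR2` are illustrative assumptions, not something proposed in this thread:

```typescript
import { writeHeapSnapshot } from "node:v8";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: write a heap snapshot into `dir` and return its path.
// The file name pattern is an assumption chosen to keep snapshots distinct.
export function dumpHeap(dir: string = tmpdir()): string {
  const file = join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  // writeHeapSnapshot is synchronous and returns the file name it wrote.
  return writeHeapSnapshot(file);
}

// Hypothetical helper: take a snapshot whenever the process receives `signal`.
// SIGUSR1 is reserved by Node.js for the inspector, so SIGUSR2 is used here.
// Sending signals this way is not available on Windows.
export function installHeapDumpOnSignal(signal: NodeJS.Signals = "SIGUSR2"): void {
  process.on(signal, () => {
    try {
      const file = dumpHeap();
      console.error(`heap snapshot written to ${file}`);
    } catch (err) {
      // Never let a failed diagnostic dump take the application down.
      console.error("heap snapshot failed:", err);
    }
  });
}
```

Calling `installHeapDumpOnSignal()` once at startup and later running `kill -USR2 <pid>` would write a snapshot that can be loaded into the Chrome DevTools Memory tab. Writing a snapshot blocks the event loop while it runs, so it is better triggered deliberately than on a timer.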

Fishrock123 · 2018-10-12T22:49:15Z

I think we should aim this for content as part of the new website redesign.

> organized based on symptoms [ memory, exception, performance, hang, crash ... ]

IMO this is probably the most valuable for most users.

bnb · 2018-10-12T23:05:10Z

100% agreed.

@amiller-gh this is probably one you'll want to see 😄

amiller-gh · 2018-10-13T15:32:43Z

Yes please! If you’re looking for a place to store document drafts while we figure out a format and home for the content, you can PR them in to nodejs/website-redesign repo under /documentation using the template provided there.

gireeshpunathil · 2018-11-16T08:56:26Z

As discussed and decided in the just-concluded working group meeting (and probably in the last couple of meetings), I plan to set up one meeting (or more, until convergence) to define the next steps on this.

Goal is to be able to gather consensus on the content, and be able to identify some prioritization.

Does 21st Nov, 9:30 AM PST work for everyone? Please 👍 if you will be able to make it and the time suits you, thanks!

gireeshpunathil · 2018-11-20T07:58:09Z

Thanks @mhdawson and @mmarchini for expressing availability. However, given the lack of a reasonable number of responses, I think we cannot hold this meeting tomorrow and should move it to a later point in time.

Instead of me prescribing an alternative time, @nodejs/diagnostics, will you please express your availability within the next 10 days or so? Based on that I can schedule one. Thanks in advance!

gireeshpunathil · 2020-04-18T06:09:58Z

Removed from the WG meeting agenda, as discussed in the last meeting (rationale: the work is being progressed as part of user journey deep dives and subsequent documentation work. If doc work stalls, we could always re-insert this to gain focus).

mmarchini · 2020-04-28T23:20:28Z

As a reminder, tomorrow we're meeting (same time as always) to discuss diagnostics on CPU usage.

github-actions · 2020-07-28T00:38:58Z

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions · 2022-07-27T00:28:34Z

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

mike-kaufman added the diag-agenda label Aug 22, 2018

mhdawson mentioned this issue Sep 3, 2018

Node.js Foundation Diagnostics WorkGroup Meeting 2018-09-05 #228

Closed

mhdawson mentioned this issue Sep 17, 2018

Node.js Foundation Diagnostics WorkGroup Meeting 2018-09-19 #233

Closed

mhdawson mentioned this issue Oct 1, 2018

Node.js Foundation Diagnostics WorkGroup Meeting 2018-10-03 #237

Closed

mhdawson mentioned this issue Oct 15, 2018

Node.js Foundation Diagnostics WorkGroup Meeting 2018-10-17 #243

Closed

mhdawson mentioned this issue Oct 29, 2018

Node.js Foundation Diagnostics WorkGroup Meeting 2018-10-31 #250

Closed

mhdawson mentioned this issue Nov 12, 2018

Node.js Foundation Diagnostics WorkGroup Meeting 2018-11-14 #252

Closed

mhdawson mentioned this issue Jul 1, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-07-03 #312

Closed

mhdawson mentioned this issue Jul 15, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-07-17 #317

Closed

mhdawson mentioned this issue Jul 29, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-07-31 #321

Closed

mhdawson mentioned this issue Aug 12, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-08-14 #323

Closed

mhdawson mentioned this issue Aug 26, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-08-28 #325

Closed

mhdawson mentioned this issue Sep 9, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-09-11 #326

Closed

mhdawson mentioned this issue Sep 23, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-09-25 #328

Closed

mhdawson mentioned this issue Oct 7, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-10-09 #330

Closed

mhdawson mentioned this issue Oct 21, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-10-23 #334

Closed

mhdawson mentioned this issue Nov 4, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-11-06 #335

Closed

mhdawson mentioned this issue Nov 18, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-11-20 #341

Closed

mhdawson mentioned this issue Dec 2, 2019

Node.js Foundation Diagnostics WorkGroup Meeting 2019-12-04 #342

Closed

mhdawson mentioned this issue Jan 13, 2020

Node.js Foundation Diagnostics WorkGroup Meeting 2020-01-15 #346

Closed

mhdawson mentioned this issue Jan 27, 2020

Node.js Foundation Diagnostics WorkGroup Meeting 2020-01-29 #351

Closed

mhdawson mentioned this issue Feb 10, 2020

Node.js Foundation Diagnostics WorkGroup Meeting 2020-02-12 #354

Closed

mhdawson mentioned this issue Feb 24, 2020

Node.js Diagnostics WorkGroup Meeting 2020-02-26 #359

Closed

mhdawson mentioned this issue Mar 9, 2020

Node.js Diagnostics WorkGroup Meeting 2020-03-11 #364

Closed

mhdawson mentioned this issue Mar 23, 2020

Node.js Diagnostics WorkGroup Meeting 2020-03-25 #367

Closed

gireeshpunathil mentioned this issue Apr 1, 2020

doc: document how to examine --trace_gc output #372

Closed

mhdawson mentioned this issue Apr 6, 2020

Node.js Diagnostics WorkGroup Meeting 2020-04-08 #374

Closed

gireeshpunathil removed the diag-agenda label Apr 18, 2020

github-actions bot added the stale label Jul 28, 2020

mmarchini added never stale and removed stale labels Jul 28, 2020

github-actions bot added the stale label Jul 27, 2022

github-actions bot closed this as completed Aug 11, 2022
