Diagnostics "Best Practices" Guide? #211

Closed
mike-kaufman opened this issue Jul 6, 2018 · 33 comments

Comments

@mike-kaufman
Contributor

Something that would be interesting & valuable to the community is if the WG could come up with a set of diagnostics best practices/techniques for production applications. I'm a little concerned that this is a bit too ambitious, and a little concerned that one person's "best practice" is another's "really dumb thing to do". But I'm willing to throw this out to see what comes out of it. :)

I have a few thoughts about how this could be structured, but would love to hear if anyone else has ideas/suggestions first.

@joyeecheung
Member

It would be great if we could come up with even just a guide with links to good resources and advertise it, so people know what to read when they run into different types of problems in production. We always receive bug reports with obscure screenshots of resource usages in the core repo, and documentation like that would be a good place to redirect people to instead of nodejs/help.

@mhdawson
Member

It would be great if we can put this together.

@mike-kaufman
Contributor Author

into different types of problems in production. We always receive bug reports with obscure screenshots of resource usages in the core repo,

@joyeecheung, can you point me to some examples here?

@joyeecheung
Member

@mike-kaufman There should be a lot of hits if you search for the issues labeled memory: https://github.com/nodejs/node/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3Amemory+

@gireeshpunathil
Member

I love this idea and support it, and I am willing to participate and contribute in whatever ways this group needs and I can.

While I agree that best practices can be subjective, some diagnostic steps are impersonal and to the point. An example would be collecting and analysing heap dumps to track down a memory leak.

Regarding the structure, I see we can have different flows such as:

  • organized based on symptoms [ memory, exception, performance, hang, crash ... ]
  • organized based on tools [ tracer, profiler, dumper, inspector, debugger, reporter ... ]
  • organized based on subsystems [ installer, net, fs, child_process, natives, uv, ... ]

I suggest the symptom based categorization as it leads to faster discovery for consumers.

Regarding the best practice content, again I see a few models:

  • dump of complete diagnostic steps on a failing case, with example code
  • screen-shot assisted illustration of diagnostic methodology
  • plain text elaboration of tools and their usage

I suggest the first model as it leads to improved education for consumers.
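
As a rough illustration of the "collecting heap dumps" step mentioned above, here is a minimal sketch. It assumes a Node.js release that ships `v8.writeHeapSnapshot()` (added in releases newer than this thread); the helper names `captureHeapSnapshot` and `snapshotOnSignal`, the output directory, and the choice of `SIGUSR2` are hypothetical and not something the thread prescribes.

```javascript
'use strict';
const os = require('os');
const path = require('path');
const v8 = require('v8');

// Hypothetical helper (not from the thread): writes a heap snapshot into `dir`
// and returns the path of the generated file.
function captureHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  // v8.writeHeapSnapshot() is synchronous: it blocks the event loop while the
  // snapshot is being written, which matters on a busy production process.
  return v8.writeHeapSnapshot(file);
}

// Hypothetical trigger: take a snapshot whenever the process receives `signal`.
// The signal choice is an assumption; pick one your platform and tooling allow.
function snapshotOnSignal(signal = 'SIGUSR2') {
  process.on(signal, () => {
    const file = captureHeapSnapshot();
    console.error(`heap snapshot written to ${file}`);
  });
}
```

Two snapshots taken some time apart can then be loaded into the Memory tab of Chrome DevTools and compared to see which objects keep accumulating.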

@mike-kaufman
Contributor Author

mike-kaufman commented Jul 18, 2018

@joyeecheung, @gireeshpunathil thanks. Let me strawman an outline and see if this matches what's in your head - feel free to tear it down. :)

  • Production Configuration Best Practices
    • application logging & log management
      • breakouts for different app types
        • server apps
        • serverless
        • desktop
        • IOT
    • APMs
    • Node Internals Tracing
      • configuration
      • interpretation
  • Troubleshooting
    • Profiling Perf Problems
    • Analyzing Memory Leaks
    • Analyzing Core Dumps

@gireeshpunathil
Member

Thanks @mike-kaufman. While the troubleshooting part is straightforward for me to relate to, the first part (production configuration best practices) looks much wider in scope to me:

  • is it possible to suggest configurations for such wide deployment scenarios?
  • even with say server apps, is it possible to generalize configurations?
  • is it possible to propose configurations independent of execution environment of such app types?

@mmarchini
Contributor

I gave a talk on NodeSummit about this topic, where I showed 6 tools suited for production environments. Even though the topic is subjective, I don’t think there’s much disagreement on which tools or techniques should be used (our current pool of production tools is not that large).

The tools I showed alongside the examples are available here if anyone is interested. I would like to help to write these guides :)

@gireeshpunathil
Member

I had a discussion with @mike-kaufman and the consensus was to start with a draft and iterate over PRs and refine through collective intelligence.

So let us start with, say, "Profiling Perf Problems", and then the others can follow the same structure. I will start looking at dump debugging.

@mike-kaufman
Contributor Author

@mmarchini - that's a great start :)

Thanks Gireesh. I'd like to ultimately get the content written to leverage github's auto-html-site feature - i.e., we'd be able to submit markdown updates to this repo, and they would be automatically rendered to html available at http://nodejs.github.io/diagnostics/bestPractices/.

@mmarchini
Contributor

@mike-kaufman auto-generating a website is an awesome idea! But maybe we should try to coordinate with @nodejs/website to have this content available in https://nodejs.org/ as well?

@mike-kaufman
Contributor Author

@nodejs/website to have this content available in https://nodejs.org/ as well?

Yes, this would be good. I think as the first step though, we can start getting the content organized, and the github.io "auto-magic-web-site" is a really simple & cheap way for us to get that content rendered & reviewable, w/out the distraction of how we plug into their process.

@bnb - is there any thinking on how we can plug content into the new website? Ideally, we'd have a bunch of markdown & images here (in the diag repo), and this would just get "sucked up" into the website.

@fhemberger

Please get in touch with @nodejs/website-redesign, as we are in the middle of planning the content structure for the website relaunch:

https://github.com/nodejs/website-redesign/issues/

@misterdjules

For what it's worth, a long time ago I had written a set of guides to investigate various types of production issues with Node.js. @cjihrig kindly made those guides available publicly at https://github.com/joyent/node-debugging-methodologies. The content is mostly specific to SmartOS but the methodologies/concepts can almost always easily be transferred to other OSes.

@mike-kaufman
Contributor Author

@misterdjules - thanks, this is great!

@keywordnew

There's a biweekly meeting for the website-redesign initiative, and the next one is on Thursday Aug 16th, 15:00 UTC.

Anyone here is welcome to attend. You're welcome to propose adding items to the agenda at the start of the meeting 👍🏽 If you wanted to know where this Diagnostics best practices guide could fit in, that'd be a good place to get direct comments.

Otherwise, please make an issue to get the discussion going!

The initiative is also working on creating guides for many things in Node.js. I don't have context for this diagnostics discussion, but it's possible a diagnostics guide could be part of the "analytics" or "ops" categories. Feel free to correct me here or on that issue :)

@mhdawson
Member

We should incorporate https://github.com/naugtur/node-diagnostics-howtos as well. @naugtur has been working to get some of these on the website. See nodejs/nodejs.org#1444

@naugtur
Contributor

naugtur commented Sep 22, 2018

I'm interested in contributing to this guide. It's likely the flame graph one will finally come through and I'll choose something simpler for the next one :)

I like the idea of organizing by symptoms.
Also, basic best practices should say what to switch on in production to even get the chance to diagnose anything (like profiling command-line switches or enabling core dumps at the OS level).
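
To make the "what to switch on" point concrete, here is a small sketch of a startup check that reports which diagnostic switches a process was launched without. The flag list is only an assumption about what a team might want (`--abort-on-uncaught-exception` so a crash can produce a core dump, `--report-on-fatalerror` on releases that include diagnostic reports, and `--prof` for V8 profiling); the function name `missingDiagnosticFlags` is hypothetical, and the check does not normalize underscore spellings of flags. Core dump limits themselves are an OS-level setting (for example `ulimit -c unlimited` in the launching shell) and are outside what this script can verify.

```javascript
'use strict';

// Assumed wish list of diagnostic switches; adjust to your own policy.
const DIAGNOSTIC_FLAGS = [
  '--abort-on-uncaught-exception',
  '--report-on-fatalerror',
  '--prof',
];

// Hypothetical helper: returns the flags from DIAGNOSTIC_FLAGS that appear
// neither on the command line (execArgv) nor in NODE_OPTIONS.
function missingDiagnosticFlags(
  execArgv = process.execArgv,
  nodeOptions = process.env.NODE_OPTIONS || ''
) {
  const fromEnv = nodeOptions.split(/\s+/).filter(Boolean);
  // Drop any "=value" suffix so "--flag=value" still counts as "--flag".
  const active = new Set([...execArgv, ...fromEnv].map((arg) => arg.split('=')[0]));
  return DIAGNOSTIC_FLAGS.filter((flag) => !active.has(flag));
}
```

A service could log the result once at startup, so that a missing switch is noticed before an incident rather than during one.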

@Fishrock123
Contributor

I think we should aim this for content as part of the new website redesign.

organized based on symptoms [ memory, exception, performance, hang, crash ... ]

IMO this is probably the most valuable for most users.

@bnb

bnb commented Oct 12, 2018

100% agreed.

@amiller-gh this is probably one you'll want to see 😄

@amiller-gh
Member

amiller-gh commented Oct 13, 2018 via email

@gireeshpunathil
Member

As discussed and decided in the just-concluded workgroup meeting (and probably in the last couple of meetings), I plan to set up one meeting (or more, until convergence) to define the next steps on this.

The goal is to gather consensus on the content and to identify some prioritization.

Does 21st Nov 9.30 AM PST work for everyone? Please 👍 if you will be able to make it and the time suits you, thanks!

@gireeshpunathil
Member

Thanks @mhdawson and @mmarchini for expressing availability. However, given the lack of a reasonable number of responses, I think we cannot hold this meeting tomorrow and will have to move it to a later point in time.

Instead of me prescribing an alternative time, @nodejs/diagnostics - will you please express your availability within the next 10 days or so? Based on that I can schedule one. Thanks in advance!

@gireeshpunathil
Member

Removed from the WG meeting agenda, as discussed in the last meeting (rationale: the work is progressing as part of the user journey deep dives and the subsequent documentation work. If the doc work stalls, we can always re-insert this to regain focus).

@mmarchini
Contributor

As a reminder, tomorrow we're meeting (same time as always) to discuss diagnostics on CPU usage.

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
