New `top` subcommand v1 #3211

binarylogic · 2020-07-26T17:02:59Z

As part of https://github.com/timberio/vector-product/issues/24, we want to introduce lightweight CLI-based observability for Vector.

Goals

In v3 of the "Observe & monitor Vector" series, we will be introducing a Web UI. We'd like for the work here to unblock that project.
Provide Vector operators with CLI-based observability. This is useful in situations where an operator will not have access to a browser, such as SSH'ing onto a remote host that Vector is running on.

Out of scope

As noted in https://github.com/timberio/vector-product/issues/24:

Fancy CLI-based graphs are out of scope for this project. As much as my inner-nerd wants to do this, all of the presentation details are a distraction from the purpose of this project.
Full-blown metrics querying is likely out of scope for this project. We will be doing this in the future, but we'd like to use this project to learn more about the UI requirements. I use the term "likely" because it is possible that the beginnings of a query syntax might be easier.

Proposal

I propose that we introduce a vector top subcommand, taking inspiration from the glorious top command. I like this because:

It's a familiar tool to anyone that has used the command line before.
It's clear that this provides current, real-time insight (not historical).
The requirements align tightly with the upcoming UI requirements.

Examples

To demonstrate what I'm thinking:

$ vector help top

USAGE:
    vector top [OPTIONS]

OPTIONS
    --refresh-rate      How often the screen refreshes (default 500ms)
    --resolution        Determines the window size for each value (default 500ms)
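To make the `--resolution` option concrete, here is a minimal sketch of how a value could be computed over a sliding window from a cumulative counter. The `WindowedRate` type and its methods are hypothetical names for illustration and are not part of Vector. Timestamps are passed in as `Duration` offsets so the logic stays independent of the wall clock.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical helper: turns samples of a cumulative counter into a
/// per-second rate over a sliding window (the `--resolution` value).
struct WindowedRate {
    window: Duration,
    /// (time offset since start, cumulative counter value)
    samples: VecDeque<(Duration, u64)>,
}

impl WindowedRate {
    fn new(window: Duration) -> Self {
        Self { window, samples: VecDeque::new() }
    }

    /// Record a new cumulative counter reading taken at offset `at`.
    fn record(&mut self, at: Duration, total: u64) {
        self.samples.push_back((at, total));
        // Drop the oldest sample while the next one still covers the full window.
        while self.samples.len() > 2
            && at.saturating_sub(self.samples[1].0) >= self.window
        {
            self.samples.pop_front();
        }
    }

    /// Rate per second between the oldest and newest retained samples.
    /// Returns `None` until two samples with distinct timestamps exist.
    fn per_second(&self) -> Option<f64> {
        let (t0, c0) = *self.samples.front()?;
        let (t1, c1) = *self.samples.back()?;
        let elapsed = t1.checked_sub(t0)?.as_secs_f64();
        if elapsed == 0.0 {
            return None;
        }
        Some(c1.saturating_sub(c0) as f64 / elapsed)
    }
}
```

Under this sketch, `--refresh-rate` would only control how often `per_second()` is read and redrawn, while `--resolution` controls how much history each displayed value averages over.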

And usage is simple:

$ vector top
ID              KIND       TYPE         THRPT   I/O      LATENCY    ERRORS
my-file-source  source     file         5.2k    10.6MiB  10.2ns     251
my-json-parser  transform  json_parser  5.0k    -        1.2ms      523
my-s3-sink      sink       s3           4.5k    5.2MiB   10.2s      12
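A rough sketch of how one row of the table above could be rendered, assuming a hypothetical `Row` struct and column widths chosen only to match the example. Only a subset of the columns is shown, and none of these names come from Vector itself.

```rust
/// Hypothetical row model for the example table (subset of columns).
struct Row<'a> {
    id: &'a str,
    kind: &'a str,
    component_type: &'a str,
    events_per_sec: f64,
    errors: u64,
}

/// Shorten a throughput figure the way the example does (5200 -> "5.2k").
fn humanize(n: f64) -> String {
    if n >= 1_000_000.0 {
        format!("{:.1}M", n / 1_000_000.0)
    } else if n >= 1_000.0 {
        format!("{:.1}k", n / 1_000.0)
    } else {
        format!("{:.0}", n)
    }
}

/// Left-align each column to a fixed width so rows line up under the header.
fn render_row(row: &Row) -> String {
    format!(
        "{:<15} {:<10} {:<12} {:<7} {}",
        row.id,
        row.kind,
        row.component_type,
        humanize(row.events_per_sec),
        row.errors
    )
}
```

Fixed widths keep the sketch short; a real implementation would likely size columns from the longest ID in the topology.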

I am very much open to suggestions/changes here. I want to start simple, but not in a way that will require rework in the future.

Outstanding questions

Do we want to show host resource usage? It would be nice to communicate Vector's CPU, memory, disk, and network usage as gauges. This will be needed for the Web UI.
Can we communicate resource usage on a per-component basis? I'm assuming not, but that would be very useful.
How can we communicate backpressure clearly? Backpressure detection #892 touches on this.
How about network errors (retries, failed transmissions, etc)? In my example above I have all of this bucketed under a generic "errors" column, but I'm not sure that's the most helpful.
Finally, the big one: how do we communicate between the Vector binary and a live running Vector instance? This should be done in a way that will unblock the Web UI.

Future concerns

Future unknowns. It is very likely we'll want to add/remove/change the data as we progress, and it should be easy to do so. If it is not easy, we should consider a syntax that makes it easy to query data.


leebenson · 2020-10-06T10:27:26Z

@binarylogic - I'm wondering if vector top should default to showing a snapshot of the stats, and then exit... and instead, we could have an explicit -f to 'follow' the stats that auto-update on the supplied/default --refresh-interval?

I think the common case of top will be to get an at-a-glance view of topology + stats. An explicit -f would separate out the behavior of the console prompt not returning.

binarylogic · 2020-10-06T15:17:49Z

@leebenson I don't think so. When I think top, I think about a persistent updating interface. If we want to print stats we can offer a vector stats command. That'll likely print different output, take a window argument to get averages, etc.

leebenson · 2020-10-06T15:51:37Z

Got it, thanks for clarifying 👍

leebenson · 2020-10-07T18:45:58Z

@binarylogic - I think I'm hitting a wall with what we're able to currently show in the console. I chatted about this a little earlier with @jamtur01. Interested to get your thoughts.

Blockers:

We don't currently collect internal events/metrics against an individual source. We emit structs such as GeneratorEventProcessed to determine the type of event, but not where it happened. To aggregate stats/metrics by source, I think we'd need to modify all emit! paths to take an ID of the source/sink. This should be relatively straightforward, since SourceConfig.build() already takes a name: str -- so it should (mostly) just be a case of passing that through to the inner methods. There may be code paths where we don't have the context. I need to dig in further.
Collecting stats using get_controller(), by extension, also lacks topology context. We can collect eventsProcessed or bytesProcessed-- but we don't know where they came from. Some work may be required to further split stats by ID.
I can't see any obvious groundwork for some of the stats shown as examples in the task description. The only results for "latency" are in tests against certain sources. I'm only just getting acquainted with internal events, so I may have missed something, but I don't see any internal concepts for throughput, latency, etc. Is work here ongoing?
There are specific events such as PrometheusParseError and PrometheusErrorResponse, but it's not clear how these should be aggregated to determine an 'errors' stat, or how we'd host these specific stats in a table where for other rows, these may not be applicable.

Based on the above, I think there are a couple of potential next steps:

Attempt to augment existing stats with an ID, and pull them out based on that same ID -- to retain a similar layout to the task description. We're still missing the example columns, but we should be able to pull out obvious stats like events/bytes processed, and a few others.
Defer aggregation by ID, and just dump out the high-level stats. The console can still update with new data -- but it's not related to any individual source/sink.

What do you think?

binarylogic · 2020-10-07T19:11:15Z

@leebenson

We don't currently collect internal events/metrics against an individual source. We emit structs such as GeneratorEventProcessed to determine the type of event, but not where it happened.

#4181 should include all span context as metrics tags. This includes component_kind, component_type and component_id.

The only results for "latency" are in tests against certain sources.

Yep, let's defer this column for now. I opened #3445 and never got a response. My hope is that we can get certain metrics for free, like how long an event spent in a component.

it's not clear how these should be aggregated to determine an 'errors' stat

We have a processing_errors metric that tracks this.

Let me know if that helps. We'll get #4181 merged shortly.
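As an illustration of the tag-based approach, here is a minimal sketch of grouping metric samples by their `component_id` tag to get per-component rows. The `Sample` and `ComponentStats` types, the `sample` helper, and the `events_processed` metric name are assumptions made for this example and do not reflect Vector's actual internal types. Only the `component_id` tag and the `processing_errors` metric name come from the discussion above.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened metric sample: a name, its tags, and a counter value.
struct Sample {
    name: String,
    tags: BTreeMap<String, String>,
    value: u64,
}

/// Hypothetical per-component totals, one per row in `vector top`.
#[derive(Debug, Default, PartialEq)]
struct ComponentStats {
    events_processed: u64,
    processing_errors: u64,
}

/// Convenience constructor used to build samples in examples.
fn sample(name: &str, component_id: Option<&str>, value: u64) -> Sample {
    let mut tags = BTreeMap::new();
    if let Some(id) = component_id {
        tags.insert("component_id".to_string(), id.to_string());
    }
    Sample { name: name.to_string(), tags, value }
}

/// Sum counters per `component_id`; samples without that tag are skipped.
fn aggregate_by_component(samples: &[Sample]) -> BTreeMap<String, ComponentStats> {
    let mut out: BTreeMap<String, ComponentStats> = BTreeMap::new();
    for s in samples {
        let id = match s.tags.get("component_id") {
            Some(id) => id,
            None => continue,
        };
        let stats = out.entry(id.clone()).or_default();
        match s.name.as_str() {
            "events_processed" => stats.events_processed += s.value,
            "processing_errors" => stats.processing_errors += s.value,
            _ => {} // other metrics are ignored in this sketch
        }
    }
    out
}
```

Component-specific error events would then only need to increment the shared `processing_errors` counter with the right tags in order to appear in a generic errors column.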

leebenson · 2020-10-07T19:16:51Z

Thanks, #4181 should help a lot.

leebenson · 2020-10-16T08:42:55Z

Closing and removing the points estimate on this, since this is now partially implemented and being tracked more specifically across multiple issues.

binarylogic added this to the 2020.08.03 - Observability Bonanza milestone Jul 26, 2020

binarylogic changed the title ~~New top subcommand v1 RFC~~ New top subcommand v1 Jul 26, 2020

binarylogic mentioned this issue Jul 27, 2020

Real-time communication R&D and POC #3225

Closed

4 tasks

binarylogic assigned leebenson Jul 28, 2020

binarylogic added domain: data model Anything related to Vector's internal data model domain: metrics Anything related to Vector's metrics events and removed event type: metric labels Aug 6, 2020

binarylogic modified the milestones: 2020.08.03 - Observability Bonanza, 2020.08.17 - On The Road Again Aug 17, 2020

This was referenced Aug 19, 2020

Run internal metrics through the Vector pipeline #230

Closed

feat(observability): API health checks + GraphQL #3514

Closed

jamtur01 modified the milestones: 2020.08.17 - On The Road Again, 2020-08-31 - Digitization Laser Sep 1, 2020

jamtur01 added this to the 2020-08-31 - Digitization Laser milestone Sep 10, 2020

binarylogic modified the milestones: 2020-08-31 - Digitization Laser, 2020-09-14 - The Grid Sep 14, 2020

leebenson modified the milestones: 2020-09-14 - The Grid, 2020-09-28 - Derezzed Sep 24, 2020

leebenson mentioned this issue Sep 28, 2020

GraphQL schema for topology #4172

Closed

leebenson removed the needs: rfc Needs an RFC before work can begin. label Sep 28, 2020

leebenson mentioned this issue Sep 28, 2020

Observing Config in both RunningTopology and API schema #4173

Closed

leebenson mentioned this issue Oct 5, 2020

enhancement(observability): Bidirectional source/transform/sink GraphQL types #4383

Merged

leebenson mentioned this issue Oct 7, 2020

feat(observability): vector top, v1 #4431

Merged

binarylogic mentioned this issue Oct 7, 2020

chore(metrics): Switch from literals at emit! to span context #4181

Merged

leebenson mentioned this issue Oct 9, 2020

Unify span context when components are nested #4454

Closed

leebenson modified the milestones: 2020-09-28 - Derezzed, 2020-10-12: Son of Flynn Oct 12, 2020

leebenson closed this as completed Oct 16, 2020
