Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(observability): Vector tap subcommand #6871

Merged
merged 113 commits into from
Apr 6, 2021
Merged

Conversation

leebenson
Copy link
Member

@leebenson leebenson commented Mar 24, 2021

This PR introduces a vector tap CLI subcommand, communicating with the tap API added in #6610.

Closes epic #6518.

Usage

Running:

vector tap --help

Yields:

vector-tap 0.13.0
Observe log events from topology components

USAGE:
    vector tap [OPTIONS] [components]...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -i, --interval <interval>    Interval to sample metrics at, in milliseconds [default: 500]
    -u, --url <url>              Vector GraphQL API server endpoint
    -l, --limit <limit>          Sample log events to the provided limit [default: 100]
    -f, --format <format>        Encoding format for logs printed to screen [default: json]  [possible values:
                                 json, yaml]

ARGS:
    <components>...    Components to observe (comma-separated; accepts glob patterns) [default: *]

By default, running vector tap will sample events for all source/transform components against a locally running Vector instance (localhost:8686/graphql)

-f can be provided as either "json" or "yaml" to control the output format. #6854 will extend this to logfmt.

Multiple, comma-separated components can be supplied as glob patterns.

Example

To sample 10 events over a 1 sec interval against a remote endpoint, with all components prefixed with "web", run:

vector tap -l 5 -i 1000 -u <remote>/graphql "web*"

Which might return something like:

{"message":"From web 2","timestamp":"2021-03-24T06:57:39.776877Z"}
{"message":"from web 1","timestamp":"2021-03-24T06:57:39.976529Z"}
{"message":"From web 2","timestamp":"2021-03-24T06:57:40.176341Z"}
{"message":"from web 1","timestamp":"2021-03-24T06:57:40.376128Z"}
{"message":"From web 2","timestamp":"2021-03-24T06:57:40.576249Z"}

Reloading topology and unmatched components

The tap API is designed to persist over both configuration reloads, and when encountering unmatched components.

This allows for running a tap operation against a live Vector agent before component(s) are connected to topology, to enable a range of debugging or observability use-cases.

The tap connection will also remain active if encountering configuration errors on a subsequent load (assuming the API was correctly configured initially). This is because the new configuration won't be applied until it's in a valid state, so the previous configuration will remain active.

Notifications

The tap API in #6610 returns (NOT_)MATCHED notifications, which are intended as status messages to inform a client to update its interface accordingly if applicable.

Notifications are currently ignored in this PR, to avoid commingling with log events.

For future work, a --verbose option or equivalent could be added to add notifications to output. These could be printed in bold/styled to highlight them as notification messages.

Tracking in #6870.

GraphQL subscription client bug

The tap API works, in part, by injecting a topology watch channel into GraphQL resolvers.

I discovered a bug where if a connection_init message isn't provided by the client prior to a start operation, context wouldn't be provided. This is because async-graphql's graphql_subscription_with_data expects the initial payload to merge into context.

Since I am using async-graphql's Context.data_unchecked() to pull context without checking for its existence, this resulted in a thread panic. It doesn't bring Vector down, but does show up in console logs and leaves the subscription hanging.

I think this is unexpected behaviour on the part of async-graphql, and will file a report. But it also highlighted a missing part of the initial handshake expected of GraphQL subscriptions (led, in mostpart, by Apollo's spec in subscriptions-transport-ws).

I've added an init() method to the internal Subscription handler to ensure this is sent with an empty ({}) payload to merge with our own internal context.

Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
@leebenson
Copy link
Member Author

Noting that this PR was updated to merge in the more general outputEvents schema added in #6610.

👍 I think a flag is a good idea. I'm even happy to go the other way and default to no metadata if we think the more common case is tapping a single component. Then we could add --metadata or something to include that as well.

I like that. Since this is opt-in, I think I can get this merged and tackle as a follow-up. Tracking in #6898.

@leebenson leebenson requested review from a team and JeanMertz and removed request for a team March 29, 2021 11:26
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Base automatically changed from leebenson/tap2 to master April 6, 2021 18:35
@leebenson
Copy link
Member Author

All tests are passing, but the benchmarks seem stuck. I'll go ahead and merge this, since it's been queued for 1.5hrs already.

@leebenson leebenson merged commit 52daacd into master Apr 6, 2021
@leebenson leebenson deleted the leebenson/tap-cli branch April 6, 2021 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: api Anything related to Vector's GraphQL API domain: observability Anything related to monitoring/observing Vector type: enhancement A value-adding code change that enhances its existing functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants