-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enhancement(observability): Vector tap
, v2
#6610
Conversation
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Cargo.toml
Outdated
@@ -129,9 +129,10 @@ goauth = { version = "0.9.0", optional = true } | |||
smpl_jwt = { version = "0.6.1", optional = true } | |||
|
|||
# API | |||
async-graphql = { version = "=2.5.0", optional = true } | |||
async-graphql-warp = { version = "=2.5.0", optional = true } | |||
async-graphql = { version = "=2.4.11", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Downgraded (and stapled) to 2.4.11, to unblock a fix for subscription context. Will be able to bump to the latest when #6383 lands.
src/api/schema/log_events/mod.rs
Outdated
ctx: &'a Context<'a>, | ||
component_names: Vec<String>, | ||
#[graphql(default = 500)] interval: i32, | ||
#[graphql(default = 100, validator(IntRange(min = "1", max = "10_000")))] limit: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows for observing log events in batch sizes of up to 10,000. This feels a little large to me (it's probably quite expensive to unwind 10k records on the client, at default intervals of 0.5 seconds -- not to mention collect/serialize a JSON payload of that size in Vector) -- but this defaults to 100, and is user-definable.
We can adjust the upper bound if needed.
Signed-off-by: Lee Benson <lee@leebenson.com>
@leebenson awesome! I think I can see why you collapsed You could still define another union type that combines
|
@jszwedko - unfortunately, I don't think that'll work due to this spec constraint:
|
Ah 😢 I still think we should split them up though. So I guess we'll need to do: union Event = LogEvent
union OutputEventsPayload = LogEvent | EventNotification I'm also a fan of dropping |
Gotcha, yeah, I think it does make sense to scope the union name here to a more precise use-case than That also opens up the possibility of making
I went with I'll update 👍🏻 |
Signed-off-by: Lee Benson <lee@leebenson.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks @leebenson !
I still defer to @lukesteensen's thoughts on the topology bits, but this looks good to me!
Signed-off-by: Lee Benson <lee@leebenson.com>
I found a subtle bug at the weekend that should be fixed by ab0252b. If a configuration is changed while an active subscription is in flight, and the subscription is later closed by the client, topology would panic because the event handler went away. This didn't happen for subscriptions that were terminated before the change, or for subscriptions that didn't change over subsequent reloads. To fix this, the control channel is now cloned and a Since we're not doing anything with When a subscription is dropped, the HashMap holding the component name -> shutdown channel is dropped, triggering a shutdown of all event handlers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I hadn't really thought about letting things subscribe to topology changes like this but it seems to work nicely here.
|
||
// Random number generator to allow for sampling. Speed trumps cryptographic security here. | ||
// The RNG must be Send + Sync to use with the `select!` loop below, hence `SmallRng`. | ||
let mut rng = SmallRng::from_entropy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this block? From the docs it seems possible but I'm not sure if it's something to worry about or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs:
RNG recommended when small state, cheap initialization and good performance are required. The PRNG algorithm in SmallRng is chosen to be efficient on the current platform, without consideration for cryptography or security. The size of its state is much smaller than for StdRng
This is initialised once in an async task (each GraphQL subscription, and possibly each field, is a separate task), and is invoked a maximum of the total number of logs for the observed component(s) over a given interval, minus the provided limit
.
Running the bench for seq_iter_choose_from_1000 locally produces 5354 MB/s of usize values, which I think is ~701,759,488/sec on a 64 bit platform (macOS, 2.3 Ghz 8 core i9 in this case.)
I don't think this is anything to be concerned with, but open to alternatives -- I chose this because it purports to be fast and thread safe.
#[cfg(feature = "api")] | ||
// Pass the new config to the API server. | ||
if let Some(ref api_server) = api_server { | ||
api_server.update_config(topology.config()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to update the config here if the server is able to watch the topology?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, this would be updated to follow the same pattern. It's a fair amount of work, since metrics subscriptions are built around the notion of global state. I think this is a good follow-up issue. Tracking in #6939.
|
||
/// Returns a tap handler that listens for topology changes, and connects sinks to observe | ||
/// `LogEvent`s` when a component matches one or more of the provided patterns. | ||
async fn tap_handler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding correctly, we spawn one of these per client connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. When the subscription comes in, the process is roughly:
- The client hits the
/graphql
endpoint; the request is upgraded to a WebSocket. - RunningTopology's
watch_tx
is injected into the request context, persisting for the life of the request. - The subscription (and possibly the individual field; in this case, output_events) is resolved asynchronously.
- In the body of the output_events resolver, create_events_stream is called, returning an
impl Stream<Item = Vec<OutputEventsPayload>>
- Inside create_events_stream, an event is spawned, calling TapSink::new, which spawns the tap_handler
- Each tap_handler receives a pattern of components to match against, and selects over RunningTopology.watch to get recent/updated config.
- When a component matches, a sink is connected to the output ControlChannel, which spawns an event handler, returning the sink and a
oneshot::Sender
to trigger event shutdown. The shutdown channel is stored along with the component name in a HashMap. LogEvent
s fan-in to the subscription's event_rx, collecting events from component(s).- Events are sampled per the reservoir sampling strategy here, randomly evicting entries from a subscription cache.
- Each
interval
, results are drained and emitted toevent_rx
. - Results stream back to the client over the active WebSocket.
If event_rx
go away (either explicitly with a "stop" WS frame from the client, or the WS connection itself going away), the stream is terminated which drops each of the event handlers along the way.
Likewise, if topology changes, sinks are reconnected. The process of overwriting the sink cache replaces the existing sink cache, which implicitly drops the oneshot::Sender
, dropping the event handler for the previous sink. This is in lieu of reading ConfigDiff to determine which are "new" components with the same name, and which are unchanged.
It's a little complicated to follow, but the end result is a fan-in pattern with results streamed back to the client, while also cleaning up when the client goes away or topology changes, with minimal touch points with RunningTopology.
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
Signed-off-by: Lee Benson <lee@leebenson.com>
@lukesteensen, TapSink now implements |
Signed-off-by: Lee Benson <lee@leebenson.com>
I've updated this PR to work alongside the tokio 1.x upgrade, and refactored the tap sink to be non-blocking. |
@lukesteensen, this is merged after refactoring |
This PR is a retake of the API for Vector 'tap'.
It follows on from #6363 by attempting to reimplement the majority of the tap sink outside of topology.
Closes #3212.
Closes #6391.
Closes #6409.
Closes #6635.
Premise
This PR enables observability into the log events that pass through topology, via the GraphQL API.
The purpose of which is to support an (upcoming) CLI command and a web UI to allow a user to 'tap' into the stream of events that are being processed live.
Example query
Assuming this config:
This query:
Initially produces:
The subscription will remain open until the client explicitly terminates it (or the WebSocket connection goes away.)
If the config is then updated to:
Then the next payload may look something like this:
Notes:
When a
componentNames
value is matched, aLogEventNotification
is returned with anotification.MATCHED
value. This is intended to signal to the client that a component has been 'found', and therefore update the user interface in some way to indicate that fact.The
LogEvent
itself contains the payloads of log events.If a component goes away, it returns the
notification.NOT_MATCHED
value initially seen.LogEvent
A log event is comprised of the following fields:
message
andtimestamp
are opinionated fields based on the assumption that they will be relatively common in the event data.json
andyaml
output text strings encoded to those respective formats.TODO
Later work
LogEvent
s #6390)fields
bag toLogEvent
API type #6392, Add dynamic field look-up toLogEvent
GraphQL type #6393)