chore: add config composition RFC #4427

Merged
merged 4 commits into from
Oct 23, 2020
101 changes: 101 additions & 0 deletions rfcs/2020-10-06-3791-composing-components-pt-1.md
@@ -0,0 +1,101 @@
# RFC 3791 - 2020-10-06 - Composing Components: Part 1

Vector is designed to be very modular, and the current tool for composing those
modules is the TOML config file. This gives users a great deal of flexibility,
but the resulting configurations can be verbose and demand more of users than
pre-built, purpose-specific solutions do.

One way that Vector could get some of the best of both worlds would be to make
it easy to create pre-built "chunks" of config that users could configure as
normal components. These would be bundles of lower-level components wired
together with adjusted default values for the specific use case.

## Scope

This RFC focuses on enabling rapid development of "composed" sources (e.g. NGINX
logs) within our existing architecture. A more complete solution for composing
arbitrary components is deferred to a later RFC.

## Motivation

We need a way to quickly assemble Vector components that address specific use
cases. This will allow us to improve ease of use without spending significant
development time on each individual use case. It will allow us to focus
development time on reusable components without forcing users to do the work of
assembling them from scratch.

## Internal Proposal

There are multiple levels at which we could implement this type of
functionality:

1. Manually implement a new component as a config facade over one existing
   component
2. Manually implement a new component as a config facade over one source and one
   codec transform
3. Manually implement a new component as a config expanding to an arbitrary
   pipeline of components
4. Automatically derive a new component from data describing an arbitrary
   pipeline of components

We are currently at level (1), where we can do things like implement the Humio
sink as a wrapper around the existing Splunk HEC sink.
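
As a rough illustration of level (1), a config facade can be a small struct that translates into the wrapped component's config. The sketch below is a simplification under stated assumptions: `HecSinkConfig`, `HumioFacadeConfig`, their fields, and the default endpoint are hypothetical stand-ins, not Vector's actual types or values.

```rust
// Hypothetical, simplified types: these are NOT Vector's real config structs.

/// Config for the existing, general-purpose component being wrapped.
#[derive(Debug, Clone, PartialEq)]
pub struct HecSinkConfig {
    pub endpoint: String,
    pub token: String,
    pub compression: bool,
}

/// Facade config: exposes only the options the specific use case needs.
#[derive(Debug, Clone, PartialEq)]
pub struct HumioFacadeConfig {
    pub token: String,
    /// Optional override; the facade supplies a use-case-specific default.
    pub endpoint: Option<String>,
}

// Placeholder default, chosen for illustration only.
const DEFAULT_ENDPOINT: &str = "https://humio.example.com";

impl HumioFacadeConfig {
    /// Translate the facade into the wrapped component's config,
    /// filling in adjusted defaults along the way.
    pub fn to_inner(&self) -> HecSinkConfig {
        HecSinkConfig {
            endpoint: self
                .endpoint
                .clone()
                .unwrap_or_else(|| DEFAULT_ENDPOINT.to_string()),
            token: self.token.clone(),
            // The facade fixes this option instead of exposing it to users.
            compression: true,
        }
    }
}
```

Building the facade then amounts to calling `to_inner()` and delegating to the wrapped component's own build step, so no new runtime behavior is introduced.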

The next simplest is level (2). While it's not implemented yet, we do have
existing plans to introduce the idea of a codec attached to sources. This would
allow users to directly configure how to parse the incoming data as part of the
source config itself. With that feature implemented, it would be relatively
straightforward to do something similar to level (1) but expanding to both
a source and an included codec.
Contributor

We should also consider sinks when discussing codecs. :) Seems like that work is left for a pt 2 though?

Member Author

Yes, codecs are basically a way for users to do something similar manually. Definitely related, but achieves a different goal.


Level (3) becomes more complicated. We currently have a limited ability for
transforms to expand to multiple transforms via `TransformConfig::expand`, and
this could theoretically be generalized to include sources and sinks as well.
The main problem is that this does not mesh well with the config traits as they
currently exist and the API can be confusing. To do this properly would likely
involve deeper changes to the config traits to better support this kind of
staged building.
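
To make the idea of a generalized expansion stage more concrete, here is a minimal sketch that runs before any component is built. It assumes a hypothetical `ComponentConfig` enum and an `expand_all` helper; neither mirrors the real signature of `TransformConfig::expand`, and actual wiring between components is left out.

```rust
/// Hypothetical config node: either a concrete component (identified here
/// only by name) or a composite that expands to a pipeline of other configs.
#[derive(Debug, Clone, PartialEq)]
pub enum ComponentConfig {
    Concrete(String),
    Composite(Vec<ComponentConfig>),
}

impl ComponentConfig {
    /// Returns the inner pipeline if this config expands, `None` otherwise.
    pub fn expand(&self) -> Option<Vec<ComponentConfig>> {
        match self {
            ComponentConfig::Concrete(_) => None,
            ComponentConfig::Composite(inner) => Some(inner.clone()),
        }
    }
}

/// Expansion stage: recursively flattens composites into the ordered list
/// of concrete component names that a later build stage would construct.
pub fn expand_all(configs: &[ComponentConfig]) -> Vec<String> {
    let mut out = Vec::new();
    for config in configs {
        match config.expand() {
            // Composite: expand its inner pipeline, which may nest further.
            Some(inner) => out.extend(expand_all(&inner)),
            // Concrete: keep as-is.
            None => {
                if let ComponentConfig::Concrete(name) = config {
                    out.push(name.clone());
                }
            }
        }
    }
    out
}
```

Separating expansion from building in this way is what makes the stage applicable to sources and sinks as well as transforms, since the build step only ever sees concrete components.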
Comment on lines +53 to +56
Member

@bruceg Oct 20, 2020

I think this may entail a layer between configuration and components. I believe this is similar to what has been discussed with a config "compiler".

Member Author

Yep, exactly.


Finally, level (4) would allow defining these compositions via TOML instead of
Rust code. This is somewhat similar to the idea of snippets that has been
floated previously, but with a few key differences. The main one is that they
would be built directly into Vector at compile time instead of loaded at
runtime. This means they would need to be integrated into our build process and
changing them would require recompiling Vector. They would also require
a sufficiently general composition API to be exposed via TOML, which would be
difficult to come up with for such a wide variety of potential pipelines. For
these two reasons, I doubt that level (4) is worthwhile right now (this could
change when/if we have more data-driven config definition in general).

My proposal is that we initially focus on level (2) while collecting data on use
Contributor

I feel like we have cases of 3 already, in that some sinks are a codec-ish -> batchedhttp -> http.

cases that require level (3). I assume that most of these composed components
will resemble the NGINX source example: we will want to combine an existing
source (file) with an existing transform (regex or grok parser) and provide
NGINX-specific default values for each. Focusing on these simpler cases will
dramatically reduce the complexity we need to add before we can reap the value.
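
The NGINX case could look roughly like the following sketch, in which one user-facing config expands to a source config plus a parser config with use-case-specific defaults. All type names, field names, the default path, and the pattern are illustrative assumptions rather than Vector's real `file` source or regex parser options.

```rust
// Hypothetical, simplified configs; the field names are not Vector's real ones.

#[derive(Debug, Clone, PartialEq)]
pub struct FileSourceConfig {
    pub include: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RegexParserConfig {
    pub pattern: String,
}

/// User-facing composed source: every field is optional because the
/// component exists to supply NGINX-specific defaults.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct NginxLogsConfig {
    pub include: Option<Vec<String>>,
}

// Illustrative defaults only; a real implementation would need to match
// the log format actually configured in NGINX.
const DEFAULT_INCLUDE: &str = "/var/log/nginx/access.log";
const DEFAULT_PATTERN: &str =
    r#"^(?P<remote_addr>\S+) .* "(?P<request>[^"]*)" (?P<status>\d{3})"#;

impl NginxLogsConfig {
    /// Expand the facade into one source config and one parser config.
    pub fn expand(&self) -> (FileSourceConfig, RegexParserConfig) {
        let include = self
            .include
            .clone()
            .unwrap_or_else(|| vec![DEFAULT_INCLUDE.to_string()]);
        (
            FileSourceConfig { include },
            RegexParserConfig {
                pattern: DEFAULT_PATTERN.to_string(),
            },
        )
    }
}
```

With this shape, a user who accepts the defaults configures nothing beyond the component type, while overriding `include` still reuses the bundled parser.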

## Rationale

This set of changes unblocks the most user-facing value with the least required
investment, and it does so without compromising future plans for deeper
architectural changes.

## Plan of Attack

- [ ] Implement `TransformFn` from the [Architecture
RFC](https://github.com/timberio/vector/blob/master/rfcs/2020-06-18-2625-architecture-revisit.md),
switch non-task transforms to it
- [ ] Add `Vec<Box<dyn TransformFn>>` field to `Pipeline`
- [ ] Implement composed sources as facades that prepend the relevant `TransformFn`
to the `Pipeline` passed to `SourceConfig::build`
- [ ] Move `event_processed` internal events to topology wrappers instead of
Contributor

Yay!

components themselves to avoid double counting or incorrect tagging (likely
within `impl Transform for TransformFn` for now)
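
The first three checklist items can be sketched as follows. This is a minimal model, not the design from the Architecture RFC: the `TransformFn` signature, the `Event` alias, the `Pipeline` fields, and the example transforms are all assumptions made for illustration, and the real `Pipeline` forwards events over a channel rather than into a `Vec`.

```rust
/// Stand-in for Vector's event type (assumption for this sketch).
pub type Event = String;

/// Assumed shape of a synchronous, non-task transform: it may emit zero or
/// more events into `output` for each input event.
pub trait TransformFn {
    fn transform(&self, output: &mut Vec<Event>, event: Event);
}

/// Simplified pipeline handed to a source. `sent` stands in for the
/// downstream channel so the behavior can be inspected.
pub struct Pipeline {
    pub inlines: Vec<Box<dyn TransformFn>>,
    pub sent: Vec<Event>,
}

impl Pipeline {
    pub fn new() -> Self {
        Pipeline { inlines: Vec::new(), sent: Vec::new() }
    }

    /// Put a transform in front of any already attached to the pipeline.
    pub fn prepend(&mut self, transform: Box<dyn TransformFn>) {
        self.inlines.insert(0, transform);
    }

    /// Run the event through every inline transform in order, then forward
    /// whatever survives.
    pub fn send(&mut self, event: Event) {
        let mut current = vec![event];
        for transform in &self.inlines {
            let mut next = Vec::new();
            for e in current {
                transform.transform(&mut next, e);
            }
            current = next;
        }
        self.sent.extend(current);
    }
}

/// Example transforms (hypothetical).
pub struct Uppercase;
impl TransformFn for Uppercase {
    fn transform(&self, output: &mut Vec<Event>, event: Event) {
        output.push(event.to_uppercase());
    }
}

pub struct DropEmpty;
impl TransformFn for DropEmpty {
    fn transform(&self, output: &mut Vec<Event>, event: Event) {
        if !event.is_empty() {
            output.push(event);
        }
    }
}

/// A composed source acts as a facade: it prepends its own transform to the
/// pipeline before passing it on to the wrapped source's build step.
pub fn build_composed_source(mut pipeline: Pipeline) -> Pipeline {
    pipeline.prepend(Box::new(DropEmpty));
    pipeline
}
```

Because the transforms run inline inside `send`, the wrapped source needs no changes; it keeps sending raw events and the composed behavior is applied on the way out.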

Then later we can choose to push towards level (3) as needed:

- [ ] Make `TransformConfig::expand` into a first-class stage, splitting the
  existing config `build` methods
- [ ] Allow the new expansion stage to work for all components, not just transforms
- [ ] Consider introducing more fine-grained internal component types designed
  to be composed into user-facing sources, transforms, and sinks