From e8781b7aeef49ac9b7a47249c228b4a5b03df4dc Mon Sep 17 00:00:00 2001
From: Luke Steensen <luke.steensen@gmail.com>
Date: Fri, 23 Oct 2020 15:54:50 -0500
Subject: [PATCH] chore: add config composition RFC (#4427)

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Co-authored-by: Ana Hobden <operator@hoverbear.org>
---
 ...20-10-06-3791-composing-components-pt-1.md | 101 ++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 rfcs/2020-10-06-3791-composing-components-pt-1.md

diff --git a/rfcs/2020-10-06-3791-composing-components-pt-1.md b/rfcs/2020-10-06-3791-composing-components-pt-1.md
new file mode 100644
index 0000000000000..b08d0b09c6885
--- /dev/null
+++ b/rfcs/2020-10-06-3791-composing-components-pt-1.md
@@ -0,0 +1,101 @@
+# RFC 3791 - 2020-10-06 - Composing Components: Part 1
+
+Vector is designed to be very modular, and the current tool for composing those
+modules is the TOML config file. This gives users a great deal of flexibility,
+but it can require configurations that are a bit verbose and require more of
+users than other pre-built, specific solutions.
+
+One way that Vector could get some of the best of both worlds would be to make
+it easy to create pre-built "chunks" of config that users could configure as
+normal components. These would be bundles of lower-level components wired
+together with adjusted default values for the specific use case.
+
+## Scope
+
+This RFC focuses on enabling rapid development of "composed" sources (e.g. NGINX
+logs) within our existing architecture. A more complete solution for composing
+arbitrary components is deferred to a later RFC.
+
+## Motivation
+
+We need a way to quickly assemble Vector components that address specific use
+cases. This will allow us to improve ease of use without spending significant
+development time on each individual use case. It will allow us to focus
+development time on reuseable components without forcing users to do the work of
+assembling them from scratch.
+
+## Internal Proposal
+
+There are multiple levels at which we could implement this type of
+functionality:
+
+1. Manually implement new component as config facade over one existing component
+2. Manually implement new component as config facade over one source and one
+   codec transform
+3. Manually implement new component as config expanding to arbitrary pipeline of
+   components
+4. Automatically derive new component from data describing arbitrary pipeline of
+   components
+
+We currently are at level (1), where we can do things like implement the Humio
+sink as a wrapper around the existing Splunk HEC sink.
+
+The next simplest is level (2). While it's not implemented yet, we do have
+existing plans to introduce the idea of a codec attached to sources. This would
+allow users to directly configure how to parse the incoming data as part of the
+source config itself. With that feature implemented, it would be relatively
+straightforward to do something similar to level (1) but expanding to both
+a source and an included codec.
+
+Level (3) becomes more complicated. We currently have a limited ability for
+transforms to expand to multiple transforms via `TransformConfig::expand`, and
+this could theoretically be generalized to include sources and sinks as well.
+The main problem is that this does not mesh well with the config traits as they
+currently exist and the API can be confusing. To do this properly would likely
+involve deeper changes to the config traits to better support this kind of
+staged building.
+
+Finally, layer (4) would allow defining these compositions via TOML instead of
+Rust code. This is somewhat similar to the idea of snippets that has been
+floated previously, but with a few key differences. The main one is that they
+would be built directly into Vector at compile time instead of loaded at
+runtime. This means they would need to be integrated into our build process and
+changing them would require recompiling Vector. They would also require
+a sufficiently general composition API to be exposed via TOML, which would be
+difficult to come up with for such a wide variety of potential pipelines. For
+these two reasons, I doubt that level (4) is worthwhile right now (this could
+change when/if we have more data-driven config definition in general).
+
+My proposal is that we initally focus on level (2) while collecting data on use
+cases that require level (3). It is my assumption that the largest number of
+these types of composed components will be similar to the example of the NGINX
+source. We will want to combine an existing source (file) with an existing
+transform (regex or grok parser) and provide NGINX-specific default values for
+each. Focusing on these simpler cases will dramatically decrease how much
+complexity we need to add before being able to reap the value.
+
+## Rationale
+
+This set of changes unblocks the most user-facing value with the least required
+investment, and it does so without compromising future plans for deeper
+architectural changes.
+
+## Plan of Attack
+
+- [ ] Implement `TransformFn` from the [Architecture
+    RFC](https://github.com/timberio/vector/blob/master/rfcs/2020-06-18-2625-architecture-revisit.md),
+    switch non-task transforms to it
+- [ ] Add `Vec<dyn TransformFn>` field to `Pipeline`
+- [ ] Implement composed sources as facades that prepend the relevant `TransformFn`
+    to the `Pipeline` passed to `SourceConfig::build`
+- [ ] Move `event_processed` internal events to topology wrappers instead of
+    components themselves to avoid double counting or incorrect tagging (likely
+    within `impl Transform for TransformFn` for now)
+
+Then later we can choose to push towards level (3) as needed:
+
+- [ ] Make `TransformConfig::expand` into first-class stage, splitting the
+    existing config `build` methods
+- [ ] Allow new expansion stage to work for all components, not just transforms
+- [ ] Consider introducing more fine-grained internal component types designed
+    to be composed into user-facing sources, transforms, and sinks