Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(vrl): remote store/enrichment table RFC #20495

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

lsampras
Copy link
Contributor

chore(vrl): remote store/enrichment table RFC

Readable

External stores support for enrichment discussed here: #17195

@lsampras
Copy link
Contributor Author

@jszwedko Can you take a look at this?

@jszwedko
Copy link
Member

Thanks @lsampras ! I was OOO last week, but we'll give this a review this week.

@StephenWakely
Copy link
Contributor

Thank you for writing this. It is definitely something that could be quite useful.

The biggest show stopper to getting this implemented is the lack of async within VRL. Any io operations would halt processing on that thread, which would be terrible for performance. One of the guiding principles behind Vector is that there should be no performance footguns.

Existing enrichment tables work by loading the data into memory at startup. This would not be doable for these proposals given the need to also write to the database.

I think there are several possible ways to move forward with this.

  1. VRL is updated to allow functions to be async. This would be a fairly large undertaking and in the past there has been little appetite for it. But in light of this it may be be worth revisiting if there is enough demand for it.
  2. The initial data could be loaded into memory. On updates, a writer thread is signaled. The writer thread then updates the disk storage in the background, updates the memory store and updates the indexes, once all the updates are complete the old store is swapped with the new one. This would work with the caveat that writes would not be visible straight away due to the time taken to build the indexes and each data store could not be shared by multiple Vector instances since there would be no way to signal to the other instances that they need to re-read from the store.. This could potentially be made to work, but it's clunky.
  3. Something else that I haven't thought of. Very open to other suggestions!

@lukesteensen
Copy link
Member

The main problem with any kind of async VRL function is unpredictable and likely extremely poor performance. Normal programming languages can just say that this is the responsibility of the user to manage, but the purpose of VRL is to be reliably fast, deterministic, etc. Strong constraints on what you're able to do is how we achieve that (see the lack of even things like basic loops).

I would actually suggest a different pattern for this kind of read/write use case. Instead of a single instance of VRL that is trying to do both, I would split the reads and writes into different pipelines. The write path, rather than needing to write to an enrichment table, would do the appropriate transformations and then write to the datastore via a normal sink. The read path could then use an enrichment table to read from that same datastore. This would require support for enrichment tables backed by remote datastores, but that's a pretty natural extension of the existing feature. One important caveat is that we'd still require there to be some level of local caching of queries to the remote datastore, because we do not want to block event processing in VRL on network calls.

If you need data that is perfectly up to date and not subject to stale caches like above, the other option would be to keep everything local to the vector instance. We could do this by introducing something very similar to the reduce transform that exposes it's state as an enrichment table. This would allow updating the state, but would not need to be async or blocking in any way because it would only work on local data. If persistence is required, it could support periodic checkpointing to disk or object storage, or anything else as long as it is not in the hot path of data processing.

Finally, the escape hatch option is always Lua. You can do pretty much anything you want there and simply deal with the performance consequences as they come up.

@lsampras
Copy link
Contributor Author

@lukesteensen @StephenWakely thanks for your reviews,
I could gather the following points from above

  1. Currently VRL is meant for purely compute heavy transforms and adding IO tasks here would cause performance problems and additional footguns.
  2. This feature deviates significantly from the VRL & enrichment table principles and can cause poor performance.
  3. There isn't any way to avoid the async nature here.
    1. Even if we add caching the data would needed to be fetched for the first time
    2. I believe users should be given info about whether saving to an external data store succeeded (this could even be incorporated in the end-to-end acknowledgements).

For now I can think of 2 ways to solve this

  1. Add additional transforms with sole responsibility to get/set data to external storesa
    1. These transforms can fetch the data and populate it in the event or set the data from a key to an external store.
    2. This would clearly segregate the IO-heavy transforms that can cause a bottleneck and easily reveal any problems created due to network/IO tasks.
  2. Update VRL to allow async functions (maybe opt-in)
    1. VRL already supports concurrent execution adding async support would allow it to support high throughput even with IO calls.
    2. VRL would also provide better ergonomics in terms of error handling and allowing users to chain IO calls.

Would we be fine with any of this? While Lua serves as an escape hatch having some sort of native support would add more safety & performance here.

@sandervandegeijn
Copy link

sandervandegeijn commented Jul 13, 2024

I'm looking into Vector and was missing support for this. We are using memcached with logstash to normalise user names and to enrich records from our CMDB. It's simple, but it works like a charm and enables us to provide more context in our logs that are written to the SIEM.

An additional transform sounds like a good idea and makes the user aware of potential performance pitfalls.

@coredump17
Copy link

This would also be very useful for me.

@jszwedko jszwedko requested a review from a team as a code owner October 3, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants