Support get'ing and set'ing remote values from VRL #17195

Open
MadsRC opened this issue Apr 21, 2023 · 9 comments
Labels
  • domain: vrl - Anything related to the Vector Remap Language
  • type: feature - A value-adding code addition that introduce new functionality.
  • vrl: stdlib - changes to VRL's standard library.

Comments

@MadsRC

MadsRC commented Apr 21, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Log Enrichment

Vector source produces a log:

{
"userID": "1234",
"message": "authentication successful"
}

A VRL remap transform enriches the log with data queried from an external system, using the userID as the lookup key:

{
"userID": "1234",
"username": "MadsRC",
"message": "authentication successful"
}

Calculating user logins

Vector source produces a log:

{
"userID": "1234",
"message": "authentication successful"
}

A VRL remap transform increments a counter in an external system, using userID as the key, and gets the new total. An if statement checks the new total (the number of logins over a period of time) against a threshold to decide whether to send a new message to a destination (using Vector's routing functionality) to notify that something bad is happening.

Attempted Solutions

Data enrichment can currently be done by hard-coding the enrichment data into VRL. While this is arguably faster than making several network calls to get the data, it is not very scalable or dynamic.

Proposal

I would like to see VRL, and by extension Vector, support looking up values, and potentially setting values, in a remote system, such as Redis, Memcached or maybe a Relational Database of sorts.

On top of allowing for data enrichment, this would also allow one to use VRL/Vector as a proper detection engine. While one can already use Vector/VRL for simple detections, having the ability to reference a remote state of sorts would allow for some cool event correlation use-cases.

It is pretty common to have some sort of pipeline in front of a large, expensive enterprise SIEM system like Elasticsearch or Splunk. If one used Vector in this pipeline, and Vector supported get'ing and set'ing values in remote systems, one could offload some of the costs of these enterprise systems by doing real-time detections while processing the data anyway.

I am imagining a VRL syntax like this:

[transforms.increment]
type = "remap"
inputs = ["logs"]
source = '''
  .username = getLookup(.userID)
  .timestamp = now()
'''

or

[transforms.increment]
type = "remap"
inputs = ["logs"]
source = '''
  .username = setLookup(.userID, .username)
  .timestamp = now()
'''

with the connection info for the getLookup and setLookup functions set in the global settings.
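To make the intended semantics concrete, here is a minimal Python sketch of what getLookup and setLookup might do. It is not Vector or VRL code: the `LookupStore` class is an in-memory stand-in for a remote store such as Redis, and `STORE`, `get_lookup`, `set_lookup` and `enrich` are hypothetical names chosen for illustration.

```python
class LookupStore:
    """In-memory stand-in for a remote key/value store (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None when the key is unknown.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return value


# Hypothetical: a single store configured once, globally, and shared.
STORE = LookupStore()


def get_lookup(key):
    return STORE.get(key)


def set_lookup(key, value):
    return STORE.set(key, value)


def enrich(event):
    """Mirror of the first remap example: add the username if one is known."""
    enriched = dict(event)
    username = get_lookup(event["userID"])
    if username is not None:
        enriched["username"] = username
    return enriched


set_lookup("1234", "MadsRC")
event = {"userID": "1234", "message": "authentication successful"}
result = enrich(event)
```

Here `result` carries the extra username field, while an event with an unknown userID passes through unchanged.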

Alternatively, supporting specific clients could also be in scope, so that one could use some of the more specialised functions of the lookup store, such as Redis's INCR function.

An example of where Redis's INCR function would be helpful is the use case of tracking the number of failed logins:

[transforms.increment]
type = "remap"
inputs = ["logs"]
source = '''
  if .eventType == "authentication" && .outcome == "failure" {
    .failCount = incrLookup(.src, 1)
  }
'''

[transforms.route]
type = "route"
inputs = ["increment"]
[transforms.route.route]
type = "vrl"
source = ".failCount > 10"

The above example would use the incrLookup function to increment a counter in Redis by 1, where the key is the value of the src field. The function would then return the new number (or alternatively, one could call getLookup or similar on the value) which is stored in failCount. The value of failCount is then used in a route transform to determine if it should forward the event somewhere.
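The increment-then-route flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `CounterStore` is an in-memory stand-in for Redis (the real INCR is atomic and returns the new value), and `process`, `should_route` and `FAIL_THRESHOLD` are hypothetical names that mirror the remap and route transforms.

```python
from collections import defaultdict

FAIL_THRESHOLD = 10  # same threshold as the route condition above


class CounterStore:
    """In-memory stand-in for a remote counter store (assumption)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def incr(self, key, amount=1):
        # Increment the counter and return the new total.
        self._counts[key] += amount
        return self._counts[key]


def process(event, store):
    """Mirror of the remap transform: count failed authentications per source."""
    out = dict(event)
    if out.get("eventType") == "authentication" and out.get("outcome") == "failure":
        out["failCount"] = store.incr(out["src"], 1)
    return out


def should_route(event):
    """Mirror of the route condition `.failCount > 10`."""
    return event.get("failCount", 0) > FAIL_THRESHOLD


store = CounterStore()
failure = {"eventType": "authentication", "outcome": "failure", "src": "10.0.0.1"}

# Twelve failures from the same source: only the 11th and 12th exceed the threshold.
routed = [should_route(process(failure, store)) for _ in range(12)]
```

Events that are not failed authentications never get a failCount, so they never match the route.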

References

No response

Version

No response

@MadsRC MadsRC added the type: feature A value-adding code addition that introduce new functionality. label Apr 21, 2023
@bruceg bruceg added type: enhancement A value-adding code change that enhances its existing functionality. domain: vrl Anything related to the Vector Remap Language vrl: stdlib changes to VRL's standard library. and removed type: feature A value-adding code addition that introduce new functionality. labels Apr 21, 2023
@peacand

peacand commented Apr 23, 2023

Hi @MadsRC,

I love this idea! I use this kind of feature a lot with Logstash, and I'm quite sad it's currently not possible with Vector.
In your VRL example you only mention a "getLookup()" function to query Redis (or others) and enrich events.
But it would be nice to also be able to push data into the cache from events, wouldn't it?
With some "addLookup(.userId, .username)", for example.

From a technical perspective, Redis supports multi-threaded clients and async operations well, so it looks quite compatible with Vector and the remap task. But remap tasks are supposed to be stateless, so the state of the connection to the backend cache must be kept somewhere. It cannot live in the remap transform itself, so the connection to the remote cache must be managed somewhere else, globally. Sharing this connection object between all remap tasks may then be an issue. Might be challenging.

So maybe creating a new stateful transform would be more viable than creating a new VRL function for this case?
A "remote_enrich" transform that could support get/set of key/values in various remote network locations.

@MadsRC
Author

MadsRC commented Apr 24, 2023

@peacand - You are right, it would be nice to be able to push data - It was supposed to be part of the issue, but unfortunately it slipped my mind. I've added it now.

I wouldn't be opposed to having it as a separate transform. That may even make it easier to implement the various methods/functions of the remote network location, as you could make a transform per implementation (i.e., one for Redis, one for Memcached, etc.).

@jszwedko jszwedko added the domain: enrichment_tables Anything related to the Vector's enrichment tables label May 1, 2023
@jszwedko
Member

jszwedko commented May 1, 2023

We have thought about adding these back-ends to Vector's enrichment_table feature, though we do need to figure out how best to model it so that it's clear that significant I/O latency could be introduced.

@MadsRC is there a specific back-end you are most interested in? It sounds like Redis? We have a separate issue tracking SQL support already: #17181

@peacand

peacand commented May 1, 2023

I would say Redis and Memcached are designed and optimized for very fast access and low-latency responses, much more so than SQL. I personally prefer Redis over Memcached.
As for I/O latency, I don't know about Memcached or SQL, but Redis supports async operations well, which may add latency to the complete processing of an event but should not block the Vector pipeline.
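The point about async lookups can be illustrated with a small Python asyncio sketch. It does not use a real Redis client: `remote_get` simulates a network round trip with a short sleep, and `USERNAMES`, `stats`, `enrich` and `run_batch` are hypothetical names. Each event still waits for its own lookup, but the lookups overlap instead of queuing behind one another.

```python
import asyncio

USERNAMES = {"1234": "MadsRC"}  # hypothetical remote data
stats = {"in_flight": 0, "max_in_flight": 0}


async def remote_get(key):
    """Simulated remote lookup; tracks how many lookups overlap."""
    stats["in_flight"] += 1
    stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for a network round trip
    stats["in_flight"] -= 1
    return USERNAMES.get(key)


async def enrich(event):
    username = await remote_get(event["userID"])
    return {**event, "username": username}


async def run_batch(events):
    # All lookups are started before any of them completes.
    return await asyncio.gather(*(enrich(e) for e in events))


events = [{"userID": "1234", "message": "authentication successful"} for _ in range(5)]
results = asyncio.run(run_batch(events))
```

After the run, `stats["max_in_flight"]` shows that all five lookups were pending at the same time, which is the sense in which per-event latency need not stall the rest of the batch.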

@MadsRC
Author

MadsRC commented May 1, 2023

@jszwedko thank you for your great work on Vector ;)

My preferred backend would be Redis. Another potential backend would be a generic HTTP backend (via GET and POST) that allowed for integration with in-house systems - but that's mostly a nice-to-have ;)

@jszwedko
Member

jszwedko commented May 2, 2023

Reading this again, this does feel like a bit of a different use-case than enrichment tables serve. I was going to roll it up into a general issue to add remote back-ends to enrichment tables, but will leave this open as a separate issue to allow arbitrary key/value setting/fetching from Vector.

As a workaround, users can fall back to using a lua transform. Lua seems to have clients for redis and memcached.

@jszwedko jszwedko added type: feature A value-adding code addition that introduce new functionality. and removed type: enhancement A value-adding code change that enhances its existing functionality. domain: enrichment_tables Anything related to the Vector's enrichment tables labels May 2, 2023
@coredump17

Having Redis lookups as an enrichment source would be really beneficial for me, as we currently use CSV enrichments heavily across multiple servers. Keeping the CSVs up to date on all nodes can be a pain! It would also be nice to look up a value and cache it locally for a TTL, to remove a lot of the latency of the external call. That framework could also be extended to the DNS lookup logic :)

thanks for a great product.
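The idea of caching a remote value locally for a TTL could look roughly like the Python sketch below. It is a sketch under assumptions, not a Vector feature: `TTLCache` is a hypothetical wrapper, `fetch_username` stands in for the remote call, and the clock is injected so the expiry can be shown without waiting.

```python
import time


class TTLCache:
    """Caches results of a fetch function locally for a fixed time-to-live."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no remote call
        value = self._fetch(key)  # missing or expired: go remote
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []  # records every simulated remote call


def fetch_username(key):
    """Stand-in for the external lookup (assumption)."""
    calls.append(key)
    return {"1234": "MadsRC"}.get(key)


fake_now = [0.0]  # controllable clock for the demonstration
cache = TTLCache(fetch_username, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("1234")   # remote call
second = cache.get("1234")  # served from the local cache
fake_now[0] = 61.0
third = cache.get("1234")   # entry expired, remote call again
```

Three lookups result in two remote calls. The trade-off is staleness: a value changed remotely is not seen until the cached entry expires.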

@lsampras
Contributor

Have we decided on a model or approach to solve this? And what would the extent of the feature be?

I'd love to help getting this added if contributions are accepted....

(I'm looking for a Cassandra backend).

@jszwedko
Member

Unfortunately this is likely to be a large project since there is no precedent for remote enrichment. I think it could potentially fit well as an enrichment table, though. I think the process would need to start with a proposal via an RFC.
