Support get'ing and set'ing remote values from VRL #17195

MadsRC · 2023-04-21T19:25:10Z

A note for the community

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Log Enrichment

Vector source produces a log:

{
  "userID": "1234",
  "message": "authentication successful"
}

A VRL remap transform enriches the log, using the userID, with data from an external system:

{
  "userID": "1234",
  "username": "MadsRC",
  "message": "authentication successful"
}

Calculating user logins

Vector source produces a log:

{
  "userID": "1234",
  "message": "authentication successful"
}

A VRL remap transform increments a counter in an external system, using userID as the key, and gets the new total. An if statement checks the new total (the total logins over a period of time) against a threshold and determines whether to produce a new message to a destination (using Vector's routing functionality) to notify that something bad is happening.

Attempted Solutions

Data enrichment can currently be done by hard-coding the enrichment data into VRL. While this is arguably faster than making several network calls to get the data, it is not very scalable or dynamic.

Proposal

I would like to see VRL, and by extension Vector, support looking up values, and potentially setting values, in a remote system, such as Redis, Memcached or maybe a Relational Database of sorts.

On top of allowing for data enrichment, this would also allow one to use VRL/Vector as a proper detection engine. While one can already use Vector/VRL for simple detections, having the ability to reference a remote state of sorts would allow for some cool event correlation use-cases.

It is pretty common to have some sort of pipeline in front of a large, expensive enterprise SIEM system like Elasticsearch or Splunk. If one used Vector in this pipeline, and Vector supported get'ing and set'ing values in remote systems, one could offload some of the costs of these enterprise systems by doing real-time detections while one is processing the data anyways.

I am imagining a VRL syntax like this:

[transforms.increment]
type = "remap"
inputs = ["logs"]
source = '''
  .username = getLookup(.userID)
  .timestamp = now()
'''

or

[transforms.increment]
type = "remap"
inputs = ["logs"]
source = '''
  .username = setLookup(.userID, .username)
  .timestamp = now()
'''

and then setting connection info for the getLookup and setLookup function in the global settings.
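To make the intended get/set semantics concrete, here is a minimal Python sketch. It is not VRL and not an existing Vector API: `FakeLookupStore` and `enrich` are hypothetical names, and an in-memory dict stands in for the remote store (Redis, Memcached, and so on).

```python
class FakeLookupStore:
    """Hypothetical in-memory stand-in for a remote key/value store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None when the key is unknown, like a cache miss.
        return self._data.get(key)

    def set(self, key, value):
        # Stores the value and returns it, so the caller can assign the result.
        self._data[key] = value
        return value


def enrich(event, store):
    """Rough equivalent of `.username = getLookup(.userID)` in the proposal."""
    username = store.get(event["userID"])
    if username is not None:
        event["username"] = username
    return event


store = FakeLookupStore()
store.set("1234", "MadsRC")

enriched = enrich({"userID": "1234", "message": "authentication successful"}, store)
unknown = enrich({"userID": "9999", "message": "authentication successful"}, store)
```

In this sketch a miss leaves the event untouched; whether a real implementation should return null, raise an error, or drop the event is an open design question.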

Alternatively, supporting specific clients could also be in scope, so that one could use some of the more specialised functions of the lookup store, such as Redis's INCR function.

An example of where Redis's INCR function would be helpful is the use case of tracking the number of failed logins:

[transforms.increment]
type = "remap"
inputs = ["logs"]
source = '''
  if .eventType == "authentication" && .outcome == "failure" {
    .failCount = incrLookup(.src, 1)
  }
'''

[transforms.route]
type = "route"
inputs = ["increment"]
[transforms.route.route]
type = "vrl"
source = ".failCount > 10"

The above example would use the incrLookup function to increment a counter in Redis by 1, where the key is the value of the src field. The function would then return the new number (or alternatively, one could call getLookup or similar on the value) which is stored in failCount. The value of failCount is then used in a route transform to determine if it should forward the event somewhere.
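The counting and routing logic described above can be sketched in plain Python. This is only an illustration under assumptions: `FakeCounterStore` and `process` are hypothetical names, and an in-memory counter stands in for Redis and its INCR command.

```python
from collections import defaultdict


class FakeCounterStore:
    """Hypothetical in-memory stand-in for a store with an increment operation."""

    def __init__(self):
        self._counts = defaultdict(int)

    def incr(self, key, amount=1):
        # Increment the counter for this key and return the new total.
        self._counts[key] += amount
        return self._counts[key]


def process(event, store, threshold=10):
    """Mirror the remap + route pair: count failures per source, then test the threshold."""
    if event.get("eventType") == "authentication" and event.get("outcome") == "failure":
        event["failCount"] = store.incr(event["src"], 1)
    # Events without a failCount never match the route condition.
    return event.get("failCount", 0) > threshold


store = FakeCounterStore()
failure = {"eventType": "authentication", "outcome": "failure", "src": "10.0.0.1"}

# Twelve failures from the same source: only the 11th and 12th exceed the threshold.
results = [process(dict(failure), store) for _ in range(12)]
```

Note that this sketch never resets the counter. A real deployment would likely need an expiry on the key so that the count covers a bounded time window.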


peacand · 2023-04-23T15:26:29Z

Hi @MadsRC,

I love this idea! I use this kind of feature a lot with Logstash, and I'm quite sad it's currently not possible with Vector.
In your VRL example you only mention a "getLookup()" function to query Redis (or others) and enrich events.
But it would be nice to also be able to push data into the cache from events, wouldn't it?
With some "addLookup(.userId, .username)", for example.

On the technical side, Redis has good support for multi-threaded clients and async operations, so it looks quite compatible with Vector and the remap task. But remap tasks are supposed to be stateless. In that case the state of the connection to the backend cache must be kept somewhere. It cannot be in the remap transform itself, so the connection to the remote cache must be managed somewhere else, globally, possibly with the issue of sharing this connection object across all remap tasks. That might be challenging.

So maybe creating a new stateful transform would be more viable than creating a new VRL function for this case?
A "remote_enrich" transform that could support getting and setting key/values in various remote network locations.

MadsRC · 2023-04-24T19:43:09Z

@peacand - You are right, it would be nice to be able to push data - It was supposed to be part of the issue, but unfortunately it slipped my mind. I've added it now.

I wouldn't be opposed to having it as a separate transform. That may even make it easier to implement the various methods/functions of the remote network location, as you could make a transform per implementation (i.e., one for Redis, one for Memcached, etc.).

jszwedko · 2023-05-01T15:00:14Z

We have thought about adding these back-ends to Vector's enrichment_table feature, though we do need to figure out how best to model it so that it's clear that significant I/O latency could be introduced.

@MadsRC is there a specific back-end you are most interested in? It sounds like Redis? We have a separate issue tracking SQL support already: #17181

peacand · 2023-05-01T16:43:43Z

I would say Redis/Memcached caches are designed and optimized for very fast access and low-latency responses, much more so than SQL. I personally prefer Redis over Memcached.
About I/O latency: I don't know about Memcached or SQL, but Redis has good support for async operations, which may add latency to the complete processing of an event but should not block the Vector pipeline.

MadsRC · 2023-05-01T17:12:38Z

@jszwedko thank you for your great work on Vector ;)

My preferred backend would be Redis. Another potential backend would be a generic HTTP backend (via GET and POST) that allowed for integration with inhouse systems - but that's mostly a nice-to-have ;)

jszwedko · 2023-05-02T20:52:15Z

Reading this again, this does feel like a bit of a different use-case than enrichment tables serve. I was going to roll it up into a general issue to add remote back-ends to enrichment tables, but will leave this open as a separate issue to allow arbitrary key/value setting/fetching from Vector.

As a workaround, users can fall back to using a lua transform. Lua seems to have clients for redis and memcached.

coredump17 · 2024-01-15T15:47:47Z

Having Redis lookups as an enrichment source would be really beneficial for me, as we currently use CSV enrichments heavily across multiple servers. Keeping the CSVs up to date on all nodes can be a pain! It would also be nice to look up and cache the value locally for a TTL to remove a lot of the latency of the external call. That framework could be extended to the DNS lookup logic also :)

Thanks for a great product.
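The local caching idea in this comment could look roughly like the following Python sketch. Everything here is an assumption for illustration: `TTLCache` and `fake_remote_lookup` are hypothetical names, the remote call is faked with a dict, and the clock is injectable so expiry can be shown without sleeping.

```python
import time


class TTLCache:
    """Cache remote lookups locally and refetch once an entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function performing the (slow) remote lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable clock, useful for testing
        self._entries = {}           # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]            # still fresh: no remote call
        value = self._fetch(key)     # missing or expired: go to the remote store
        self._entries[key] = (value, now + self._ttl)
        return value


remote_calls = []


def fake_remote_lookup(user_id):
    # Stand-in for a network call to Redis or a similar store.
    remote_calls.append(user_id)
    return {"1234": "MadsRC"}.get(user_id)


fake_now = [0.0]
cache = TTLCache(fake_remote_lookup, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("1234")    # remote call
second = cache.get("1234")   # served from the local cache
fake_now[0] = 61.0           # advance the fake clock past the TTL
third = cache.get("1234")    # entry expired, remote call again
```

The trade-off is staleness: a value changed in the remote store may not be seen locally until the TTL has elapsed.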

lsampras · 2024-04-26T19:21:22Z

Have we decided on a model or approach to solve this? And what would the scope of the feature be?

I'd love to help get this added if contributions are accepted.

(I'm looking for a Cassandra backend.)

jszwedko · 2024-04-26T20:03:04Z

Unfortunately this is likely to be a large project since there is no precedent for remote enrichment. I think it could potentially fit well as an enrichment table, though. I think the process would need to start with a proposal via an RFC.

MadsRC added the "type: feature" label · Apr 21, 2023

jszwedko added the "domain: enrichment_tables" label · May 1, 2023

jszwedko added the "type: feature" label and removed the "type: enhancement" and "domain: enrichment_tables" labels · May 2, 2023

jszwedko mentioned this issue · Apr 26, 2024: Adding support for a mutable global external store for enrichment of data #20383 (Closed)

lsampras mentioned this issue · May 14, 2024: chore(vrl): remote store/enrichment table RFC #20495 (Open)
