feat(remap transform): named function arguments #3927
Conversation
src/mapping/parser/grammar.pest (outdated):

  quoted_path_segment = ${ "\"" ~ inner_quoted_string ~ "\"" ~ path_index? }
  target_path = @{ ("." ~ (path_segment | quoted_path_segment))+ }

  // Functions
- function = {
+ root_function = {
      deletion |
      only_fields
  }
We'll probably want to get rid of these as well. As suggested in the OP, I think the best course of action is for all functions to always return a value, with these always returning `null`?
Alternatively, `deletion` could return the value of the deleted field. However, right now it accepts an arbitrary number of fields to remove, so we'd need to update its signature to only accept one field per call.
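A minimal sketch of that idea (all names hypothetical; the real `deletion` function lives in the mapping module): restricting the call to a single field lets it return the removed value, much like `HashMap::remove` does.

```rust
use std::collections::HashMap;

// Hypothetical single-field `del`: removes one field and returns its old
// value (or None if the field was absent), mirroring HashMap::remove.
fn del(event: &mut HashMap<String, String>, field: &str) -> Option<String> {
    event.remove(field)
}

fn main() {
    let mut event = HashMap::new();
    event.insert("foo".to_string(), "bar".to_string());

    // Deleting an existing field yields its previous value.
    assert_eq!(del(&mut event, "foo"), Some("bar".to_string()));
    // Deleting a missing field yields None, which would map to `null`
    // in the remap language.
    assert_eq!(del(&mut event, "foo"), None);
    println!("ok");
}
```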
Another option would be to treat these two as named "operators" instead of functions, thus eliminating the need for them to have a value at all. That may complicate things, though. As a function, I would tend to agree more with your second proposal. `only_fields` could possibly return the old event, but I don't know that that would be particularly useful.
That's indeed another approach. I'm inclined to still start with the simplest set-up that treats all functions equally (and thus would enforce all of them to return a value, and be equally usable in all valid positions in the code), just to start with a consistent set-up with the least amount of parser logic.
But, if we think we're going to need more of these "operators" going forward, and we want to invest in having that option right now, we could move in that direction.
I haven't done anything with this comment yet. I'll probably tackle this in a follow-up PR, as to not hold up this one for too long.
Regarding this, I think #3837 is relevant. If all functions, including …
This one is ready for another round of reviews.
I still need to expand a few more unit tests, and do some extra benchmarking, but overall this is 90% complete and in a good enough shape for a proper review.
  }

  impl ToStringFn {
      #[cfg(test)]
Production code now uses `TryFrom<ArgumentList>`, but I've kept the `new` functions around, as it makes testing much easier (i.e. way less casting to `Literal` and `Box<dyn Function>`).
@@ -32,12 +119,16 @@ impl Function for NotFn {
  #[derive(Debug)]
  pub(in crate::mapping) struct ToStringFn {
      query: Box<dyn Function>,
-     default: Option<Value>,
+     default: Option<Box<dyn Function>>,
All arguments are now lazily evaluated. This has both advantages and disadvantages.

Advantages:
- all values can uniformly be provided by other event fields (e.g. `to_string(.nope, default = .foo)`)
- values are not computed unless required (e.g. `to_string(.foo, default = .nope)`)

Disadvantages:
- value type validation has to happen at runtime
- some computationally heavy logic has to be pushed from compile time to runtime (e.g. parsing timestamp string formats)

I think this is a good place to start, but we can optionally expand function parameters to expose whether they accept computed values or not.

Alternatively, functions can try to downcast a `dyn Function` to a `Literal`, in which case the resulting literal can be used without the runtime context, which allows the function to pre-compute whatever complex logic it needs, to improve performance.
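The downcast idea can be sketched roughly like this (simplified, hypothetical stand-ins for `Value`, `Function`, and `Literal`; the real trait has an `execute` taking runtime context):

```rust
use std::any::Any;

// Simplified stand-ins for the mapping types discussed above.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Integer(i64),
}

trait Function {
    fn execute(&self) -> Result<Value, String>;
    fn as_any(&self) -> &dyn Any;
}

struct Literal(Value);

impl Function for Literal {
    fn execute(&self) -> Result<Value, String> {
        Ok(self.0.clone())
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// When building a function, try to downcast an argument to a `Literal`;
// on success its value is known without any runtime context, so expensive
// work (regex compilation, format parsing) could happen here instead.
fn precompute(arg: &dyn Function) -> Option<Value> {
    arg.as_any().downcast_ref::<Literal>().map(|lit| lit.0.clone())
}

fn main() {
    let arg: Box<dyn Function> = Box::new(Literal(Value::Integer(42)));
    assert_eq!(precompute(arg.as_ref()), Some(Value::Integer(42)));
    println!("ok");
}
```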
I think trying to evaluate any constant values to literals would be a very good idea, but probably beyond the scope of this PR. It's worth considering a benchmark to see how much impact this has, though.
I haven't done a before/after benchmark for the remap changes themselves, but I have added a few more benchmarks to compare remap vs regular transforms, and those look okay to me:

parse JSON with remap         time: [1.4285 us 1.4439 us 1.4610 us]
parse JSON with json_parser   time: [1.4625 us 1.4788 us 1.4978 us]
coerce with remap             time: [2.4033 us 2.4255 us 2.4500 us]
coerce with coercer           time: [2.1415 us 2.1752 us 2.2134 us]

The `coerce with remap` benchmark parses the timestamp format string at runtime right now, whereas before it was done at compile time, so there might be some overhead there that causes it to be ~0.3us slower than the regular coercer, but I'd call that good enough for now.
We can use a separate PR to start playing with optimizing functions by trying to downcast fields and store concrete fields in the function struct if the downcast worked.
The existing benchmark is on par with what was tested before:

# before
add fields with remap         time: [1.3603 us 1.3614 us 1.3625 us]
add fields with add_fields    time: [2.2038 us 2.2086 us 2.2152 us]

# now
add fields with remap         time: [1.3581 us 1.3825 us 1.4071 us]
add fields with add_fields    time: [2.6247 us 2.6537 us 2.6823 us]
  type Error = String;

  fn try_from(mut arguments: ArgumentList) -> Result<Self> {
      let query = arguments.required("value")?;
This should never return an error, since the parser validates at compile time that required arguments are provided. I opted to still make this explicit though, by having it return a `Result<Argument>`, as opposed to `optional`, which returns an `Option<Argument>`.
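A rough sketch of that `required` vs `optional` distinction (hypothetical simplified types; the real `ArgumentList` holds `Box<dyn Function>` arguments):

```rust
use std::collections::HashMap;

// Hypothetical, simplified argument container.
#[derive(Debug, Clone, PartialEq)]
struct Argument(&'static str);

struct ArgumentList(HashMap<&'static str, Argument>);

impl ArgumentList {
    // The parser already guarantees required arguments are present, but the
    // error case stays explicit instead of unwrapping an Option.
    fn required(&mut self, name: &str) -> Result<Argument, String> {
        self.0
            .remove(name)
            .ok_or_else(|| format!("missing required argument: {}", name))
    }

    fn optional(&mut self, name: &str) -> Option<Argument> {
        self.0.remove(name)
    }
}

fn main() {
    let mut args = ArgumentList(HashMap::from([("value", Argument(".foo"))]));
    assert_eq!(args.required("value"), Ok(Argument(".foo")));
    assert_eq!(args.optional("default"), None);
    assert!(args.required("value").is_err()); // already taken
    println!("ok");
}
```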
      v => unexpected_type!(v),
  };

  let conversion: Conversion = format.parse().map_err(|e| format!("{}", e))?;
This is an example of computation moving from compile time to runtime. I think this is fine for now but, as mentioned, we could downcast to `Literal` and create the `Conversion` at compile time, if possible.
Signed-off-by: Jean Mertz <git@jeanmertz.com>
The minutest of possible changes, and a possible query which I'm not sure is an issue or not.
src/event/value.rs (outdated):

@@ -18,6 +18,33 @@ pub enum Value {
      Null,
  }

+ #[derive(PartialEq, Debug, Clone, Copy, is_enum_variant)]
+ pub enum ValueKind {
We will be needing a parameter type for regular expressions, so there won't be a one-to-one correspondence with `Value`. I haven't found anywhere in the code where this is going to cause a problem, but it is worth being aware of. Can you think of any issues?
Interesting. Right now the remap language closely ties itself to `Value`, and that makes sense, since you want to be able to assign whatever return value you get to fields within the event. Given that we still want to be able to provide field values as arguments to functions (e.g. `to_string(.foo, default = .bar)`), we need to stay in lock-step with `Value`.

Obviously the simplest solution is to accept a string value and parse it as a regular expression, but that's fairly weakly typed, and requires us to validate the regular expression at runtime (which might well be expensive).

I think the best course of action for now is still to go this route, and then take this comment I made earlier into consideration for future improvements:
> I think this is a good place to start, but we can optionally expand function parameters to expose whether they accept computed values or not.
> Alternatively, functions can try to downcast a `dyn Function` to a `Literal`, in which case the resulting literal can be used without the runtime context, which allows the function to pre-compute whatever complex logic it needs, to improve performance.
I do think there's value in being able to compute the regular expression (or any value) at runtime, so even if we had an explicit regular-expression value type for inputs, I'd still like it if you could supply that regular expression from another field, which kind of forces us to (also) support stringified regular expressions.
We could of course update the PEG to parse `/<..>/` and have it be converted to `Value::Bytes`; that way you don't have to write them in quotation marks. However, without some more plumbing work, that would only be an alias, so I do think we still want to be able to capture that type as being different from regular strings, and allow parameters to limit their input types to this regular expression type.
I think a possible solution would be to change `parameter.kinds` to be a list of enums, something like:
enum InputKind {
Value(ValueKind),
Regex,
}
A function would then (by choice) accept both `InputKind::Value(ValueKind::Bytes)` and `InputKind::Regex`. The former would compute the regex at runtime (but supports dynamic input from other fields), the latter could do the computation at compile time.

There are probably other considerations to discuss, but this seems like a sensible way forward to allow specialized input values that don't map to `Value`, while still supporting dynamic field input.

Regardless, supporting this is non-trivial, as all plumbing within the parser currently expects to work with `Value` objects; almost all elements are parsed into a `dyn Function`, which itself requires `execute(...) -> Result<Value>` to be implemented.
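The parameter-kind check could look roughly like this (a sketch under assumed names; `ValueKind`'s real variant list is larger, and `Parameter` here is a hypothetical simplification):

```rust
// Parameters declare which argument kinds they accept, and the parser
// checks each supplied argument against that list.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ValueKind {
    Bytes,
    Integer,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum InputKind {
    Value(ValueKind),
    Regex,
}

struct Parameter {
    name: &'static str,
    accepts: &'static [InputKind],
}

fn check(param: &Parameter, arg: InputKind) -> Result<(), String> {
    if param.accepts.contains(&arg) {
        Ok(())
    } else {
        Err(format!("parameter {:?} does not accept {:?}", param.name, arg))
    }
}

fn main() {
    // A hypothetical `pattern` parameter: a string is compiled to a regex
    // at runtime, a regex literal could be compiled at parse time.
    let pattern = Parameter {
        name: "pattern",
        accepts: &[InputKind::Value(ValueKind::Bytes), InputKind::Regex],
    };
    assert!(check(&pattern, InputKind::Regex).is_ok());
    assert!(check(&pattern, InputKind::Value(ValueKind::Integer)).is_err());
    println!("ok");
}
```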
Yes, that's true. But take, for example, the `replace` function. The parameter takes both a string and a regular expression. If you do supply the regex from another field, you would need a way of specifying whether the field is a regex or a string. There are also additional flags that can be set on a regex, such as ignore case, which may not fit in a string.

What about if we created a special function that takes a string and returns a regex? So you could explicitly call it if you wanted a regex from a field:

.foo = replace(Regex(.field, ignorecase = true), "something")

The parser could parse a static regex string down to this. So `/a/i` would actually be `Regex("a", ignorecase = true)`.

Would there be a way to make that fit?
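The proposed desugaring of a regex literal like `/a/i` into a `(pattern, flags)` pair can be sketched with plain string handling (hypothetical helper; a real implementation would live in the Pest parser and support more flags):

```rust
// Desugars "/a/i" into ("a", ignorecase = true), i.e. the proposed
// Regex("a", ignorecase = true) form. Returns None if the input is not
// delimited by slashes.
fn parse_regex_literal(input: &str) -> Option<(String, bool)> {
    let rest = input.strip_prefix('/')?;
    let end = rest.rfind('/')?;
    let pattern = rest[..end].to_string();
    let flags = &rest[end + 1..];
    let ignorecase = flags.contains('i');
    Some((pattern, ignorecase))
}

fn main() {
    assert_eq!(parse_regex_literal("/a/i"), Some(("a".to_string(), true)));
    assert_eq!(parse_regex_literal("/foo/"), Some(("foo".to_string(), false)));
    assert_eq!(parse_regex_literal("foo"), None);
    println!("ok");
}
```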
> If it took a parameter of a closure, `Fn(Regex) -> Value`, then anything that needed to use a regex would pass a closure to do the regex-specific work it needs to do.

Right. My point was about this comment you made:

> What about if we created a special function that takes a string and returns a regex?

In that case, the function signature would have to be `Fn(String) -> Regex` (it would probably return an enum, `Fn(String) -> Argument::Regex`), which we can't expose to the end-user in the way we expose `to_string` or `md5`, etc. So this would be a special class of functions that can only be used as input to other functions; you can't do `.foo = to_regex("/foobar/")`.
Regardless, I agree with you that this discussion is moot, as we have a path forward that doesn't involve a function to convert a string to a regex.
We're going to update the Pest `argument` rule, something like:

  argument = { positional_item | keyword_item }
- positional_item = { query_arithmetic }
+ positional_item = { argument_item }
- keyword_item = { ident ~ "=" ~ query_arithmetic }
+ keyword_item = { ident ~ "=" ~ argument_item }
+ argument_item = { query_arithmetic | regex }
+ regex = { "/" ~ /* TODO */ ~ "/" }
Then we'll change the parser to convert the input into an enum:
enum ArgumentKind {
Value(event::Value),
Regex(regex::Regex),
}
(in actuality, `ArgumentKind::Value` would contain a `Box<dyn Function>`, as it needs to be resolved at runtime to a `Value`)
And finally, we update function signatures to have them either accept or reject `ArgumentKind::Regex`.

The only thing I'm still on the fence about is whether we should allow dynamic regexes by default on anything that accepts a regex, or if each function should build its own support for dynamic regexes, by accepting both `Value::Bytes` and `Regex` as input values for a single parameter.
Basically, should we do something like this:
enum ArgumentKind {
Value(event::Value),
Regex(Regex),
}
enum Regex {
Static(regex::Regex),
Dynamic(bytes::Bytes),
}
We'd then have the parser accept a string as regex input, and have it compile the regex at runtime by implementing `Function` for `Regex` and calling `regex::Regex::new` on the slice of bytes.
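The static/dynamic split could look roughly like this. A sketch under assumptions: `compile` is a stand-in for `regex::Regex::new` (kept dependency-free here), and the names are hypothetical.

```rust
// Stand-in for a compiled regex; the real type would be regex::Regex.
#[derive(Debug, PartialEq)]
struct CompiledRegex(String);

// Stand-in validation; regex::Regex::new would do the real work.
fn compile(pattern: &str) -> Result<CompiledRegex, String> {
    if pattern.is_empty() {
        Err("empty pattern".to_string())
    } else {
        Ok(CompiledRegex(pattern.to_string()))
    }
}

#[derive(Debug)]
enum Regex {
    Static(CompiledRegex), // compiled once, at parse time
    Dynamic(String),       // compiled on every execution
}

impl Regex {
    // Resolves to a compiled regex; only the Dynamic case pays the
    // compilation cost (and can fail) at runtime.
    fn resolve(&self) -> Result<CompiledRegex, String> {
        match self {
            Regex::Static(r) => Ok(CompiledRegex(r.0.clone())),
            Regex::Dynamic(bytes) => compile(bytes),
        }
    }
}

fn main() {
    let fixed = Regex::Static(compile("foo").unwrap());
    let from_field = Regex::Dynamic("bar".to_string());
    assert_eq!(fixed.resolve(), Ok(CompiledRegex("foo".to_string())));
    assert_eq!(from_field.resolve(), Ok(CompiledRegex("bar".to_string())));
    assert!(Regex::Dynamic(String::new()).resolve().is_err());
    println!("ok");
}
```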
Yes, that's a good approach.

Regarding dynamic regexes, I think there are functions that will accept both a String and a Regex. So, if the user is going to pass a dynamic Regex, there will need to be a way for them to explicitly say that the field is a regex; otherwise there will be no way to really know how to handle it. E.g.:

.thing = replace(.field, .pattern, .with)

Here `.pattern` could be either a String or a Regex; it will be impossible to know which. There needs to be a way to disambiguate, such as:

.thing = replace(.field, regex(.pattern), .with)
Yeah, that's right. I'm still not a fan of `regex()` though, as it still muddies the water of when a function can be used where. Again, because `.foo = regex(.pattern)` won't compile.

But I don't have a good alternative yet (other than inspecting the string to see if it matches `"/ ... /i..."`), so perhaps `regex()` is the way forward, and we just need to document it as best as possible (or use other visual clues, such as `REGEX()`, `Regex`, `input_regex`, or `argument(regex, .pattern)`, to distinguish input conversion functions from regular query functions).
That is true.

Perhaps the problem is that we have a function that can accept both a regex and a string as a parameter. The problem would go away if we had separate functions for each type of parameter: `replace` and `replace_regex`?

I think we should raise an issue for this and see if others have any ideas.
Alternatively, we could have two separate optional arguments, one to replace using a regex, one using a string.

Using `pattern` for a static or dynamic regex:
- `replace(.foo, pattern = "/bar/i", with = "...")` (must be a valid regex-formatted string)
- `replace(.foo, pattern = .bar, with = "...")` (`.bar` must resolve to a valid regex-formatted string)
- `replace(.foo, pattern = /bar/i, with = "...")` (compile-time checked to be a valid regex)

Using `string` (or `literal`, or some other term) for a static or dynamic string:
- `replace(.foo, string = "bar", with = "...")`
- `replace(.foo, string = .bar, with = "...")` (`.bar` must resolve to a string)

Using both (`replace(.foo, pattern = .bar, string = .baz, with = "...")`) would either use both to replace (with one occurring before the other) or result in a runtime error.

But yes, it's worth pulling this out into an issue, or posing the question in the relevant "add function" issue(s).
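Resolving the two mutually exclusive arguments could be sketched like this (hypothetical names; this takes the "runtime error when both are set" option from the proposal above, which is still undecided):

```rust
// The resolved replacement strategy for a hypothetical `replace` function.
enum Matcher {
    Pattern(String), // regex-formatted
    Literal(String), // plain string
}

// Exactly one of `pattern` / `string` must be supplied.
fn resolve_matcher(
    pattern: Option<String>,
    string: Option<String>,
) -> Result<Matcher, String> {
    match (pattern, string) {
        (Some(p), None) => Ok(Matcher::Pattern(p)),
        (None, Some(s)) => Ok(Matcher::Literal(s)),
        (Some(_), Some(_)) => Err("only one of `pattern` or `string` may be set".into()),
        (None, None) => Err("one of `pattern` or `string` is required".into()),
    }
}

fn main() {
    assert!(resolve_matcher(Some("/bar/i".into()), None).is_ok());
    assert!(resolve_matcher(None, Some("bar".into())).is_ok());
    assert!(resolve_matcher(Some("/bar/i".into()), Some("bar".into())).is_err());
    assert!(resolve_matcher(None, None).is_err());
    println!("ok");
}
```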
  pub fn kind(&self) -> &str {
      match self {
          Value::Bytes(_) => "string",
          Value::Timestamp(_) => "timestamp",
          Value::Integer(_) => "integer",
          Value::Float(_) => "float",
          Value::Boolean(_) => "boolean",
          Value::Map(_) => "map",
          Value::Array(_) => "array",
          Value::Null => "null",
      }
  }
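The way a `kind` accessor like the one above feeds into error reports might look like this (simplified two-variant `Value`; `expect_string` is a hypothetical helper, not the project's API):

```rust
// Simplified stand-in for the event Value enum.
#[derive(Debug)]
enum Value {
    Bytes(String),
    Integer(i64),
}

impl Value {
    // Human-readable type name, used only for error messages.
    fn kind(&self) -> &str {
        match self {
            Value::Bytes(_) => "string",
            Value::Integer(_) => "integer",
        }
    }
}

// Rejects non-string arguments with an error naming the actual kind.
fn expect_string(value: &Value) -> Result<&str, String> {
    match value {
        Value::Bytes(s) => Ok(s.as_str()),
        v => Err(format!("expected string argument, got {}", v.kind())),
    }
}

fn main() {
    let ok = Value::Bytes("ok".to_string());
    let bad = Value::Integer(1);
    assert_eq!(expect_string(&ok), Ok("ok"));
    assert_eq!(
        expect_string(&bad),
        Err("expected string argument, got integer".to_string())
    );
    println!("ok");
}
```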
I'm not sure if we want to keep this in. Right now its only purpose is to produce better error reports for the remap language, indicating that a certain type of argument value is not accepted for a given parameter. I could also move this into the remap module for now, so it doesn't become part of the `Value` public API.
I think better error reports are valuable to improve UX, so I'd keep this.
Yeah, I agree. The only thing I wasn't sure about is whether we'd want to expose this in the `Value` public API, or just have it live privately in the `mapping` module. But I'll leave it as is for now; we can always change this after the fact.
Bit of a quick pass (sorry, I promised a review today and didn't manage it). This looks good; very happy to see the FN names out of the parser file. :)
Signed-off-by: Brian Menges <brian.menges@anaplan.com>
This changes the remap language parser to:
- accept query expressions as argument values (`to_string(.foo, "barbaz")` can now become `to_string(.foo, .bar + .baz)`)
- accept named arguments (e.g. `to_string(.foo, default = .bar)`)
- allow naming positional arguments (`to_string(value = .foo, default = .bar) == to_string(.foo, .bar)`)
- make the order of named arguments irrelevant (`to_string(default = .bar, value = .foo)`)
- allow mixing named and positional arguments (`to_string(default = .bar, .foo) == to_string(.foo, default = .bar)`)

Some observations:

We still have so-called (hard-coded) "root functions" that don't return any values and can be used at the root level of the syntax: `del(.foo)` and `only_fields(.bar)`. I wonder if we should just have all functions return a value, with these always returning `null`, so that they can be slotted into the existing/new query functions system?

Because argument values are computed at runtime, and computing those values is done by the function implementations, a function can choose to not compute a value (e.g. in `to_string("foo", default = .bar)`, the `.bar` value is never computed because `"foo"` is an acceptable value). This means that any side effects introduced in the query itself will not happen, which might be unexpected unless we document this. I do still think this is the right approach, conceptually.

closes #3851
closes #3864
ref #3913