Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(remap transform): named function arguments #3927

Merged
merged 18 commits into from
Sep 24, 2020
Merged

Conversation

JeanMertz
Copy link
Contributor

@JeanMertz JeanMertz commented Sep 16, 2020

This changes the remap language parser to:

  1. Move hard-coded function signatures from the PEG into the Rust code. This also makes it easier to add new functions going forward (and enables the functionality listed next).
  2. Have more robust validity checks for function signatures:
    • invalid function arity
    • missing required fields
    • unknown keyword arguments
    • invalid argument types (this one is checked at runtime, by necessity)
  3. Uniformly accept queries for all function arguments (e.g. to_string(.foo, "barbaz") can now become to_string(.foo, .bar + .baz))
  4. Accept both positional and keyword arguments for all functions (e.g. to_string(.foo, default = .bar))
    • all positional arguments can be written as keyword arguments, and vice versa (e.g. to_string(value = .foo, default = .bar) == to_string(.foo, .bar))
    • keyword arguments can occupy any position (e.g. to_string(default = .bar, value = .foo))
    • positional arguments can occupy any position, position counting ignores keyword arguments (e.g. to_string(default = .bar, .foo) == to_string(.foo, default = .bar)

Some observations:

  • We still have so-called (hard-coded) "root functions" that don't return any values and can be used at the root-level of the syntax: del(.foo) and only_fields(.bar). I wonder if we should just have all functions return a value, with these always returning null, so that they can be slotted into the existing/new query functions system?

  • Because argument values are computed at runtime and computing those values is done by the function implementations, a function can choose to not compute a value (e.g. in to_string("foo", default = .bar), the .bar value is never computed because "foo" is an acceptable value), this means that any side effects introduced in the query itself will not happen, which might be unexpected, unless we document this. I do still think this is the right approach, conceptually.

closes #3851
closes #3864

ref #3913

quoted_path_segment = ${ "\"" ~ inner_quoted_string ~ "\"" ~ path_index? }

target_path = @{ ("." ~ (path_segment | quoted_path_segment))+ }

// Functions
function = {
root_function = {
deletion |
only_fields
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want to get rid of these as well. As suggested in OP, I think the best course of action is for all functions to always return a value, and these to always returning null?

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, deletion could return the value of the deleted field. However, right now it accepts an arbitrary number of fields to remove, so we'd need to update its signature to only accept one field per call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to treat these two as named "operators" instead of functions, thus eliminating the need for them to have a value at all. That may complicate things though. As a function I would tend to agree more with your second proposal. only_fields could possibly return the old event, but I don't know that that would be particularly useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed another approach. I'm inclined to still start with the simplest set-up that treats all functions equally (and thus would enforce all of them to return a value, and be equally useable in all valid positions in the code), just to start with a consistent set-up with the least amount of parser logic.

But, if we think we're going to need more of these "operators" going forward, and we want to invest in having that option right now, we could move in that direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done anything with this comment yet. I'll probably tackle this in a follow-up PR, as to not hold up this one for too long.

src/mapping/parser/grammar.pest Outdated Show resolved Hide resolved
src/mapping/parser/mod.rs Outdated Show resolved Hide resolved
src/mapping/parser/mod.rs Outdated Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
tests/behavior/transforms/remap.toml Outdated Show resolved Hide resolved
@JeanMertz JeanMertz changed the title feat(remap transform): generalize function path arguments parser syntax feat(remap transform): named function arguments Sep 16, 2020
quoted_path_segment = ${ "\"" ~ inner_quoted_string ~ "\"" ~ path_index? }

target_path = @{ ("." ~ (path_segment | quoted_path_segment))+ }

// Functions
function = {
root_function = {
deletion |
only_fields
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to treat these two as named "operators" instead of functions, thus eliminating the need for them to have a value at all. That may complicate things though. As a function I would tend to agree more with your second proposal. only_fields could possibly return the old event, but I don't know that that would be particularly useful.

src/mapping/parser/grammar.pest Outdated Show resolved Hide resolved
src/mapping/parser/grammar.pest Outdated Show resolved Hide resolved
src/mapping/parser/mod.rs Outdated Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
Base automatically changed from jean/remap-upcase to master September 17, 2020 07:51
@JeanMertz JeanMertz force-pushed the jean/remap-named-args branch 3 times, most recently from e175a72 to ea98203 Compare September 17, 2020 11:19
@StephenWakely
Copy link
Contributor

I wonder if we should just have all functions return a value, with these always returning null, so that they can be slotted into the existing/new query functions system?

Regarding this, I think #3837 is relevant. If all functions, including if returned a value, we wouldn't need to implement a ternary operator. Or at least if we did, it would be simple syntactic sugar for the if statement.

@JeanMertz JeanMertz force-pushed the jean/remap-named-args branch from 578bd5a to b27ae50 Compare September 22, 2020 11:47
Copy link
Contributor Author

@JeanMertz JeanMertz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is ready for another round of reviews.

I still need to expand a few more unit tests, and do some extra benchmarking, but overall this is 90% complete and in a good enough shape for a proper review.

src/mapping/parser/mod.rs Show resolved Hide resolved
src/mapping/parser/mod.rs Outdated Show resolved Hide resolved
}

impl ToStringFn {
#[cfg(test)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production code now uses TryFrom<ArgumentList>, but I've kept the new functions around, as it makes testing much easier (i.e. way less casting to Literal and Box<dyn Function>).

@@ -32,12 +119,16 @@ impl Function for NotFn {
#[derive(Debug)]
pub(in crate::mapping) struct ToStringFn {
query: Box<dyn Function>,
default: Option<Value>,
default: Option<Box<dyn Function>>,
Copy link
Contributor Author

@JeanMertz JeanMertz Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all arguments are now lazily evaluated.

This has both advantages and disadvantages:

advantages:

  • all values can uniformly be provided by other event fields (e.g. to_string(.nope, default = .foo))
  • values are not computed unless required (e.g. to_string(.foo, default = .nope))

disadvantage:

  • value type validation has to happen at runtime
  • some computational-heavy logic has to be pushed from compile to runtime (e.g. parsing timestamp string formats)

I think this is a good place to start, but we can optionally expand function parameters to expose if they accept computed values or not.

Alternatively, functions can try to downcast a dyn Function to a Literal, in which case the resulting literal can be used without the runtime context, which allows the function to pre-compute whatever complex logic it needs, to improve performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think trying to evaluate any constant values to literals would be a very good idea, but probably beyond the scope of this PR. It's worth considering a benchmark to see how much impact this has, though.

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done a before/after benchmark for the remap changes itself, but I have added a few more benchmarks to compare remap vs regular transforms, and those look okay to me:

parse JSON with remap   time:   [1.4285 us 1.4439 us 1.4610 us]
parse JSON with json_parser                                                                             
                        time:   [1.4625 us 1.4788 us 1.4978 us]

coerce with remap       time:   [2.4033 us 2.4255 us 2.4500 us]
coerce with coercer     time:   [2.1415 us 2.1752 us 2.2134 us]

The coerce with remap benchmark parses the timestamp format string at runtime right now, whereas before it was done at compile-time, so there might be some overhead there that causes it to be ~0.3us slower than the regular coercer, but I'd call that good enough for now.

We can use a separate PR to start playing with optimizing functions by trying to downcast fields and store concrete fields in the function struct if the downcast worked.

The existing benchmark is on par with what was tested before:

# before
add fields with remap   time:   [1.3603 us 1.3614 us 1.3625 us]
add fields with add_fields                                                                             
                        time:   [2.2038 us 2.2086 us 2.2152 us]

# now
add fields with remap   time:   [1.3581 us 1.3825 us 1.4071 us]
add fields with add_fields                                                                             
                        time:   [2.6247 us 2.6537 us 2.6823 us]

src/mapping/query/functions.rs Outdated Show resolved Hide resolved
type Error = String;

fn try_from(mut arguments: ArgumentList) -> Result<Self> {
let query = arguments.required("value")?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never return an error, since the parser validates at compile-time that required arguments are provided. I opted to still make this explicit though, by having it return a Result<Argument> as opposed to optional, which returns Option<Argument>.

v => unexpected_type!(v),
};

let conversion: Conversion = format.parse().map_err(|e| format!("{}", e))?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of computation moving from compile– to runtime. I think this is fine for now, but as mentioned, we could downcast to Literal and create the Conversion at compile time, if possible.

@JeanMertz JeanMertz marked this pull request as ready for review September 22, 2020 12:47
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
@JeanMertz JeanMertz force-pushed the jean/remap-named-args branch from e6fecdd to 7960ecc Compare September 22, 2020 12:49
Copy link
Contributor

@StephenWakely StephenWakely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minutest of possible changes and a possible query which I'm not sure if it could be an issue or not.

@@ -18,6 +18,33 @@ pub enum Value {
Null,
}

#[derive(PartialEq, Debug, Clone, Copy, is_enum_variant)]
pub enum ValueKind {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will be needing a parameter type for regular expressions - so there won't be a one to one correspondence with Value. I haven't found anywhere in the code where this is going to cause a problem, but it is worth being aware of. Can you think of any issues?

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Right now the remap language closely ties itself to Value, and that makes sense, since you want to be able to assign whatever return value you get to fields within the event.

Given that we want to still be able to provide field values as arguments to functions (e.g. to_string(.foo, default = .bar)), we need to stay in lock-step with Value.

Obviously the simplest solution is to accept a string value, and parse it as a regular expression, but that's fairly weakly typed, and requires us to validate the regular expression at runtime (which might will be expensive).

I think the best course of action for now is still to go this route, and then take this comment I made earlier into consideration for future improvements:

I think this is a good place to start, but we can optionally expand function parameters to expose if they accept computed values or not.

Alternatively, functions can try to downcast a dyn Function to a Literal, in which case the resulting literal can be used without the runtime context, which allows the function to pre-compute whatever complex logic it needs, to improve performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think there's value in being able to compute the regular expression (or any value) at runtime, so even if we had an explicit regular-expression value type for inputs, I'd still like it if you could supply that regular expression from another field, which kind of forces us to (also) support stringified regular expressions.

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could of course update the PEG to parse /<..>/ and have it be converted to Value::Bytes, that way you don't have to write them in quotation marks.

However, without some more plumbing work, that would only be an alias, so I do think we still want to be able to capture that type as being different from regular strings, and allow parameters to limit their input types to this regular expression type.

I think a possible solution would be:

We change parameter.kinds to be a list of enums, something like:

enum InputKind {
	Value(ValueKind),
	Regex,
}

A function would then (by choice) accept both InputKind::Value(ValueKind::Bytes) and InputKind::Regex. The former would compute the regex at runtime (but supports dynamic input from other fields), the latter could do the computation at compile-time.

There's probably other considerations to discuss, but this seems like a sensible way forward to allow specialized input values that don't map to Value, while still supporting dynamic field input.

Regardless, supporting this is non-trivial, as all plumbing within the parser currently expects to work with Value objects, almost all elements are parsed into a dyn Function, which itself requires execute(...) -> Result<Value> to be implemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's true. But take, for example the replace function. The parameter takes both a string and a regular expression. If you do supply the regex from another field, you would need a way of specifying if the field is a regex or a string. There are also additional flags that can be set on a regex - such as ignore case, which may not fit in a string.

What about if we created a special function that takes a string and returns a regex? So you could explicitly call it if you wanted a regex from a field:

.foo = replace(Regex(.field, ignorecase = true), "something")

The parser could parse a static regex string down to this. So /a/i would actually be Regex("a", ignorecase = true).

Would there be a way to make that fit?

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it took a parameter of a closure - Fn (Regex) -> Value then anything that needed to use a regex would pass a closure to do the regex specific work it needs to do.

Right. My point was about this comment you made:

What about if we created a special function that takes a string and returns a regex?

In that case, the function signature would have to be Fn(String) -> Regex (it would probably return an enum Fn(String) -> Argument::Regex), which we can't expose to the end-user similar to how we expose to_string or md5, etc.

So, this would be a special class of functions that can only be used as input to functions, you can't do .foo = to_regex("/foobar/").


Regardless, I agree with you that this discussion is moot, as we have a path forward that doesn't involve a function to convert a string to a regex.

We're going to update the Pest argument rule, something like:

argument = { positional_item | keyword_item }

- positional_item = { query_arithmetic }
+ positional_item = { argument_item }

- keyword_item = { ident ~ "=" ~ query_arithmetic }
+ keyword_item = { ident ~ "=" ~ argument_item }

+ argument_item = { query_arithmetic | regex }

+ regex = { "/" ~ /* TODO */ ~ "/" }

Then we'll change the parser to convert the input into an enum:

enum ArgumentKind {
	Value(event::Value),
	Regex(regex::Regex),
}

(in actuality, ArgumentKind::Value would contain a Box<dyn Function>, as it needs to be resolved at runtime to a Value)

And finally, we update function signatures to have them either accept or reject ArgumentKind::Regex.

The only thing I'm still on the fence about is whether we should allow dynamic regex by default on anything that accepts a regex, or if each function should build its own support for dynamic regexes, by accepting both Value::Bytes and Regex as an input value for a single parameter.

Basically, should we do something like this:

enum ArgumentKind {
	Value(event::Value),
	Regex(Regex),
}

enum Regex {
	Static(regex::Regex),
	Dynamic(bytes::Bytes),
}

And have the parser accept a string as regex input, and then have it compile the regex at runtime by implementing Function for Regex and call regex::Regex::new on the slice of bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's a good approach.

I think regarding dynamic Regexes, there are functions that will accept both a String and a Regex. So, if the user is going to pass a dynamic Regex there will need to be a way for them to explicitly say that this field is a regex. Otherwise there will be no way to really know how to handle it.

eg.

.thing = replace(.field, .pattern, .with)

.pattern could be either a String or a Regex. It will be impossible to know which. There needs to be a way to disambiguate, such as:

.thing = replace(.field, regex(.pattern), .with)

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's right. I'm still not a fan of regex() though, as it still muddies the water of when a function can be used where. Again, because .foo = regex(.pattern) won't compile.

But I don't have a good alternative yet (other than inspecting the string to see if it matches"/ ... /i...", so perhaps regex() is the way forward, and we just need to document it as best as possible (or use other visual clues, such as REGEX() or Regex or input_regex or argument(regex, .pattern), or ... to distinguish input conversion functions from regular query functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true.

Perhaps the problem is that we have a function that can accept both a regex or a string as a parameter. The problem would go away if we had seperate functions for each type of parameter - replace and replace_regex?

I think we should raise an issue for this and see if others have any ideas.

Copy link
Contributor Author

@JeanMertz JeanMertz Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could have two separate optional arguments, one to replace using a regex, one using a string:

using pattern to use a static or dynamic regex:

  • replace(.foo, pattern = "/bar/i", with = "...")must be a valid regex-formatted string
  • replace(.foo, pattern = .bar, with = "...").bar must resolve to a valid regex-formatted string
  • replace(.foo, pattern = /bar/i, with = "...")compile-time checked to be a valid regex

using string (or literal, or some other term) to use a static or dynamic string:

  • replace(.foo, string = "bar", with = "...")
  • replace(.foo, string = .bar, with = "...").bar must resolve to a string

Using both — replace(.foo, pattern = .bar, string = .baz, with = "...") — would either use both to replace (with one occurring before the other) or result in a runtime error.


But yes, it's worth pulling this out into an issue, or to pose the question in the relevant "add function" issue(s).

src/mapping/parser/mod.rs Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
src/mapping/parser/grammar.pest Show resolved Hide resolved
src/mapping/parser/mod.rs Show resolved Hide resolved
src/mapping/parser/mod.rs Outdated Show resolved Hide resolved
tests/behavior/transforms/remap.toml Show resolved Hide resolved
@@ -32,12 +119,16 @@ impl Function for NotFn {
#[derive(Debug)]
pub(in crate::mapping) struct ToStringFn {
query: Box<dyn Function>,
default: Option<Value>,
default: Option<Box<dyn Function>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think trying to evaluate any constant values to literals would be a very good idea, but probably beyond the scope of this PR. It's worth considering a benchmark to see how much impact this has, though.

src/mapping/query/functions.rs Outdated Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
src/mapping/query/functions.rs Outdated Show resolved Hide resolved
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Signed-off-by: Jean Mertz <git@jeanmertz.com>
Comment on lines +205 to +216
pub fn kind(&self) -> &str {
match self {
Value::Bytes(_) => "string",
Value::Timestamp(_) => "timestamp",
Value::Integer(_) => "integer",
Value::Float(_) => "float",
Value::Boolean(_) => "boolean",
Value::Map(_) => "map",
Value::Array(_) => "array",
Value::Null => "null",
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to keep this in.

Right now it's only purpose is to have better error reports for the remap language to indicate that a certain type of argument value is not accepted for a given parameter.

I could also move this into the remap module for now, so it doesn't become part of the Value public API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better error reports are valuable to improve UX, so I'd keep this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. The only thing I wasn't sure about is if we'd want to expose this at the Value public API, or just have it live privately in the mapping module.

But I'll leave it as is for now, we can always change this after the fact.

@bruceg bruceg added transform: remap Anything `remap` transform related type: enhancement A value-adding code change that enhances its existing functionality. labels Sep 23, 2020
Copy link
Contributor

@Hoverbear Hoverbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a quick pass (sorry, I promised a review today and didn't manage it), this looks good, very happy to see the FN names out of the parser file. :)

Signed-off-by: Jean Mertz <git@jeanmertz.com>
@JeanMertz JeanMertz merged commit b40b7b6 into master Sep 24, 2020
@JeanMertz JeanMertz deleted the jean/remap-named-args branch September 24, 2020 07:44
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020
Signed-off-by: Brian Menges <brian.menges@anaplan.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
transform: remap Anything `remap` transform related type: enhancement A value-adding code change that enhances its existing functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lift named remap functions out of the parser Ensure all remap function optional arguments are named
6 participants