Add grok function #4827

Merged: 10 commits, Dec 6, 2023
Conversation

@mattnibs (Collaborator) commented Oct 26, 2023

Closes #4140
Closes #2999

@mattnibs requested a review from a team on October 26, 2023 19:23
@mattnibs force-pushed the grok branch 15 times, most recently from e00da0d to 2ad8892 on October 27, 2023 02:13
@philrz (Contributor) commented Oct 31, 2023

tl;dr

I've done some preliminary testing on this branch at commit 2ad8892. The tl;dr is that it seems like what's already working here meets a defensible definition of "doing Grok". I'd not stand in the way of shipping it as is and seeing what users do with it. That said, I did hit a few speedbumps in my testing such that if we ship what we've got, I'd feel like we'd need to highlight several caveats in the docs and be prepared for users to struggle and frequently come to Slack for help. In this regard it may be a pay-now-or-later zero-sum game where it might be worth addressing some gaps in the code now if it's not too difficult so it'll mean less user/support pain in the future. I'll dive into details in separate sections below.

Grok specification

When reading through dozens of online references, it's clear how under-specified Grok is. It's not like there's an RFC or other standard that lays out what specifically Grok is/isn't. That said, almost all pages point back at the Grok filter plugin in Logstash as if it's the reference implementation. Further digging finds a grok CLI tool written by Jordan Sissel who is basically "the Logstash guy". The first commit on that CLI tool was in 2007 while the first commit in the Logstash repo was in 2009. But the CLI tool is barely referenced in Google search results and hasn't seen a commit since 2017, so I'm going with the consensus that Logstash is the reference implementation.

Testing

After reproducing the example in the docs included in this PR, the next thing I wanted to test was using a "custom pattern" via the extra parameter. Due to an (un-)lucky coincidence, the way I went about this was to follow an example I found in this article. I figured it was a fine choice since it's written by Elastic (i.e., the people who bring us Logstash) and has a "for dummies" feel (so, our own users that don't know Grok might stumble onto it and see it as a good way to get started). Unfortunately, the example shown doesn't work as is, even before I bring the extra parameter into the picture.

$ zq -version
Version: v1.10.0-13-g2ad88925

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}", this)' -
error("grok(): no match")

After further digging, there are two things in this example that don't seem to be supported by the current implementation on this branch, resulting in a failure to parse.

  1. The trailing "data type conversion" specifications (e.g., :int).
  2. Field names containing dots (e.g., http.request.method).

If I drop the data types and change the dots to something else like underscores, now it parses.

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host_ip} %{WORD:http_request_method} %{URIPATHPARAM:url_original} %{NUMBER:http_request_bytes} %{NUMBER:event_duration} %{GREEDYDATA:my_greedy_match}", this)' -
{
    host_ip: "55.3.244.1",
    http_request_method: "GET",
    url_original: "/index.html",
    http_request_bytes: "15824",
    event_duration: "0.043",
    my_greedy_match: "other stuff"
}

Before I figured this out, I coincidentally happened to also try testing out the original pattern from that same article in https://grokdebugger.com because the more-frequently-linked-to https://grokdebug.herokuapp.com seems to be dead. Well, despite grokdebugger.com having what feels like a nice UX, it turns out it also fails completely when asked to create fields whose names contain dots, and this is disclosed in the repo's README along with other limitations.

I'd have ideally hoped to be able to tell users in our docs "feel free to educate yourself and steal from all the Grok materials already available online" and/or "we suggest using a Grok debugger to develop your patterns", but this one-two punch of unexpected findings gave me reason to pause and go down the rabbit hole about these gotchas.

Proposals

Field names containing dots

It seems there's a rich and miserable history in the ELK stack as it relates to field names that contain dots. This doc from around 2015 describes how field names containing dots were ok in Elasticsearch 1.x, then weren't ok in 2.0, but then this doc describes how it became possible again in 5.0. Also regarding Elasticsearch, it's worth pointing out that the article I referenced previously actually shows Grok being used in a modern (year 2020) context of an ingest pipeline (i.e., Logstash isn't mentioned once), and interestingly it shows the dotted field names turning directly into nested fields with no special handling.

While that's all interesting, I'm once again more interested in Logstash as the reference example. This is not only because it's the one most often referenced but also because users who need Grok (or other parsing) functionality we can't yet provide natively in Zed can still use Logstash (see #3151). Here what we see is that by default Logstash leaves the dotted field names alone. (This and other Logstash tests were performed with the latest available release, v8.10.4.)

Using config logstash-fields-with-dots.conf:

input {
  file {
    path => "/Users/phil/work/logstash/input.txt"
  }
}

filter {
  grok {
    match => { "message" =>  "%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}" }
  }
}

output {
  stdout {
    codec => "json"
  }
}

And input file input.txt:

55.3.244.1 GET /index.html 15824 0.043 other stuff

We get:

$ bin/logstash -f logstash-fields-with-dots.conf
{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff",
  "event.duration": "0.043",
  "http.request.method": "GET",
  "url.original": "/index.html",
  "@version": "1",
  "log": {
    "file": {
      "path": "/Users/phil/work/logstash/input.txt"
    }
  },
  "@timestamp": "2023-10-31T21:03:31.952526Z",
  "event": {
    "original": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
  },
  "host": {
    "name": "wpm.local"
  },
  "http.request.bytes": 15824,
  "my_greedy_match": "other stuff",
  "host.ip": "55.3.244.1"
}

However, as this 2015 article describes, right around the time Elasticsearch was reckoning with its own treatment of dotted fields, Logstash added a de_dot filter that can turn these into nested fields before producing output. So, adding this filter to create an alternate config logstash-de-dot.conf:

input {
  file {
    path => "/Users/phil/work/logstash/input.txt"
  }
}

filter {
  grok {
    match => { "message" =>  "%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}" }
  }
  de_dot {
    nested => true
  }
}

output {
  stdout {
    codec => "json"
  }
}

Now we see the nesting.

$ bin/logstash -f logstash-de-dot.conf
{
  "http": {
    "request": {
      "method": "GET",
      "bytes": 15824
    }
  },
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff",
  "event": {
    "original": "55.3.244.1 GET /index.html 15824 0.043 other stuff",
    "duration": "0.043"
  },
  "log": {
    "file": {
      "path": "/Users/phil/work/logstash/input.txt"
    }
  },
  "@version": "1",
  "@timestamp": "2023-10-31T21:08:04.107244Z",
  "my_greedy_match": "other stuff",
  "host": {
    "name": "wpm.local",
    "ip": "55.3.244.1"
  },
  "url": {
    "original": "/index.html"
  }
}

Strictly speaking, de_dot is just one of many filters in Logstash and not "part of Grok", though.

Conclusion: In theory we could maintain the current behavior of the code in this branch and require users to create dot-free field names and use downstream Zed operators like put and cut to move the values to nested fields. However, in addition to this being potentially lots of extra typing for users, the fact that there are dot-having examples out there in Google search results may set many users up for some total-failure-to-parse pain. Meanwhile, Zed has its own nest_dotted function that does the equivalent of what's shown with de_dot above. Therefore, if it's not too difficult, I'd advocate doing what Logstash's Grok does here and just letting the dot-having field names through so users can make their own decision about whether to leave them alone or apply nest_dotted to them.
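To illustrate the downstream step I have in mind, here's a minimal sketch of nest_dotted doing the equivalent of de_dot on a plain record with dotted field names (untested against this branch; I'm assuming nest_dotted behaves per its docs):

$ echo '{"host.ip":"55.3.244.1","http.request.method":"GET"}' | zq -Z 'yield nest_dotted(this)' -

I'd expect that to come back with host and http as nested records, much like the de_dot output above.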

Square brackets for nesting

The limitations disclosed in the Grok debugger README also made me hip to an alternate syntax for creating nested fields that may be used within the Grok pattern. This syntax is only shown in passing in the Logstash Grok filter doc with no attention being called to it. Further digging reveals this is because once again we're looking at a general Logstash concept for how nested fields can be referenced, i.e., not something specific to Grok but something we may run into just because there are Logstash-centric Grok examples out there that use it.

When used with Logstash, it provides a handy "a la carte" way someone might create some nested fields while leaving others as dots, e.g., with config logstash-brackets.conf:

input {
  file {
    path => "/Users/phil/work/logstash/input.txt"
  }
}

filter {
  grok {
    match => { "message" =>  "%{IP:host.ip} %{WORD:[http][request][method]} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}" }
  }
}

output {
  stdout {
    codec => "json"
  }
}

And processing the same input as before, we see:

$ bin/logstash -f logstash-brackets.conf
{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff",
  "url.original": "/index.html",
  "http.request.bytes": 15824,
  "@version": "1",
  "@timestamp": "2023-10-31T21:25:49.545440Z",
  "log": {
    "file": {
      "path": "/Users/phil/work/logstash/input.txt"
    }
  },
  "event": {
    "original": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
  },
  "host.ip": "55.3.244.1",
  "host": {
    "name": "wpm.local"
  },
  "event.duration": "0.043",
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "my_greedy_match": "other stuff"
}

Whereas with what's on this branch at the moment, we're in another fail-to-parse situation. If I start from the working example above that used underscores and use just the bracketed notation for that one field, we see:

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host_ip} %{WORD:[http][request][method]} %{URIPATHPARAM:url_original} %{NUMBER:http_request_bytes} %{NUMBER:event_duration} %{GREEDYDATA:my_greedy_match}", this)' -
error("grok(): no match")

Conclusion: Similar to the argument above regarding field names containing dots, it would be great if this syntax could create the nested fields like Logstash does just so more found examples will work out-of-the-box and users would see fewer "fail to parse" situations. That said, compared to field names containing dots, in my reading I've seen fewer examples that use the square bracket notation, so if it was deemed too much effort I'd be prepared to live without it, disclose it in the docs, and see if users notice/care.

Data type conversion

The data type conversion syntax (e.g., :int) was something I'd anticipated in my write-up in #4140. Indeed, whereas the Logstash Grok filter doc says they only support int and float right now, I'd thought a future Zed differentiator would be to support all the Zed primitive types in this syntax to cast to our many rich types. That said, since shaping is such a first class concept in Zed (and hence our users should hopefully know what to do if they see only string data coming back from our Grok function), I'd be less inclined to beg for this in a first implementation. And supporting just :int and :float to have feature parity with Logstash feels like it would be a pretty lame mini step.

Conclusion: While I'm fine living without any actual casting being triggered by this syntax, the current failure-to-parse behavior seen on the code in this branch feels hazardous, once again due to the fact that there's plenty of examples out there that use it. If not too difficult, I'd be in favor of effectively recognizing the trailing :whatever as valid syntax but then ignoring it. That would make it so more found examples at least produce output, and then we could divulge in the docs that it's currently unsupported and point to an open issue to gather interest for doing this with the Zed types. (Incidentally, the "allow but ignore" approach is what I see in the Grok debugger I've linked to a couple times above, though it also seems to do some clever things where it turns numeric values into numbers rather than leaving them as strings like Logstash does).
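For what it's worth, here's a sketch of the kind of downstream casting users could do in the meantime, building on the underscore-based example above (untested on this branch; I'm assuming Zed's usual cast-by-type-name functions here):

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host_ip} %{WORD:http_request_method} %{URIPATHPARAM:url_original} %{NUMBER:http_request_bytes} %{NUMBER:event_duration} %{GREEDYDATA:my_greedy_match}", this) | http_request_bytes := int64(http_request_bytes), event_duration := float64(event_duration)' -

I'd expect that to yield the same record as before but with http_request_bytes as an int64 and event_duration as a float64 rather than strings.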

Regular expression libraries

I'm a bit out of my league with the specifics here, but something @mattnibs and @nwt mentioned that I've also seen referenced frequently in various docs is how Logstash's implementation uses the Oniguruma regexp library whereas ours currently uses the same Go one that Zed already uses in other contexts like the regexp function. After trying to read some stuff online about the differences, much of it seems to boil down to (paraphrasing) "Oniguruma potentially performs much worse, but can do things the other libraries can't". What I don't see are write-ups that quantify the degree to which this affects real world log parsing use cases, e.g., "the following log formats simply cannot be parsed without Oniguruma!" or "look how much uglier I have to make this Go regexp compared to this beautiful Oniguruma one!" Instead I just see some version of "Waah! Waah! 👶 🍼 The patterns I've copy/pasted directly from Logstash don't work in this other Grok tool!" which is valid but I find less helpful.

Perhaps the difference I see most often cited is the first syntax for custom patterns shown in the Logstash Grok filter doc. Using the example they show there, if we have config logstash-oniguruma-named-capture.conf:

input {
  file {
    path => "/Users/phil/work/logstash/custom.txt"
  }
}

filter {
  grok {
    match => { "message" => "(?<queue_id>[0-9A-F]{10,11})" }
  }
}

output {
  stdout {
    codec => "json"
  }
}

And input custom.txt:

BEF25A72965

We get:

$ bin/logstash -f logstash-oniguruma-named-capture.conf
{
  "message": "BEF25A72965",
  "event": {
    "original": "BEF25A72965"
  },
  "@version": "1",
  "host": {
    "name": "wpm.local"
  },
  "log": {
    "file": {
      "path": "/Users/phil/work/logstash/custom.txt"
    }
  },
  "queue_id": "BEF25A72965",
  "@timestamp": "2023-10-31T21:50:24.419974Z"
}

So, ok, that's a handy shorthand. However, even that same Logstash doc goes on to describe how the equivalent is possible by defining a named custom Grok pattern and referencing it, and indeed, the same is possible with the current implementation on this branch.

$ echo '"BEF25A72965"' | zq -Z 'yield grok("POSTFIX_QUEUEID [0-9A-F]{10,11}", "%{POSTFIX_QUEUEID:queue_id}", this)' -
{
    queue_id: "BEF25A72965"
}

Incidentally, this same issue is cited in logrusorgru/grokky#4 (the repo for the code this branch's implementation is based on) and vjeantet/grok#36 (the repo I found when I wrote up #4140), so we're definitely not alone in this being a limitation. In fact, the implementation currently on this branch lines up with what's described in the second of those issues:

$ echo '"BEF25A72965"' | zq -Z 'yield grok("(?<queue_id>[0-9A-F]{10,11})", this)' -
error("grok(): error parsing regexp: invalid or unsupported Perl syntax: `(?<`")

Conclusion: @mattnibs and @nwt have already set my expectations that using an alternate regexp library like Oniguruma would be a non-trivial amount of work. Furthermore, having one regexp library for Grok and another for other parts of the language seems like it could come with its own set of hazards, and moving all Zed regexp functionality to depend on a library that's believed to perform worse is also probably a bad choice. Since Logstash's own doc shows two approaches to achieve the same result, and we support one of those approaches, I'd be fine with sticking with what we've got here, disclosing in the docs that the other is unsupported, and seeing if it turns out to be a problem. The error message currently being surfaced is perhaps not ideal, but it's specific enough that users can react to it (e.g., if we reference it in our Grok doc, a search for "perl" on the docs site would surely bring the user to the right place).

References

I read through all the following as research for the write-up above. They're all pages that come up as top links for Google searches like "grok log parsing", "grok specification", and "grok create nested fields".

@philrz (Contributor) commented Nov 3, 2023

@mattnibs: While I was proud to show my work in my exhaustive write-up above 😄, I realize the conclusions are in danger of getting lost in the text. Just to summarize, these are the changes I advocate making to what's currently in this branch:

  1. Accept fields with dots in their names. We don't need to automatically turn these into nested fields, though.
  2. Accept the square bracket syntax for creating nested fields. I'm willing to lose this if it's super hard to do, though.
  3. Accept the :int-style syntax for type conversion, but ignore it. Fields can all be string-typed and the user can use Zed casting. We could maybe do casting to Zed types with this syntax one day.
  4. When a string fails to parse via Grok, wrap it in the error value. That way users can study/repair later (see the sketch after this list).
  5. Move the extra parameter to the end. This point from @nwt makes sense to me, as it's what I see with optional parameters in other languages.
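On point 4, part of why I like the error wrapping: it would let users set the failures aside for later study with something like the following (a sketch, untested; assuming the no-match case comes back as an error value as shown above):

$ echo '"not an ip"' | zq -Z 'yield grok("%{IP:host_ip}", this) | where is_error(this)' -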

@mattnibs (Collaborator, Author) commented Nov 4, 2023

Thanks for the detailed writeup, @philrz

Accept fields with dots in their names. We don't need to automatically turn these into nested fields, though.
Accept the square bracket syntax for creating nested fields. I'm willing to lose this if it's super hard to do, though.

I'll make sure all possible characters are allowed in the name; however, in order to pare down the size of an already large PR, I'm going to punt on that handling to another PR.

Accept the :int-style syntax for type conversion, but ignore it.

For now I'm just going to have any type hints as part of the resultant field name.

When a string fails to parse via Grok, wrap it in the error value.
Move the extra parameter to the end.

Will fix these.

@nwt (Member) commented Nov 9, 2023

For now I'm just going to have any type hints as part of the resultant field name.

Ignoring or honoring type hints strikes me as MVP functionality so I think we ought to do that here.

@mattnibs requested review from nwt and philrz on November 15, 2023 18:23
@mattnibs force-pushed the grok branch 5 times, most recently from cb6feea to 5c113be on November 15, 2023 18:37
@philrz (Contributor) commented Nov 15, 2023

@mattnibs: docs/language/functions/README.md needs an entry that links to the new grok.md.

@philrz (Contributor) commented Nov 16, 2023

👍

I've retested things with the branch now at cc68962 and everything I cited above seems to be well-addressed. I'm therefore functionally 👍 on this and happy to see it merged once someone in Dev signs off on the code. Once merged, I'll eventually follow up with a PR of my own to add some more examples to the docs and link to some other Grok resources to help users understand our remaining limitations and gather demand for possible future improvements.


Details

Just to summarize how I see things have settled, revisiting my checklist from above, the example from this Elastic article now works out-of-the-box, which is great!

$ zq -version
Version: v1.11.1-3-gcc689620

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}", this)' -
{
    "host.ip": "55.3.244.1",
    "http.request.method": "GET",
    "url.original": "/index.html",
    "http.request.bytes": "15824",
    "event.duration": "0.043",
    my_greedy_match: "other stuff"
}

That shows these two were taken care of:

Accept the :int-style syntax for type conversion, but ignore it.
Accept fields with dots in their names.

On that latter one, I had high hopes of being able to then apply nest_dotted(), but it sadly fails. I've opened a separate issue #4884 to plead the case that maybe we should change that.

For this one:

Accept the square bracket syntax for creating nested fields.

Rather than the previous "no match" error, the parsing now succeeds and the square brackets are present in the field name.

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host_ip} %{WORD:[http][request][method]} %{URIPATHPARAM:url_original} %{NUMBER:http_request_bytes} %{NUMBER:event_duration} %{GREEDYDATA:my_greedy_match}", this)' -
{
    host_ip: "55.3.244.1",
    "[http][request][method]": "GET",
    url_original: "/index.html",
    http_request_bytes: "15824",
    event_duration: "0.043",
    my_greedy_match: "other stuff"
}

That seems like a fine improvement, since if the user expected it to work and has a "WTF?" moment, it should lead them to the docs, where I can list it as a known limitation and link to an open issue to see if enough people notice/care for us to add that as an enhancement later.

Then for this one:

When a string fails to parse via Grok, wrap it in the error value.

...I can repeat the failure of the "named capture" syntax we don't support and see the wrapping of the original text in the error. Great!

$ echo '"BEF25A72965"' | zq -Z 'yield grok("(?<queue_id>[0-9A-F]{10,11})", this)' -
error({
    message: "grok(): error parsing regexp: invalid or unsupported Perl syntax: `(?<`",
    on: "(?<queue_id>[0-9A-F]{10,11})"
})

Finally, this one:

Move the extra parameter to the end.

This is not the most sophisticated example, but I tested this by making a custom pattern called FOO that's just a reference to the existing IP one. 👍

$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{FOO:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}", this, "FOO %{IP}")' -
{
    "host.ip": "55.3.244.1",
    "http.request.method": "GET",
    "url.original": "/index.html",
    "http.request.bytes": "15824",
    "event.duration": "0.043",
    my_greedy_match: "other stuff"
}

mattnibs and others added 3 commits December 5, 2023 16:39
Co-authored-by: Phil Rzewski <phil@brimdata.io>
Co-authored-by: Noah Treuhaft <noah.treuhaft@gmail.com>
Co-authored-by: Phil Rzewski <phil@brimdata.io>
mattnibs and others added 2 commits December 6, 2023 10:36
Co-authored-by: Noah Treuhaft <noah.treuhaft@gmail.com>