-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add grok function #4827
Add grok function #4827
Conversation
e00da0d
to
2ad8892
Compare
tl;drI've done some preliminary testing on this branch at commit 2ad8892. The tl;dr is that it seems like what's already working here meets a defensible definition of "doing Grok". I'd not stand in the way of shipping it as is and seeing what users do with it. That said, I did hit a few speedbumps in my testing such that if we ship what we've got, I'd feel like we'd need to highlight several caveats in the docs and be prepared for users to struggle and frequently come to Slack for help. In this regard it may be a pay-now-or-later zero-sum game where it might be worth addressing some gaps in the code now if it's not too difficult so it'll mean less user/support pain in the future. I'll dive into details in separate sections below. Grok specificationWhen reading through dozens of online references, it's clear how under-specified Grok is. It's not like there's an RFC or other standard that lays out what specifically Grok is/isn't. That said, almost all pages point back at the Grok filter plugin in Logstash as if it's the reference implementation. Further digging finds a TestingAfter reproducing the example in the docs included in this PR, the next thing I wanted to test was using a "custom pattern" via the
After further digging, there's two things in this example that don't seem to be supported with the current implementation in this branch, resulting in a failure to parse.
If I drop the data types and change the dots to something else like underscores, now it parses.
Before I figured this out, I coincidentally happened to also try testing out the original pattern from that same article in https://grokdebugger.com because the more-frequently-linked-to https://grokdebug.herokuapp.com seems to be dead. Well, despite grokdebugger.com having what feels like a nice UX, it turns out it also fails completely when asked to create fields whose names contain dots, and this is disclosed in the repo's README along with other limitations. I'd have ideally hoped to be able to tell users in our docs "feel free to educate yourself and steal from all the Grok materials already available online" and/or "we suggest using a Grok debugger to develop your patterns", but this one-two punch of unexpected findings gave me reason to pause and go down the rabbit hole about these gotchas. ProposalsField names containing dotsIt seems there's a rich and miserable history in ELK stack as relates to field names that contain dots. This doc from around 2015 describes how field names containing dots were ok in Elasticsearch 1.x, then weren't ok in 2.0, but then this doc describes how it became possible again in 5.0. Also regarding Elasticsearch, it's worth pointing out how this article I referenced previously actually showed Grok being used in a modern (year 2020) context of an ingest pipeline (i.e., Logstash isn't mentioned once) and interestingly it shows the dotted field names turning directly into nested fields with no special handling. While that's all interesting, I'm once again more interested in Logstash as the reference example. This is not only because it's the one most often referenced but also because users that need Grok (or other parsing) functionality we can't yet provide natively in Zed can still use Logstash (see #3151). Here what we see is that by default Logstash leaves the dotted field names alone. (This and other Logstash tests are being performed with the current latest available, which is v8.10.4). Using config
And input file
We get:
However, this 2015 article (right around the time Elasticsearch was reckoning with its own treatment of dotted fields) Logstash added a
Now we see the nesting.
Strictly speaking, Conclusion: In theory we could maintain the current behavior of the code in this branch and require users to create dot-free field names and use downstream Zed operators like Square brackets for nestingThe limitations disclosed in the Grok debugger README also made me hip to an alternate syntax for creating nested fields that may be used within the Grok pattern. This syntax is only shown in passing in the Logstash Grok filter doc with no attention being called to it. Further digging reveals this is because once again we're looking at a general Logstash concept for how nested fields can be referenced, i.e., not something specific to Grok but that we may run into just because there's Logstash-centric Grok examples out there that use it. When used with Logstash, it provides a handy "a la carte" way someone might create some nested fields while leaving others as dots, e.g., with config
And processing the same input as before, we see:
Whereas with what's on this branch at the moment, we're in another fail-to-parse situation. If I start from the working example above that used underscores and use just the bracketed notation for that one field, we see:
Conclusion: Similar to the argument above regarding field names containing dots, it would be great if this syntax could create the nested fields like Logstash does just so more found examples will work out-of-the-box and users would see fewer "fail to parse" situations. That said, compared to field names containing dots, in my reading I've seen fewer examples that use the square bracket notation, so if it was deemed too much effort I'd be prepared to live without it, disclose it in the docs, and see if users notice/care. Data type conversionThe data type conversion syntax (e.g., Conclusion: While I'm fine living without any actual casting being triggered by this syntax, the current failure-to-parse behavior seen on the code in this branch feels hazardous, once again due to the fact that there's plenty of examples out there that use it. If not too difficult, I'd be in favor of effectively recognizing the trailing Regular expression librariesI'm a bit out of my league with the specifics here, but something @mattnibs and @nwt mentioned that I've also seen referenced frequently in various docs is how Logstash's implementation uses the Oniguruma regexp library whereas ours currently uses the same Go one that Zed already uses in other contexts like the Perhaps the difference I see most often cited is the first syntax for custom patterns shown in the Logstash Grok filter doc. Using the example they show there, if we have config
And input
We get:
So, ok, that's a handy shorthand. However, even that same Logstash doc goes on to describe how the equivalent is possible by defining a named custom Grok pattern and referencing it, and indeed, the same is possible with the current implementation on this branch.
Incidentally, this same issue is cited in logrusorgru/grokky#4 (which is from the repo for the code the implementation this branch is based on) and vjeantet/grok#36 (which is from the repo I found when I wrote up #4140) so we're definitely not alone in this being a limitation. In fact, the implementation currently on this branch lines up with what's described in second of those quoted issues:
Conclusion: @mattnibs and @nwt have already set my expectations that using an alternate regexp library like Oniguruma would be a non-trivial amount of work. Furthermore, having one regexp library for Grok and another for other parts of the language seems like it could come with its own set of hazards, and moving all Zed regexp functionality to depend on a library that's believed to perform worse is also probably a bad choice. Since Logstash's own doc shows two approaches to achieve the same result, and we support one of those approaches, I'd be fine in sticking with what we've got here and disclosing in the docs that the other is unsupported and see if it turns out to be a problem. The error message currently being surfaced is perhaps not ideal, but it's specific enough that users can react to it (e.g., if we reference it in our Grok doc, a search for "perl" on the docs site would surely bring the user to the right place). ReferencesI read through all the following as research for the write-up above. They're all pages that come up as top links for Google searches like "grok log parsing", "grok specification", and "grok create nested fields".
|
@mattnibs: While I was proud to show my work in my exhaustive write-up above [ 😄 ] I realize the conclusions are in danger of getting lost in the text. Just to summarize the changes I advocate making to what's currently in this branch:
|
Thanks for the detail writeup @philrz
I'll make sure all characters possible are allowed as the name, however in order to pair down the size of an already large pr I'm going to punt on handling to another pr.
For now I'm just going to have any type hints as part of the resultant field name.
Will fix these. |
Ignoring or honoring type hints strikes me as MVP functionality so I think we ought to do that here. |
cb6feea
to
5c113be
Compare
@mattnibs: |
👍I've retested things with the branch now at cc68962 and everything I cited above seems to be well-addressed. I'm therefore functionally 👍 on this and happy to see it merged once someone in Dev signs off on the code. Once merged, I'll eventually follow up with a PR of my own to add some more examples to the docs and link to some other Grok resources to help users understand our remaining limitations and gather demand for possible future improvements. DetailsJust to summarize how I see things have settled, revisiting my checklist from above, the example from this Elastic article now works out-of-the-box, which is great!
That shows these two were taken care of:
On that latter one, I had high hopes of being able to then apply For this one:
Rather than the previous
That seems like a fine improvement since if the user expected it to work and has a "WTF?" moment it should lead them to the docs where I can list it as a known limitation and link to an open issue seeing if enough people notice/care for us to add that as an enhancement later. Then for this one:
...I can repeat the failure of the "named capture" syntax we don't support and see the wrapping of the original text in the
Finally, this one:
This is not the most sophisticated example, but I tested this by making custom pattern called
|
Closes #4140
Conflicts: runtime/expr/function/function.go
This matches name of the original file in github.com/logrusorgru/grokky.
It contains more patterns than https://github.com/vjeantet/grok/blob/master/patterns/grok-patterns.
Co-authored-by: Phil Rzewski <phil@brimdata.io>
Co-authored-by: Noah Treuhaft <noah.treuhaft@gmail.com>
Co-authored-by: Phil Rzewski <phil@brimdata.io>
Co-authored-by: Noah Treuhaft <noah.treuhaft@gmail.com>
Closes #4140
Closes #2999