-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore trailing spaces in CEF messages #17253
Conversation
This patch updates the ragel state machine to skip trailing whitespace at the end of CEF messages. Some CEF exporters, Check Point for example, have been observed to add a trailing whitespace to CEF messages: > "CEF:0:| [...] src=127.0.0.1 " Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address. For maximizing compatibility, we also want to ignore other kinds of whitespace characters (new line, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines. Whitespace in non-final extensions is preserved, as the CEF standard permits (but discourages) it's use in non-final extensions.
Pinging @elastic/siem (Team:SIEM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see you got it working! Two things I'm curious about. 1) Maybe run benchcmp on this to see how/if the result changed. 2) How does this compare vs a TrimRight solution in a benchmark.
extensions = " "* extension (" " extension)*; | ||
extension_value = (space* extension_value_chars_nospace @extension_value_mark)* >extension_value_start $err(extension_err); | ||
extension = extension_key equal extension_value; | ||
extensions = " "* extension (space* " " extension)* space* %/extension_eof; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised that %/extension_eof
worked for each individual extension, rather than just the final one. I'm definitely no Ragel expert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I guess I didn't explain myself properly. extension_eof worked as expected. It's just that it captures a value until the last point saved by @extension_value_mark
while extension_key saves the previous value up to the start of the separator. This made no difference before, but now it helps to capture trailing spaces in non-final extensions and to drop it in the final extension.
I had to move it to a different place here to adjust for the case where a msg is padded and EOF is never reached by extension_value due to it not accepting trailing spaces.
@andrewkroh I ran some benchmarks with count=10 and comparing with benchstat. All implementations are compared against the original state machine that doesn't trim. Modified SM (this PR)
strings.TrimRight [changes]
Custom Loop [changes]
Dedicated state-machine [changes]Did this one just for the sake of it.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for running the comparison. I didn't know about benchstat 😄 .
LGTM, but needs a changelog.
This patch updates the ragel state machine to skip trailing spaces at the end of CEF messages. Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages: > "CEF:0:| [...] src=127.0.0.1 " Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address. For maximizing compatibility, we also want to ignore other kinds of space characters (new line, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines. Spaces in non-final extensions are preserved, as the CEF standard permits (but discourages) it's use in non-final extensions. (cherry picked from commit a1ec55a)
This patch updates the ragel state machine to skip trailing spaces at the end of CEF messages. Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages: > "CEF:0:| [...] src=127.0.0.1 " Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address. For maximizing compatibility, we also want to ignore other kinds of space characters (new line, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines. Spaces in non-final extensions are preserved, as the CEF standard permits (but discourages) it's use in non-final extensions. (cherry picked from commit a1ec55a)
This patch updates the ragel state machine to skip trailing spaces at the end of CEF messages. Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages: > "CEF:0:| [...] src=127.0.0.1 " Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address. For maximizing compatibility, we also want to ignore other kinds of space characters (new line, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines. Spaces in non-final extensions are preserved, as the CEF standard permits (but discourages) it's use in non-final extensions. (cherry picked from commit a1ec55a)
This patch updates the ragel state machine to skip trailing spaces at the end of CEF messages. Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages: > "CEF:0:| [...] src=127.0.0.1 " Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address. For maximizing compatibility, we also want to ignore other kinds of space characters (new line, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines. Spaces in non-final extensions are preserved, as the CEF standard permits (but discourages) it's use in non-final extensions. (cherry picked from commit a1ec55a)
This patch updates the ragel state machine to skip trailing spaces at the end of CEF messages. Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages: > "CEF:0:| [...] src=127.0.0.1 " Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address. For maximizing compatibility, we also want to ignore other kinds of space characters (new line, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines. Spaces in non-final extensions are preserved, as the CEF standard permits (but discourages) it's use in non-final extensions. (cherry picked from commit ee5d5bd)
What does this PR do?
This patch updates the Ragel state machine to skip trailing space at the end of CEF messages.
Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages:
Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address:
For maximizing compatibility, we also want to ignore other kinds of white-space characters (newline, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines.
Trailing space in other extensions' values is preserved, as the CEF standard permits (but discourages) it's use in non-final extensions.
Why is it important?
We've observed users experiencing this problem, trying to fix it (unsuccessfully) with a processor:
Why a draft?
While the current solution is complete, I'm not satisfied of the changes that introduces to the Ragel SM definition.
Originally I wanted to get something more elegant like this to work:
So that we can keep the
extension
machine as-is, and add a specializedfinal_extension
machine that will disallow trailing space. However, I didn't manage to get this kind of pattern to work. It requires rewriting a lot of the capture actions to allow for the necessary backtracking, and I wasn't confident enough to get that right without introducing more problems than I was solving, or dedicating too many hours to this fix, due to my limited experience with ragel.The current solution works by accident. I decided to try a different approach starting by disallowing trailing space in all extensions, and found out that it works as I wanted and captures white-space as value in all extensions but the last. This is a side effect of how an extension value is captured differently in
extension_key
vsextension_eof
. The former captures the previous value up tomark-1
, that is the start of the current key minus the space separator. The later captures up toextValueEnd
, which is not incremented for trailing space.The current machine definition is an ugly mix of
" "
andspace
usage, because we want to have the space character as the only separator, but trim al trailing space:[\t\v\f\n\r ]