
feat: Generic logline placeholder replacement and tokenization #12799

Merged: 4 commits merged into main on Apr 26, 2024

Conversation

@na-- (Member) commented on Apr 26, 2024

What this PR does / why we need it:

This code allows us to preprocess generic logs and replace highly variable dynamic data (timestamps, IPs, numbers, UUIDs, hex values, byte sizes, and durations) with static placeholders, for easier pattern extraction and more efficient, user-friendly matching by the Drain algorithm.

Additionally, there is logic that splits generic log lines into discrete tokens that can be used with Drain, with better results than naively splitting the logs on every whitespace. The tokenizer counts quotes and emits a quoted string as part of a single token. It also handles likely JSON logs that contain no whitespace at all by trying to split `{"key":value}` pairs (without actually parsing the JSON).

All of this is done without regular expressions and without parsing the log lines in any specific format. That's why it is very efficient in terms of CPU usage and allocations, and it should handle all log formats and unformatted logs equally well.
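
To make the idea concrete, here is a minimal, self-contained sketch (not the PR's actual API; `replaceNumbers` and `tokenize` are hypothetical names) of the two building blocks: byte-by-byte placeholder replacement without regexes, and quote-aware whitespace tokenization:

```go
package main

import (
	"fmt"
	"strings"
)

// replaceNumbers is a hypothetical, heavily simplified stand-in for the
// replacer: it scans the line byte by byte (no regexes) and swaps every run
// of ASCII digits for a static "<NUM>" placeholder.
func replaceNumbers(line string) string {
	var out strings.Builder
	for i := 0; i < len(line); {
		if line[i] >= '0' && line[i] <= '9' {
			for i < len(line) && line[i] >= '0' && line[i] <= '9' {
				i++
			}
			out.WriteString("<NUM>")
			continue
		}
		out.WriteByte(line[i])
		i++
	}
	return out.String()
}

// tokenize is a hypothetical sketch of quote-aware splitting: whitespace
// inside a double-quoted string does not start a new token.
func tokenize(line string) []string {
	var tokens []string
	var cur strings.Builder
	inQuotes := false
	for i := 0; i < len(line); i++ {
		c := line[i]
		switch {
		case c == '"':
			inQuotes = !inQuotes
			cur.WriteByte(c)
		case (c == ' ' || c == '\t') && !inQuotes:
			if cur.Len() > 0 {
				tokens = append(tokens, cur.String())
				cur.Reset()
			}
		default:
			cur.WriteByte(c)
		}
	}
	if cur.Len() > 0 {
		tokens = append(tokens, cur.String())
	}
	return tokens
}

func main() {
	line := `status=500 took=125 msg="upstream timed out after 30 seconds"`
	replaced := replaceNumbers(line)
	fmt.Println(replaced)
	// status=<NUM> took=<NUM> msg="upstream timed out after <NUM> seconds"
	fmt.Println(len(tokenize(replaced)))
	// 3 tokens: the whole quoted msg=... value stays a single token
}
```

The real replacer of course covers many more shapes (timestamps, IPs, UUIDs, hex values, durations) and also splits whitespace-free JSON key/value pairs.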

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@na-- requested a review from a team as a code owner on April 26, 2024 at 07:28
@benclive (Contributor) left a comment:

Tough to review, but as far as I can tell you've already covered the vast majority of test cases, and it seems to work well. I have some concerns about maintainability, but I could follow along and it is well commented, so I think it's as good as it can be.
The rest of my comments are relatively minor, so I'm happy to approve this. Nice job, very impressive implementation!

pkg/pattern/tokenization/tokenization.go (outdated)
var boundaryChars = [256]bool{}

func init() {
	for i := 0; i < 128; i++ { // for now, we don't consider non-ASCII chars as boundary
Contributor:

the array is length 256 but the iterator only goes up to 128. Is that intentional?

Contributor:

After reading the rest of the code, it looks like it is intentional. I assume this is a performance optimization?

@na-- (Member, Author) commented on Apr 26, 2024:

Yeah, I commented with part of my reasoning on the line here, that "for now, we don't consider non-ASCII chars as boundary". And the reason for that, as you've surmised, is mostly performance, but also avoiding extra complexity for arguably very little benefit.

Processing the string rune by rune would probably be significantly more computationally expensive than going byte by byte, so I think we should avoid it if we can. A lot of Go string-processing functions do the same: they have a happy path that works with bytes when possible; just search for utf8.RuneSelf (which is 128) in the strings.go file from the Go stdlib 😅

For our use case, while there are quite a few non-ASCII whitespace and non-word characters in Unicode we could consider as "boundary" characters, it's highly unlikely for any of them to be used in logs in a way that matters for the purposes of this code 🤞 If they are used, it's likely going to be in some quoted strings like msg="unicode “съобщение” 😊", not in the actual log message structure.

And in the vanishingly few cases where that assumption doesn't hold, it's unlikely that this code would result in a worse outcome than if we just used Drain + whitespace tokenization without it, so it shouldn't hurt to have it 😅

In other words, I'd prefer to avoid dealing with that mess unless we see that we really need to 😅
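
As a rough illustration of this trade-off (a sketch, not the PR's code; isASCIIBoundary and the rune-aware variant are made-up names), the byte path needs one range check per byte, while the Unicode-aware path has to decode runes and consult the unicode tables:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isASCIIBoundary mirrors the "happy path": one comparison per byte, treating
// every non-alphanumeric ASCII byte as a boundary and every non-ASCII byte as
// a regular (non-boundary) character.
func isASCIIBoundary(c byte) bool {
	if c >= utf8.RuneSelf { // >= 128: not ASCII, don't treat as boundary
		return false
	}
	return !(c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z')
}

// countBoundariesRuneAware is the slower alternative: decode every rune and
// consult the Unicode tables.
func countBoundariesRuneAware(s string) int {
	n := 0
	for _, r := range s { // rune decoding on every iteration
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			n++
		}
	}
	return n
}

func main() {
	s := `msg="unicode съобщение"`
	byteCount := 0
	for i := 0; i < len(s); i++ { // byte loop: no rune decoding
		if isASCIIBoundary(s[i]) {
			byteCount++
		}
	}
	// For this line both approaches agree (4 boundaries), which is the
	// typical outcome for ordinary log lines.
	fmt.Println(byteCount, countBoundariesRuneAware(s))
}
```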

} else if n[1] == 's' && (n[0] == 'm' || n[0] == 'n' || n[0] == 'u') && boundaryChars[n[2]] {
	secsLen = 2
} else if n[2] == 's' && ((n[0] == 0xC2 && n[1] == 0xB5) || (n[0] == 0xCE && n[1] == 0xBC)) && boundaryChars[n[3]] {
	// This checks for the unicode "µs" (U+00B5 = micro symbol) and "μs" (U+03BC = Greek letter mu)
Contributor:

According to https://pkg.go.dev/time#ParseDuration, only one of these is actually valid. But they look identical so I guess it's good to support both.

Member (Author):

That's where I actually got both from, see this 😅
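
For reference, a tiny standalone check (not part of the PR) showing that Go's time.ParseDuration does accept both spellings, alongside the plain ASCII "us":

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// U+00B5 (micro sign), U+03BC (Greek small mu) and plain ASCII "us"
	// are all accepted as microsecond units by the stdlib's parser.
	for _, s := range []string{"150µs", "150μs", "150us"} {
		d, err := time.ParseDuration(s)
		fmt.Println(s, d, err) // each prints 150µs with a nil error
	}
}
```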

pkg/pattern/tokenization/tokenization.go (outdated)
[]byte("Saturday"),
}

func (r *replacer) handleWeirdTimestamp() (endsWithBoundary bool) {
Contributor:

Not useful: I love this method name, but I was secretly hoping for handleInsaneTimestamp!

// - "Mon Jan 2 15:04:05 2006"
// - "Mon Jan 2 15:04:05 MST 2006"
// - "Mon Jan 02 15:04:05 -0700 2006"
if r.advanceChar(' ') && r.advanceMonthName() && r.advanceChar(' ') && (r.advanceChar(' ') || true) && r.advanceUintUpTo(31, 2) && r.advanceChar(' ') && r.advanceTime(false) && r.advanceStringOrNumericTimeZone(true) && r.advanceChar(' ') && r.advanceYear() {
Contributor:

&& (r.advanceChar(' ') || true) this is clever 😻

Member (Author):

One of the potential minor performance improvements I want to benchmark at some future point is actually getting rid of advanceOneOf. We only use it in a couple of places, here:

return r.advanceOneOf('+', '-') && r.advanceUintUpTo(14000, 5)

and here:
if r.advanceUintUpTo(12, 2) && r.advanceChar(delim) && r.advanceUintUpTo(31, 2) && r.advanceOneOf('T', ' ') && r.advanceTime(false) {

And both of these can be easily rewritten like this:

(r.advanceChar('+') || r.advanceChar('-'))
(r.advanceChar('T') || r.advanceChar(' '))

I'll actually comment on a few other potential minor performance improvements I've thought of but haven't had time to benchmark, just so I don't forget them in the future 😅

@na-- (Member, Author) commented on Apr 26, 2024:

Sorry if this makes the code review harder, but I decided that now was probably the best moment to split the tokenizer and replacer code into different files with c62c6cf; doing so after merging this PR would have just added more noise.

(No code was actually changed in that commit, just moved around)

break
}
length++
val = val*10 + uint(c-'0') // TODO: use bitwise trick?
Member (Author):

This is what I meant by "bitwise trick": https://www.guidodiepen.nl/2017/03/clever-bit-tricks-with-ascii/#integervaluesinascii

I doubt it's actually faster on modern CPU architectures (and it's certainly less readable), but worth benchmarking to confirm, just in case 😅
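
For reference, a small sketch of the trick from that article: ASCII digits occupy 0x30–0x39, so masking with 0x0F yields the numeric value and the subtraction can be replaced by a bitwise AND (whether that is actually faster is exactly what would need benchmarking):

```go
package main

import "fmt"

func main() {
	for c := byte('0'); c <= '9'; c++ {
		// '0' is 0x30 ... '9' is 0x39, so the low nibble is the digit itself.
		fmt.Printf("%c %d %d\n", c, c-'0', c&0x0F) // the last two columns match
	}

	// The accumulation step from the snippet above, written both ways:
	val1, val2 := uint(0), uint(0)
	for _, c := range []byte("40587") {
		val1 = val1*10 + uint(c-'0')
		val2 = val2*10 + uint(c&0x0F)
	}
	fmt.Println(val1, val2) // 40587 40587
}
```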

if pPos != -1 && !boundaryChars[r.source[pPos]] {
	continue
}
if bytes.HasPrefix(r.source[pPos+1:], key) {
Member (Author):

This probably can be optimized: the numPreservingKeys values can be of a string type, and if this check is something like string(r.source[pPos+1:pPos+1+len(key)]) == key, it might be faster. This should not result in an actual string allocation because of the Go compiler optimizations for cases like this. Besides, this is basically what bytes.HasPrefix does internally as well, just with an extra bounds check and function calls... 😅 Which are probably inlined, but it can't hurt to compare the values directly 🤷‍♂️
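
A sketch of the comparison being suggested (hypothetical names, not the PR's code); the compiler can generally perform the string(b) == key comparison without materializing a temporary string, since the converted value does not escape:

```go
package main

import (
	"bytes"
	"fmt"
)

// hasKeyAt is a hypothetical helper mirroring the idea above: does source
// contain key starting at position pos?
func hasKeyAt(source []byte, pos int, key string) bool {
	if pos+len(key) > len(source) {
		return false
	}
	// The conversion below does not escape, so the compiler can compare the
	// bytes in place instead of allocating a temporary string.
	return string(source[pos:pos+len(key)]) == key
}

func main() {
	src := []byte(`status_code=200 latency=12ms`)
	fmt.Println(hasKeyAt(src, 0, "status"))                   // true
	fmt.Println(hasKeyAt(src, 16, "latency"))                 // true
	fmt.Println(bytes.HasPrefix(src[16:], []byte("latency"))) // same check via bytes.HasPrefix
}
```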

Comment on lines +710 to +713
if bytes.HasPrefix(backtrackedSlice, dw) {
	matchedDay = dw
	break
}
@na-- (Member, Author) commented on Apr 26, 2024:

If longDayNames were a map[[3]byte] or map[string], the lookup might be a bit faster if we can do it without an allocation, though I'm not sure that matters a lot for a slice of 7 elements 😅 Still, while this specific part of the code is probably executed fairly rarely, it couldn't hurt to benchmark 😅
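
For what it's worth, a hedged sketch of the map[string] variant (hypothetical names, and keyed by 3-letter prefixes rather than the full day names the real code matches); the point is that m[string(b)] map lookups are another case the compiler handles without allocating:

```go
package main

import "fmt"

// dayPrefixes is a hypothetical map-based alternative to scanning a slice of
// 7 []byte values; the keys are the 3-letter prefixes of the day names.
var dayPrefixes = map[string]bool{
	"Mon": true, "Tue": true, "Wed": true, "Thu": true,
	"Fri": true, "Sat": true, "Sun": true,
}

func isDayPrefix(b []byte) bool {
	if len(b) < 3 {
		return false
	}
	// m[string(b)] is a pattern the compiler optimizes: no temporary string
	// is allocated just for the lookup.
	return dayPrefixes[string(b[:3])]
}

func main() {
	fmt.Println(isDayPrefix([]byte("Saturday, 26 Apr"))) // true
	fmt.Println(isDayPrefix([]byte("2024-04-26")))       // false
}
```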

Comment on lines +30 to +31
if i < '0' || (i > '9' && i < 'A') || (i > 'Z' && i < 'a') || i > 'z' {
	boundaryChars[i] = true
Member (Author):

I am not certain whether the array lookups like if boundaryChars[c] { /* do something */ } in the code below are actually faster than if we had a simple function like this:

func isBoundary(c byte) bool {
	return c < '0' || (c > '9' && c < 'A') || (c > 'Z' && c < 'a') || (c > 'z' && c < 128)
}

I suspect the function will be faster because of how CPU caching works, but I haven't actually benchmarked it. I adopted this approach originally because I had a couple of exceptions here (e.g. I didn't consider _ to be a boundary character initially).
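
Since neither variant has been benchmarked yet, here is a sketch of how the comparison could be set up with testing.B (stand-in names boundaryTable and isBoundaryFunc, not the PR's identifiers):

```go
package tokenization_test

import "testing"

var boundaryTable = [256]bool{}

func init() {
	for i := 0; i < 128; i++ {
		c := byte(i)
		if c < '0' || (c > '9' && c < 'A') || (c > 'Z' && c < 'a') || c > 'z' {
			boundaryTable[c] = true
		}
	}
}

func isBoundaryFunc(c byte) bool {
	return c < '0' || (c > '9' && c < 'A') || (c > 'Z' && c < 'a') || (c > 'z' && c < 128)
}

var sampleLine = []byte(`ts=2024-04-26T07:28:00Z level=info msg="processed 1234 items in 56ms"`)

func BenchmarkBoundaryTable(b *testing.B) {
	n := 0
	for i := 0; i < b.N; i++ {
		for _, c := range sampleLine {
			if boundaryTable[c] {
				n++
			}
		}
	}
	_ = n
}

func BenchmarkBoundaryFunc(b *testing.B) {
	n := 0
	for i := 0; i < b.N; i++ {
		for _, c := range sampleLine {
			if isBoundaryFunc(c) {
				n++
			}
		}
	}
	_ = n
}
```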


// We weren't at a boundary and now we are, so check if we match one
// of the special keywords that will preserve numbers
if onBoundary {
Contributor:

Could we also add another check to avoid the extra processing here? Something like r.cur - r.tail > minNumPreservingKey.

Member (Author):

Hmm 🤔 Maybe, though I am not sure if this will always hold. For example, the _ in status_code= is technically a boundary character, so it's not inconceivable for status to already be "flushed" when we reach the =.

@na-- (Member, Author) commented on Apr 26, 2024:

Yeah, I just stepped through the status_code=123 example with a debugger, and that's exactly what happens: because _ is a boundary and c is a potential start of a hex string, we encounter this branch and copy status_ to the r.dest slice (so r.cur - r.tail becomes 0) 😞

Let's keep this thread open though, as a potential performance improvement along with the other ones I opened, because that's an implementation detail that's fixable. If profiling proves that this would make a big difference, we can change that behavior to be less aggressive about copying the source string.

@ashwanthgoli (Contributor) left a comment:

LGTM 🚀 thanks again for this @na--

Not for this PR, but it might be worth having separate tests for the replacer and the tokeniser; that would help narrow down the problem if we see any bugs down the line.

@na-- merged commit 4047902 into main on Apr 26, 2024
58 checks passed
@na-- deleted the ned/log-placeholder-replacement-and-tokenization branch on April 26, 2024 at 14:42