
feat: Add new Drain tokenizer that splits on most punctuation #13143

Merged
merged 6 commits into main from add-new-tokenizer-that-splits-aggressively on Jun 7, 2024

Conversation

@benclive (Contributor) commented Jun 5, 2024

What this PR does / why we need it:

  • Implements a new Tokenizer that splits log lines on most punctuation; hyphen (-) characters are treated as part of a single token.
  • Adds a new capability to the Tokenizer interface: an opaque state object the Tokenizer can use to tokenize a line and later rejoin the results. Here I'm returning an array of token indexes that indicate where to put spaces when joining the string (see the sketch after this list).
  • Implements a new deduplicatePlaceholders method that operates on a string instead of tokens. The token-based method stopped working once state objects were introduced, because the space indexes no longer lined up with the tokens, and I couldn't find an efficient way to handle this at the token level.
  • Take a look at the tests to see the new output: it generally produces ~10% fewer patterns for a given stream, and they tend to be higher quality (subjectively).
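
To make the state object concrete, here is a minimal sketch of how Join could consume the indexes produced by Tokenize, assuming the state is a []int of token indexes that are followed by a space (uses strings.Builder). Names and details are illustrative, not the exact code in this PR:

func (p *punctuationTokenizer) Join(tokens []string, state interface{}) string {
	spacesAfter, _ := state.([]int) // sketch assumption: indexes of tokens followed by a space
	var sb strings.Builder
	next := 0
	for i, tok := range tokens {
		sb.WriteString(tok)
		// Emit a space after this token if its index is in the state.
		if next < len(spacesAfter) && spacesAfter[next] == i {
			sb.WriteByte(' ')
			next++
		}
	}
	return sb.String()
}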

Performance-wise, this PR uses ~50% more CPU than the previous Drain implementation but makes far fewer allocations (so hopefully less GC pressure). I will continue with perf optimizations in a separate PR to try to improve this.

Data 1:
Benchmark for using the new "punctuation" tokenizer vs the old "splitting" tokenizer:

$ benchstat drain-splitting-tokenizer.txt drain-punctuation-tokenizer.txt
goos: darwin
goarch: arm64
pkg: github.com/grafana/loki/v3/pkg/pattern/drain
                                                               │ drain-splitting-tokenizer.txt │   drain-punctuation-tokenizer.txt    │
                                                               │            sec/op             │    sec/op     vs base                │
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-14                           1.650m ± 2%    2.531m ± 3%  +53.45% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-14                        115.3µ ± 1%    178.1µ ± 1%  +54.48% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-14                             255.8µ ± 0%    415.4µ ± 1%  +62.40% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-14                     4.992m ± 1%    7.667m ± 1%  +53.60% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/journald.txt-14                               1.952m ± 4%    2.746m ± 1%  +40.67% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kafka.txt-14                                  911.3µ ± 1%   1559.7µ ± 2%  +71.15% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-14                             2.090m ± 0%    2.218m ± 3%   +6.14% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/vault.txt-14                                  798.0µ ± 1%   1468.7µ ± 2%  +84.05% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/calico.txt-14                                 1.633m ± 1%    2.172m ± 2%  +32.99% (p=0.000 n=10)
geomean                                                                            1.018m         1.521m       +49.36%

                                                               │ drain-splitting-tokenizer.txt │    drain-punctuation-tokenizer.txt     │
                                                               │             B/op              │     B/op       vs base                 │
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-14                          2.046Mi ± 0%    5.766Mi ± 0%  +181.79% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-14                       163.0Ki ± 0%    380.0Ki ± 0%  +133.16% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-14                            294.1Ki ± 0%    844.6Ki ± 0%  +187.15% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-14                    6.498Mi ± 0%   16.697Mi ± 0%  +156.96% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/journald.txt-14                              2.843Mi ± 0%    5.662Mi ± 0%   +99.14% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kafka.txt-14                                 1.106Mi ± 0%    3.180Mi ± 0%  +187.58% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-14                            2.967Mi ± 0%    4.750Mi ± 0%   +60.12% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/vault.txt-14                                 970.6Ki ± 0%   3503.3Ki ± 0%  +260.93% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/calico.txt-14                                2.380Mi ± 0%    4.449Mi ± 0%   +86.93% (p=0.000 n=10)
geomean                                                                           1.327Mi         3.231Mi       +143.41%

                                                               │ drain-splitting-tokenizer.txt │   drain-punctuation-tokenizer.txt   │
                                                               │           allocs/op           │  allocs/op   vs base                │
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-14                          16.447k ± 0%   6.180k ± 0%  -62.42% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-14                        1456.0 ± 0%    675.0 ± 0%  -53.64% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-14                             3.577k ± 0%   1.299k ± 0%  -63.68% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-14                     65.27k ± 0%   20.46k ± 0%  -68.66% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/journald.txt-14                               17.73k ± 0%   10.46k ± 0%  -41.02% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kafka.txt-14                                  9.925k ± 0%   5.119k ± 0%  -48.42% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-14                            13.601k ± 0%   6.744k ± 0%  -50.42% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/vault.txt-14                                 11.047k ± 0%   4.169k ± 0%  -62.26% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/calico.txt-14                                14.991k ± 0%   8.462k ± 0%  -43.55% (p=0.000 n=10)
geomean                                                                            10.92k        4.823k       -55.85%

Data 2:
Benchmark for my custom deduplicatePlaceholders vs a solution using regexp.MustCompile("<_>+").ReplaceAllLiteralString:

$ benchstat dedup-regex.txt dedup-loops.txt
goos: darwin
goarch: arm64
                 │ dedup-regex.txt │           dedup-loops.txt           │
                 │     sec/op      │   sec/op     vs base                │
Dedup/Dedup_0-14       1.838n ± 2%   1.880n ± 0%   +2.29% (p=0.001 n=10)
Dedup/Dedup_1-14      142.60n ± 0%   15.63n ± 1%  -89.04% (p=0.000 n=10)
Dedup/Dedup_2-14      1716.0n ± 0%   142.4n ± 0%  -91.70% (p=0.000 n=10)
Dedup/Dedup_3-14       4.567n ± 0%   5.012n ± 0%   +9.75% (p=0.000 n=10)
Dedup/Dedup_4-14       197.3n ± 1%   193.1n ± 2%   -2.13% (p=0.000 n=10)
Dedup/Dedup_5-14       4.567n ± 0%   5.067n ± 0%  +10.95% (p=0.000 n=10)
Dedup/Dedup_6-14       3.490n ± 0%   3.759n ± 0%   +7.71% (p=0.000 n=10)
Dedup/Dedup_7-14      195.78µ ± 0%   10.40µ ± 6%  -94.69% (p=0.000 n=10)
Dedup/Dedup_8-14      176.15n ± 1%   23.87n ± 0%  -86.45% (p=0.000 n=10)
geomean                84.63n        29.91n       -64.66%

                 │ dedup-regex.txt │              dedup-loops.txt              │
                 │      B/op       │     B/op       vs base                    │
Dedup/Dedup_0-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_1-14      32.00 ± 0%        16.00 ± 0%    -50.00% (p=0.000 n=10)
Dedup/Dedup_2-14      32.00 ± 0%       320.00 ± 0%   +900.00% (p=0.000 n=10)
Dedup/Dedup_3-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_4-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_5-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_6-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_7-14    1.421Ki ± 0%     32.000Ki ± 0%  +2152.10% (p=0.000 n=10)
Dedup/Dedup_8-14      56.00 ± 0%        24.00 ± 0%    -57.14% (p=0.000 n=10)
geomean                          ²                    +53.84%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                 │ dedup-regex.txt │           dedup-loops.txt            │
                 │    allocs/op    │ allocs/op   vs base                  │
Dedup/Dedup_0-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_1-14      3.000 ± 0%     1.000 ± 0%  -66.67% (p=0.000 n=10)
Dedup/Dedup_2-14      3.000 ± 0%     1.000 ± 0%  -66.67% (p=0.000 n=10)
Dedup/Dedup_3-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_4-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_5-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_6-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_7-14      9.000 ± 0%     1.000 ± 0%  -88.89% (p=0.000 n=10)
Dedup/Dedup_8-14      4.000 ± 0%     1.000 ± 0%  -75.00% (p=0.000 n=10)
geomean                          ²               -47.39%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean
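
For reference, a loop-based dedup along these lines might look like the following. This is an illustrative reconstruction, assuming the placeholder is the literal "<_>"; it is not the exact code merged in this PR.

package drain

import "strings"

// deduplicatePlaceholders collapses runs of consecutive placeholders in the
// pattern string into a single placeholder, without using regexp. Sketch only.
func deduplicatePlaceholders(line, placeholder string) string {
	var sb strings.Builder
	sb.Grow(len(line))
	prevWasPlaceholder := false
	for i := 0; i < len(line); {
		if strings.HasPrefix(line[i:], placeholder) {
			if !prevWasPlaceholder {
				sb.WriteString(placeholder) // keep only the first placeholder in a run
			}
			prevWasPlaceholder = true
			i += len(placeholder)
			continue
		}
		sb.WriteByte(line[i])
		prevWasPlaceholder = false
		i++
	}
	return sb.String()
}

A fast path that returns the input unchanged when no run is found would avoid the builder allocation entirely, which is likely where the zero-alloc rows in the table above come from.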

@benclive benclive requested a review from a team as a code owner June 5, 2024 15:42
@benclive benclive requested a review from cyriltovena June 5, 2024 15:42
@@ -0,0 +1 @@
+package output
Contributor commented:
??

@@ -139,7 +141,7 @@ func DefaultConfig() *Config {
 		// MaxClusterDepth and SimTh, the less the chance that there will be
 		// "similar" clusters, but the greater the footprint.
 		SimTh:       0.3,
-		MaxChildren: 100,
+		MaxChildren: 15,
Contributor commented:

is that better ?


 type LineTokenizer interface {
-	Tokenize(line string) []string
-	Join(tokens []string) string
+	Tokenize(line string) ([]string, interface{})
Contributor commented:

I wonder if generics would work here, just a thought. I know interface have a cost when casting for instance.
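
For what it's worth, a generic version might look like this sketch; purely illustrative, since the PR keeps interface{}:

// Sketch of the generics idea: the opaque state becomes a type parameter,
// avoiding the interface{} boxing on every Tokenize call.
type LineTokenizer[S any] interface {
	Tokenize(line string) ([]string, S)
	Join(tokens []string, state S) string
}

The trade-off is that everything holding a LineTokenizer must then be generic over (or fixed to) the state type, which can ripple through the rest of the Drain code.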


func (p *punctuationTokenizer) Tokenize(line string) ([]string, interface{}) {
	tokens := make([]string, len(line))                  // Maximum size: every character is punctuation
	spacesAfter := make([]int, strings.Count(line, " ")) // Could be a bitmap, but it's not worth it for a few bytes.
Contributor commented:

You might want to use a pool for this one. Prometheus has a good sync.Pool that works in buckets.
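
As a sketch of that suggestion, a bucketed pool hand-rolled with sync.Pool might look like this (Prometheus's util/pool package provides a similar, more complete bucketed pool); the sizes and names here are illustrative, not part of this PR:

package drain

import "sync"

// Each bucket serves requests up to its size, so slices of similar
// capacity are reused instead of reallocated on every Tokenize call.
var bucketSizes = []int{64, 256, 1024, 4096}

var tokenPools = func() []*sync.Pool {
	pools := make([]*sync.Pool, len(bucketSizes))
	for i, sz := range bucketSizes {
		sz := sz // capture the per-bucket size for the closure
		pools[i] = &sync.Pool{New: func() interface{} { return make([]string, 0, sz) }}
	}
	return pools
}()

// getTokenSlice returns an empty slice with capacity of at least n.
func getTokenSlice(n int) []string {
	for i, sz := range bucketSizes {
		if n <= sz {
			return tokenPools[i].Get().([]string)[:0]
		}
	}
	return make([]string, 0, n) // bigger than any bucket: allocate directly
}

// putTokenSlice returns a slice to the pool whose bucket fits its capacity.
func putTokenSlice(s []string) {
	for i, sz := range bucketSizes {
		if cap(s) <= sz {
			tokenPools[i].Put(s[:0])
			return
		}
	}
}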

@cyriltovena (Contributor) left a comment:

LGTM

Let's try it !

@cyriltovena cyriltovena merged commit 6a0fdd0 into main Jun 7, 2024
59 checks passed
@cyriltovena cyriltovena deleted the add-new-tokenizer-that-splits-aggressively branch June 7, 2024 10:59