This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

25x speed improvement in tokenizing jquery.min: maxCachedIndex should not decrease due to a cache miss #40

Merged
merged 2 commits into atom:v4.x on Sep 21, 2015

Conversation

alexdima
Contributor

@kevinsawicki @zcbenz I would be most thankful if you could take a look at this PR, as both VSCode and Atom would benefit from it :).

TL;DR
maxCachedIndex is a safeguard against uninitialized access to results (it guarantees the regex at a given index has been tried at least once). It should therefore only ever grow in the Search method, since it is correctly reset to -1 in the Clear method.
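
To make the invariant concrete, here is a minimal sketch with paraphrased names (not the literal diff):

void OnigSearcher::Clear() {
  maxCachedIndex = -1;  // nothing has been tried yet; Search must re-run every regex
}

// Inside Search, once the regex at `index` has actually been run:
// before: maxCachedIndex = index;                             // could shrink on a cache miss
// after:  if (index > maxCachedIndex) maxCachedIndex = index; // only ever grows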

Long version
I noticed the following incorrect behavior through logging while running the first-mate benchmark. Consider this example:

  • There is a scanner with 3 regular expressions that is asked to scan repeatedly over a string of 31k characters.
  • The first time it is asked to scan (at location 0):
    • regex0: cache miss, matches at location 3
    • regex1: cache miss, matches at location 30245
    • regex2: cache miss, matches at location 0
  • Scanning stops; suppose regex2 consumes 4 characters.
  • The scanner now scans from location 4 in the same string:
    • regex0: cache miss (its cached match at location 3 is now behind us, so the result->LocationAt(0) >= byteOffset check fails)
    • maxCachedIndex gets incorrectly set to 0 (down from 2)
    • regex1: cache miss (index 1 > maxCachedIndex), so the regex needlessly re-scans the same string only to match again at location 30245 (see the sketch below)
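
To make the scenario concrete, here is a minimal, self-contained sketch of a cached-search loop with the fix applied (names like results, RunRegex, and LocationAt are paraphrased for illustration; this is not the literal node-oniguruma source):

#include <memory>
#include <vector>

struct OnigResult {
  int location;                        // byte offset where the regex matched
  int LocationAt(int) const { return location; }
};

struct OnigSearcher {
  std::vector<std::shared_ptr<OnigResult>> results;  // one cached result per regex
  int maxCachedIndex = -1;             // highest index whose result slot is initialized

  // Placeholder: the real code runs the regex (onig_search) and returns its match, if any.
  std::shared_ptr<OnigResult> RunRegex(int index, int byteOffset) {
    return nullptr;
  }

  std::shared_ptr<OnigResult> Search(int byteOffset, bool useCachedResults) {
    std::shared_ptr<OnigResult> best;
    for (int index = 0; index < (int)results.size(); index++) {
      std::shared_ptr<OnigResult> result;
      bool canUseCache = useCachedResults && index <= maxCachedIndex &&
                         results[index] && results[index]->LocationAt(0) >= byteOffset;
      if (canUseCache) {
        result = results[index];       // cache hit: the previous match is still ahead of us
      } else {
        result = RunRegex(index, byteOffset);  // cache miss: re-run the regex
        results[index] = result;
        if (index > maxCachedIndex)    // the fix: the bound only grows here, so
          maxCachedIndex = index;      // regex1's cached match at 30245 survives
      }
      if (result && (!best || result->LocationAt(0) < best->LocationAt(0)))
        best = result;                 // keep the leftmost match
    }
    return best;
  }
};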

It turns out this makes a huge difference in practice:

Before my proposed change:

c:\Alex\src\first-mate>npm run benchmark

> first-mate@5.0.0 benchmark c:\Alex\src\first-mate
> coffee benchmark/benchmark.coffee


Tokenizing jQuery v2.0.3
Generated tokens for 8830 lines in 718ms (0 tokens/ms)

Tokenizing jQuery v2.0.3 minified
Generated tokens for 7 lines in 26253ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1
Generated tokens for 5786 lines in 281ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1 minified
Generated tokens for 7 lines in 15196ms (0 tokens/ms)

After my proposed change:

c:\Alex\src\first-mate>npm run benchmark

> first-mate@5.0.0 benchmark c:\Alex\src\first-mate
> coffee benchmark/benchmark.coffee


Tokenizing jQuery v2.0.3
Generated tokens for 8830 lines in 678ms (0 tokens/ms)

Tokenizing jQuery v2.0.3 minified
Generated tokens for 7 lines in 1084ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1
Generated tokens for 5786 lines in 266ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1 minified
Generated tokens for 7 lines in 1744ms (0 tokens/ms)

About running the benchmarks:

  • using node v0.10.40
  • running on Windows
  • I am not familiar enough with v8, but I had to make the following unrelated change just to get the string cache to work (IsSame was always false for me on this node version). I ran both benchmarks with this change in; without it, I gave up after 20 minutes of waiting:
bool OnigStringContext::IsSame(Handle<String> other) const {
    // Cheap early-out before the content comparison.
    if (v8String->Length() != other->Length()) {
        return false;
    }
    // Compare string contents; the old handle comparison below checked
    // object identity, which was always false across separate calls here.
    return v8String->StrictEquals(other);
    // return v8String == other;
}
  • I had to modify the first-mate benchmark.coffee to not count the resulting tokens, because the format returned by tokenizeLines has changed (hence the 0 tokens/ms above).

I would be very thankful if you could cherry-pick this to master and also publish a new v4 on npm with this change, if you agree with my reasoning.

Thank you

@zcbenz
Contributor

zcbenz commented Sep 21, 2015

This change looks good to me, and the improvement is awesome!

> I am not familiar enough with v8, but I had to make the following unrelated change just to get the string cache to work (IsSame was always false for me on this node version). I ran both benchmarks with this change in; without it, I gave up after 20 minutes of waiting:

The current implementation of OnigStringContext::IsSame is wrong; can you replace it with the one you pasted? We don't need to compare the string lengths, though: the StrictEquals method already includes many optimizations.

> I would be very thankful if you could cherry-pick this to master and also publish a new v4 on npm with this change, if you agree with my reasoning.

It sounds good to me.

@alexdima
Contributor Author

@zcbenz Thank you for the feedback.

In node v0.12, Persistent no longer inherits from Handle, so, looking at the examples at https://strongloop.com/strongblog/node-js-v0-12-c-apis-breaking/, here's what I came up with:

bool OnigStringContext::IsSame(Handle<String> other) const {
#if (0 == NODE_MAJOR_VERSION && 11 <= NODE_MINOR_VERSION) || (1 <= NODE_MAJOR_VERSION)
  // node >= 0.11: Persistent no longer converts to a Handle implicitly,
  // so materialize a Local from the persistent handle first.
  return other->StrictEquals(v8::Local<String>::New(Isolate::GetCurrent(), v8String));
#else
  return other->StrictEquals(v8String);
#endif
}

Is this a good way to do it? (i.e., no memory leaks due to Local::New?)
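
My understanding is that Local::New by itself should not leak: a Local handle lives in the innermost v8::HandleScope and is reclaimed when that scope unwinds. Roughly this pattern (an illustrative sketch; the names are made up, not from this repo):

#include <v8.h>

// Illustrative only: why a Local created from a Persistent needs no manual cleanup.
bool CompareAgainstPersistent(v8::Isolate* isolate,
                              v8::Persistent<v8::String>& stored,
                              v8::Local<v8::String> other) {
  v8::HandleScope scope(isolate);  // owns every Local created while it is alive
  v8::Local<v8::String> local = v8::Local<v8::String>::New(isolate, stored);
  return other->StrictEquals(local);  // `local` is released when `scope` unwinds
}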

Thank you

@zcbenz
Contributor

zcbenz commented Sep 21, 2015

It looks perfect to me, thanks!

zcbenz added a commit that referenced this pull request Sep 21, 2015
25x speed improvement in tokenizing jquery.min: maxCachedIndex should not decrease due to a cache miss
@zcbenz zcbenz merged commit 7da0686 into atom:v4.x Sep 21, 2015
@kevinsawicki
Contributor

Thanks so much for this 🐎
