
Tokenizer performance improvement #139

Merged
merged 2 commits into from
Apr 9, 2015

Conversation

satchmorun
Contributor

I know that the CONTRIBUTING doc says to open an issue first, but this was a small change (and just a refactor). Please feel free to close if it's out of bounds.

Made a couple of changes to the tokenizer that make the implementation shorter and faster:

  1. Use String.prototype.trim
  2. Change the regex so that filtering out empty matches is not necessary
  3. Call toLowerCase once instead of repeatedly in map

Removing the calls to `map` and `filter` results in pretty big speed improvements (2x-4x), and probably (I haven't tested it) lower memory usage.

http://jsperf.com/stringsplitting
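The three changes can be sketched roughly as below. The "before" shape is my assumption of what the prior implementation looked like; the "after" one-liner is the form this PR introduces:

```javascript
// Assumed shape of the previous implementation: split, then
// lowercase every token in map, then filter out empty strings
// produced by leading/trailing separators.
function tokenizeBefore (obj) {
  return obj.toString().split(/[\s\-]+/)
    .map(function (token) { return token.toLowerCase() })
    .filter(function (token) { return token.length > 0 })
}

// After this change: trim and lowercase once, then split on a
// regex that cannot leave empty tokens inside the string.
function tokenizeAfter (obj) {
  return obj.toString().trim().toLowerCase().split(/[\s\-]+/)
}
```

Both produce the same tokens; the second avoids allocating the two intermediate arrays that `map` and `filter` create, which is where the measured speedup comes from.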

Arun Srinivasan added 2 commits April 1, 2015 14:11
Making the hour `12` prevents daylight saving time differences from being a factor, since all such transitions that I know of happen in the very early morning.
1. Use `String.prototype.trim`
2. Change the regex so that filtering out empty matches is not necessary
3. Call `toLowerCase` once instead of repeatedly in `map`

Removing the calls to `map` and `filter` results in pretty big speed
improvements (2X-4X), and probably (haven't tested it) lower memory
usage.

http://jsperf.com/stringsplitting
@olivernn olivernn merged commit ac932bd into olivernn:master Apr 9, 2015
@olivernn
Owner

olivernn commented Apr 9, 2015

Thanks a lot for this change. In isolation it is a massive improvement; the gains are more modest when the function is used in searching and adding, but it still moves the performance in the right direction! I've cut a new release, 0.5.8, that includes this change. I also added a performance test for this function so that it's easier to show future performance improvements/regressions.

@satchmorun
Contributor Author

@olivernn

Very cool, thanks. And absolutely, the gains are modest when searching and adding, but measurable, which is nice.

One thing I wanted to suggest with respect to tokenizing: what do you think about changing the line

```js
return obj.toString().trim().toLowerCase().split(/[\s\-]+/)
```

to

```js
return obj.toString().toLowerCase().match(/\w+/g)
```

in order to strip out more "special" characters?

```js
> obj = '[the](quick brown! fox)'
'[the](quick brown! fox)'
> obj.toString().trim().toLowerCase().split(/[\s\-]+/)
[ '[the](quick', 'brown!', 'fox)' ]
> obj.toString().toLowerCase().match(/\w+/g)
[ 'the', 'quick', 'brown', 'fox' ]
```

I know that I can set up my own pipeline, but it feels like this would produce better results for the stopword/stemming steps down the line, so it might be worth making official.
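One caveat worth noting with the `match`-based approach (my observation, not raised in the thread): `String.prototype.match` with a global regex returns `null`, not an empty array, when the input contains no word characters at all, so downstream code expecting an array would need a guard. A minimal sketch:

```javascript
// match() returns null (not []) when the input has no word
// characters, so fall back to an empty array.
function tokenize (obj) {
  return obj.toString().toLowerCase().match(/\w+/g) || []
}

tokenize('[the](quick brown! fox)')  // [ 'the', 'quick', 'brown', 'fox' ]
tokenize('!!!')                      // []
```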
