
Replace the tokenizer with a flex-based scanner #3846

Merged
kivikakk merged 17 commits into github-linguist:master from c-ext on Oct 31, 2017

Conversation

@kivikakk (Contributor) commented Oct 5, 2017

Preliminary benchmarks put this in at a 12x speedup.

[1] ~/github/linguist  $ bundle exec bin/bench
Rehearsal -----------------------------------------
c:      0.460000   0.000000   0.460000 (  0.464052)
rb:     5.120000   0.020000   5.140000 (  5.142903)
-------------------------------- total: 5.600000sec
            user     system      total        real
c:      0.420000   0.000000   0.420000 (  0.426902)
rb:     5.170000   0.020000   5.190000 (  5.188762)
Tokens extracted by each method:
 * extract_tokens_c: 3041830
 * extract_tokens_rb: 2836080

It doesn't produce identical results, but they're close enough that all the tests should pass.

/cc @vmg because he luuuuurves C
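
For reference, the figures above have the shape of Ruby's Benchmark.bmbm output (a rehearsal pass followed by the measured pass). A bin/bench script producing them might look roughly like the sketch below; the names extract_tokens_c and extract_tokens_rb come from the printed labels, while the sample loading, the Linguist::Tokenizer call, and the Ruby stand-in are illustrative assumptions rather than the script's actual code.

require "benchmark"
require "linguist"

# Stand-ins so the sketch is self-contained; in the real script these would be
# the new flex/C scanner and the old pure-Ruby tokenizer respectively.
def extract_tokens_c(data)
  Linguist::Tokenizer.tokenize(data)   # assumed entry point into the C extension
end

def extract_tokens_rb(data)
  data.scan(/[0-9A-Za-z_]+/)           # crude stand-in, not the real Ruby tokenizer
end

samples = Dir["samples/**/*"].select { |f| File.file?(f) }.map { |f| File.read(f) }

counts = Hash.new(0)
Benchmark.bmbm do |x|
  x.report("c:")  { counts[:c]  = samples.sum { |s| extract_tokens_c(s).length } }
  x.report("rb:") { counts[:rb] = samples.sum { |s| extract_tokens_rb(s).length } }
end

puts "Tokens extracted by each method:"
puts " * extract_tokens_c: #{counts[:c]}"
puts " * extract_tokens_rb: #{counts[:rb]}"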

@vmg (Contributor) commented Oct 5, 2017

Looking good. I assume we're now properly handling all the weird Flex crashes you were seeing?

I also think it'd be neat if the Rakefile had a way to rebuild the lexer (and also to check that we're using the right version of Flex).
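
A minimal sketch of what such a Rakefile task could look like, assuming the flex source lives at ext/linguist/linguist.l; the task name, file paths, and minimum version below are illustrative guesses, not what the PR ultimately adds.

# Rakefile sketch: regenerate the scanner and sanity-check the installed flex.
LEXER_SRC = "ext/linguist/linguist.l"          # assumed location of the flex source
LEXER_OUT = "ext/linguist/lex.linguist_yy.c"   # assumed generated output
MIN_FLEX  = Gem::Version.new("2.6.1")          # illustrative minimum version

desc "Regenerate the tokenizer from its flex source"
task :flex do
  version = `flex --version`[/\d+(?:\.\d+)+/] rescue nil
  abort "flex not found on PATH" if version.nil?
  abort "flex >= #{MIN_FLEX} required (found #{version})" if Gem::Version.new(version) < MIN_FLEX
  sh "flex -o #{LEXER_OUT} #{LEXER_SRC}"
end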

@lildude (Member) commented Oct 5, 2017

I've not tested this yet, but I wonder how well this new tokenizer will fare with non-ASCII. From the look of things, it should be 👌, but I thought I'd ask to be sure.

For context, an attempt to improve Linguist's support for non-ASCII in the Ruby implementation has been started in #3748.

@kivikakk (Contributor, Author) commented Oct 6, 2017

I assume we're now properly handling all the weird Flex crashes you were seeing?

Yeah; I constrained our use of features which turn out to be dangerous (LOOKING AT YOU, TRAILING CONTEXT), and everything works as expected now. ✨

I also think it'd be neat if the Rakefile had a way to rebuild the lexer (and also to check that we're using the right version of Flex).

+1, will add.

I've not tested this yet, but I wonder how well this new tokenizer will fare with non-ASCII.

It'll do as well as it currently does, which is to say Not Hugely Well; non-ASCII stuff will get skipped. It wouldn't be too hard to make it grok things we're likely to see in UTF-8 text, though it'd be a lot harder to do this and only match word-characters (since we'd have to add actual Unicode understanding to our lexer at that stage).

* Don't read and split the entire file if we only ever use the first/last n
  lines

* Only consider the first 50KiB when using heuristics/classifying.  This can
  save a *lot* of time; running a large number of regexes over 1MiB of text
  takes a while.

* Memoize File.size/read/stat; re-reading a 500KiB file every time `data` is
  called adds up a lot (see the sketch below).
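
To make the last two bullets concrete, here is a rough sketch of the idea; the method names, instance variables, and call sites are assumptions, not the PR's actual diff.

CLASSIFY_BYTE_LIMIT = 50 * 1024   # the "first 50KiB" cap from the commit message

def data
  # Read the blob once and reuse it; previously every call re-read the file.
  @data ||= File.read(@fullpath)
end

def size
  @size ||= File.size(@fullpath)
end

def classification_sample
  # Heuristics and the classifier only ever look at the first 50KiB.
  data[0, CLASSIFY_BYTE_LIMIT]
end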
@@ -289,6 +287,44 @@ def lines
end
end

def encoded_newlines_re
@encoded_newlines_re ||= Regexp.union(["\r\n", "\r", "\n"].
@Alhadis (Collaborator) commented Oct 9, 2017

Does the \R extension not work here?

I also take it Ruby's regex engine doesn't have the equivalent of Perl's /a modifier?

@kivikakk (Contributor, Author) replied

I'm changing as little code as I can; this is just a refactor from:

https://github.com/github/linguist/blob/0b9c05f989a66e3e92ca9a1e0b236781ef54229f/lib/linguist/blob_helper.rb#L278-L279

\R also catches [\v\f] which we definitely don't want.

I also take it Ruby's regex engine doesn't have the equivalent of Perl's /a modifier?

~$ ruby -e '//a'
-e:1: unknown regexp option - a

It doesn't, and more to the point, it wouldn't help our use case here, which isn't about Unicode-aware matching so much as avoiding terrible encoding exceptions rising from the deep. /a modifies the meaning of several sequences in the regular expression itself, rather than changing how a regular expression is applied to a given byte sequence tagged with an encoding (i.e. a String), whatever the meaning of its contents.
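
A quick irb-style illustration of both points; the union below is a simplified stand-in for encoded_newlines_re, and the exact exception message may vary by Ruby version.

newline_union = Regexp.union("\r\n", "\r", "\n")

"\v" =~ /\R/            # => 0    (\R also matches vertical tab and form feed)
"\f" =~ /\R/            # => 0
"\v" =~ newline_union   # => nil  (the explicit union matches only \r\n, \r, \n)

# The "terrible encoding exceptions": matching against a String whose bytes are
# invalid for its tagged encoding raises regardless of the pattern used.
"\xFF".force_encoding("UTF-8") =~ /\n/
# ArgumentError: invalid byte sequence in UTF-8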

@Alhadis (Collaborator) replied

Ah, I see. ;) Just thought to ask, since it's used very little in Perl (for good reasons). Thanks!

@kivikakk (Contributor, Author) commented

I'd like to merge this! Anyone feel like doing a final review?

@lildude (Member) left a review comment

Caveat pre-emptor: I have a copy of Dennis Ritchie's book, but I'm far from being a C expert.

From what I do know this looks good to me, and the perf improvement is fantastic!!

@kivikakk (Contributor, Author) commented

@lildude Thank you! The responsibility is mine if this somehow goes belly-up.

kivikakk merged commit 99eaf5f into github-linguist:master on Oct 31, 2017
kivikakk deleted the c-ext branch on Oct 31, 2017 at 00:07
kivikakk mentioned this pull request on Oct 31, 2017
kivikakk pushed a commit that referenced this pull request on Nov 9, 2017
smola mentioned this pull request on Jan 28, 2019
github-linguist locked as resolved and limited conversation to collaborators on Jun 17, 2024