Use separate tokenizers #125

Closed
tlienart opened this issue Sep 6, 2023 · 6 comments


@tlienart commented Sep 6, 2023

Hello,

In the docs, there's the following general pattern to tokenize a string:

rules = [
    :name => re"...",
    ...
]
@eval @enum Token errortoken $(first.(rules)...)
make_tokenizer((errortoken,
    [Token(i) => j for (i, j) in enumerate(last.(rules))]
)) |> eval
tokens = collect(tokenize(Token, s))

Is there an approach where I could call make_tokenizer for several different sets of rules and then call tokenize with the respective tokenizer? Ideally I'd like to have:

function get_tokens_1(s)
  # uses first set of rules
  return collect(tokenize(...))
end
function get_tokens_2(s)
  # uses second set of rules
  return collect(tokenize(...))
end

Using make_tokenizer in each of these functions would work but is undesirable as its overhead is significant (and I need to call these functions a lot).

I tried with the version argument, but either I didn't understand how to use it or it didn't work:

make_tokenizer(set_of_rules_1, 1) |> eval
make_tokenizer(set_of_rules_2, 2) |> eval

tokenize(s, 2)

I'd have expected this to correspond to the second set of rules, but it doesn't look like that's the case.

Thanks in advance

@jakobnissen (Member)

You should be able to just call make_tokenizer once for each of your different rule sets; I think that's the best approach.

> Using make_tokenizer in each of these functions would work but is undesirable as its overhead is significant (and I need to call these functions a lot).

The thing is that make_tokenizer creates code which, when evaluated, defines a bunch of methods. It's designed to be called at top level in the module, such that the work happens at precompile time when the package is installed. So, while it has a lot of overhead, this should be completely compiled away.
Of course, this also implies you can't make hundreds of different rule sets, because then you'd generate too much code, and both compile times and package image sizes would grow too large.
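
To make that concrete, here is a minimal sketch (reusing the Token enum and rules from the snippet in the original post) showing that make_tokenizer only builds an expression, and nothing is defined until it's evaluated:

ex = make_tokenizer((errortoken,
    [Token(i) => j for (i, j) in enumerate(last.(rules))]
))
ex isa Expr  # true: make_tokenizer returns code; it defines nothing by itself
eval(ex)     # evaluating at module top level defines the methods, so the
             # cost is paid once, ideally during precompilation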

Here is a full example with two rule sets:

julia> rules_1 = [
        :name => re"[A-Z]+",
        :digits => re"[0-9]+"
       ]
       rules_2 = [
        :name => re"[a-zA-Z]+", # different rule here
        :digits => re"[0-9]+"
       ]
       @assert first.(rules_1) == first.(rules_2)
       @eval @enum Token errortoken $(first.(rules_1)...)
       make_tokenizer(
           (errortoken,
           [Token(i) => j for (i,j) in enumerate(last.(rules_1))]);
           version=1
       ) |> eval
       make_tokenizer(
           (errortoken,
           [Token(i) => j for (i,j) in enumerate(last.(rules_2))]);
           version=2
       ) |> eval

julia> collect(Tokenizer{Token, String, 1}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, errortoken)
 (6, 3, digits)

julia> collect(Tokenizer{Token, String, 2}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, name)
 (6, 3, digits)
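
Given those top-level definitions, the two wrapper functions from the original post could look something like this (a sketch based on the session above; each call only constructs a Tokenizer struct, so the per-call overhead is small):

get_tokens_1(s::String) = collect(Tokenizer{Token, String, 1}(s))
get_tokens_2(s::String) = collect(Tokenizer{Token, String, 2}(s))

get_tokens_2("hello123")  # [(1, 5, name), (6, 3, digits)]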

Now there IS a bug where calling tokenize will simply ignore the version and always use version=1. I'll get that patched and tested.
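
Until that patch lands, the Tokenizer{Token, String, N} form above is the reliable workaround. Presumably the versioned tokenize call would then look like this (assuming the version is passed as the trailing argument):

collect(tokenize(Token, "hello123", 2))  # should match Tokenizer{Token, String, 2}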

@jakobnissen (Member)

You don't actually need the token names of the two rule sets to be the same.

@jakobnissen (Member)

Oh dear, looking through this code again and at the tests for the tokenizer, I see several quite embarrassing issues that I don't know how I could have missed! I even get LLVM errors running the test suite on Julia 1.10-beta2. I'm unfortunately quite busy for the next five days or so, but I'll put it in my calendar to check this early next week.

@tlienart (Author) commented Sep 6, 2023

Thanks a lot, your example with the version argument should be all I need. Glad it made you find some bugs along the way too 😄.

@jakobnissen (Member)

An upstream bug prevents me from working too much on this right now: JuliaLang/julia#51267 - but once that is fixed, I'll update the Tokenizer docs and fix up the code.

@jakobnissen (Member)

Solved in #126 - I just worked around the upstream issue, since the fix will not be backported to Julia 1.6 anyway.
