Use separate tokenizers #125

Closed
tlienart opened this issue Sep 6, 2023 · 6 comments


@tlienart commented Sep 6, 2023

Hello,

In the docs, there's the following general pattern to tokenize a string:

rules = [
    :name => re"...",
    ...
]
@eval @enum Token errortoken $(first.(rules)...)
make_tokenizer((errortoken,
    [Token(i) => j for (i, j) in enumerate(last.(rules))]
)) |> eval
tokens = collect(tokenize(Token, s))

Is there an approach where I could call make_tokenizer for several different sets of rules and then call tokenize with the respective tokenizer? Ideally I'd like to have:

function get_tokens_1(s)
  # uses first set of rules
  return collect(tokenize(...))
end
function get_tokens_2(s)
  # uses second set of rules
  return collect(tokenize(...))
end

Using make_tokenizer in each of these functions would work but is undesirable as its overhead is significant (and I need to call these functions a lot).

I tried with the version argument, but either I didn't understand how to use it or it didn't work:

make_tokenizer(set_of_rules_1, 1) |> eval
make_tokenizer(set_of_rules_2, 2) |> eval

tokenize(s, 2)

I'd have expected this to correspond to the second set of rules, but it doesn't look like that's the case.

Thanks in advance

@jakobnissen (Member)

You should be able to just call make_tokenizer once for each of your different rule sets; I think that's the best approach.

> Using make_tokenizer in each of these functions would work but is undesirable as its overhead is significant (and I need to call these functions a lot).

The thing is that make_tokenizer creates code which, when evaluated, defines a bunch of methods. It's designed to be called at top level in the module, such that the work happens at precompile time when the package is installed. So, while it has a lot of overhead, this should be completely compiled away.
Of course, this also implies you can't make hundreds of different rule sets, because then you'd generate too much code, and both compile times and package image sizes would grow too large.
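
To make that concrete, here is a minimal sketch (reusing the Token enum and rules from the snippet in the original post) showing that make_tokenizer only builds an expression, and nothing is defined until it's evaluated:

ex = make_tokenizer((errortoken,
    [Token(i) => j for (i, j) in enumerate(last.(rules))]
))
ex isa Expr  # true: make_tokenizer returns code; it defines nothing by itself
eval(ex)     # evaluating at module top level defines the methods, so the
             # cost is paid once, ideally during precompilation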

Here is a full example with two rule sets:

julia> rules_1 = [
        :name => re"[A-Z]+",
        :digits => re"[0-9]+"
       ]
       rules_2 = [
        :name => re"[a-zA-Z]+", # different rule here
        :digits => re"[0-9]+"
       ]
       @assert first.(rules_1) == first.(rules_2)
       @eval @enum Token errortoken $(first.(rules_1)...)
       make_tokenizer(
           (errortoken,
           [Token(i) => j for (i,j) in enumerate(last.(rules_1))]);
           version=1
       ) |> eval
       make_tokenizer(
           (errortoken,
           [Token(i) => j for (i,j) in enumerate(last.(rules_2))]);
           version=2
       ) |> eval

julia> collect(Tokenizer{Token, String, 1}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, errortoken)
 (6, 3, digits)

julia> collect(Tokenizer{Token, String, 2}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, name)
 (6, 3, digits)
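
Given those top-level definitions, the two wrapper functions from the original post could look something like this (a sketch based on the session above; each call only constructs a Tokenizer struct, so the per-call overhead is small):

get_tokens_1(s::String) = collect(Tokenizer{Token, String, 1}(s))
get_tokens_2(s::String) = collect(Tokenizer{Token, String, 2}(s))

get_tokens_2("hello123")  # [(1, 5, name), (6, 3, digits)]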

Now there IS a bug where calling tokenize will simply ignore the version and always use version=1. I'll get that patched and tested.
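
Until that patch lands, the Tokenizer{Token, String, N} form above is the reliable workaround. Presumably the versioned tokenize call would then look like this (assuming the version is passed as the trailing argument):

collect(tokenize(Token, "hello123", 2))  # should match Tokenizer{Token, String, 2}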

@jakobnissen (Member)

You don't actually need the token names of the two rule sets to be the same.

@jakobnissen (Member)

Oh dear, looking through this code again and at the tests for the tokenizer, I see several quite embarrassing issues that I don't know how I could have missed! I even get LLVM errors running the test suite on Julia 1.10-beta2. I'm unfortunately quite busy for the next five days or so, but I'll put it in my calendar to check this early next week.

@tlienart (Author) commented Sep 6, 2023

Thanks a lot, your example with the version argument should be all I need. Glad it made you find some bugs along the way too 😄.

@jakobnissen (Member)

An upstream bug prevents me from working too much on this right now: JuliaLang/julia#51267 - but once that is fixed, I'll update the Tokenizer docs and fix up the code.

@jakobnissen (Member)

Solved in #126 - I just worked around the upstream issue, since the fix will not be backported to Julia 1.6 anyway.
