Use separate tokenizers #125
Comments
You should be able to just call `make_tokenizer` once per set of rules, giving each set a different `version`. Here is an example:

```julia
julia> rules_1 = [
           :name => re"[A-Z]+",
           :digits => re"[0-9]+"
       ]

rules_2 = [
    :name => re"[a-zA-Z]+", # different rule here
    :digits => re"[0-9]+"
]

@assert first.(rules_1) == first.(rules_2)
@eval @enum Token errortoken $(first.(rules_1)...)

make_tokenizer(
    (errortoken,
    [Token(i) => j for (i,j) in enumerate(last.(rules_1))]);
    version=1
) |> eval

make_tokenizer(
    (errortoken,
    [Token(i) => j for (i,j) in enumerate(last.(rules_2))]);
    version=2
) |> eval

julia> collect(Tokenizer{Token, String, 1}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, errortoken)
 (6, 3, digits)

julia> collect(Tokenizer{Token, String, 2}("hello123"))
2-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 5, name)
 (6, 3, digits)
```

Now there IS a bug where calling …
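To address the original use case of calling each tokenizer from its own function: since `make_tokenizer` is evaluated only once per version at definition time, small wrappers like the sketch below (function names here are made up, and this assumes the definitions from the example above) should carry no per-call codegen overhead:

```julia
# Hypothetical wrappers over the versioned tokenizers defined above.
# The expensive make_tokenizer(...) |> eval step has already run once
# per version; these functions only iterate the generated tokenizers.
tokenize_upper(s::String) = collect(Tokenizer{Token, String, 1}(s))
tokenize_mixed(s::String) = collect(Tokenizer{Token, String, 2}(s))
```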
You don't actually need the token names of the two rule sets to be the same.
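If I read the comment above correctly, something like this sketch should also work, with each rule set getting its own enum so the generated tokenizers are distinguished by token type rather than by version (all enum and token names below are made up):

```julia
# Sketch: two unrelated token enums, one tokenizer each.
@enum Upper uerror uname udigits
@enum Mixed merror mword mdigits

make_tokenizer((uerror, [uname => re"[A-Z]+", udigits => re"[0-9]+"])) |> eval
make_tokenizer((merror, [mword => re"[a-zA-Z]+", mdigits => re"[0-9]+"])) |> eval

# Each call dispatches on its own token type:
collect(Tokenizer{Upper, String, 1}("HELLO123"))
collect(Tokenizer{Mixed, String, 1}("hello123"))
```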
Oh dear. Looking through this code again, and at the tests for the tokenizer, I see several quite embarrassing issues that I don't know how I could have missed! I even get LLVM errors running the test suite on Julia 1.10-beta2. Unfortunately I'm quite busy for the next five days or so, but I'll put it in my calendar to check this early next week.
Thanks a lot, your example should be all I need with the version thing. Glad it made you find some bugs along the way too 😄.
An upstream bug prevents me from working too much on this now: JuliaLang/julia#51267. Once that is fixed, I'll update the Tokenizer docs and fix up the code.
Solved in #126. I just worked around the upstream issue, since any bugfix will not be backported to 1.6 anyway.
Hello,

In the docs, there's the following general pattern to tokenize a string:

Is there an approach where I could call `make_tokenizer` for several different sets of "rules" and then call `tokenize` with the respective tokenizer? Ideally I'd like to have:

Using `make_tokenizer` in each of these functions would work, but is undesirable as its overhead is significant (and I need to call these functions a lot).

I tried with `version`, but either I didn't understand how to use it or it didn't work. I'd have expected this to correspond to the second set of rules, but it doesn't look like that's the case.

Thanks in advance