BUG: Multicharacter support #466

boyter · 2024-05-20T22:29:39Z

Describe the bug

The addition of the wenyan language in #465 highlights an issue in how scc matches tokens. If you run scc against any wenyan file, it does not count any complexity. This comes down to how the scc state machine works: it matches on characters, not runes.

To Reproduce

go run . ./examples/language/wenyan.wy

from inside the scc directory on a recent checkout.

Expected behavior

Some complexity should be counted for this file.


dbaggerman · 2024-05-27T23:02:19Z

It's been a while since I touched the code, but from what I remember the matching was done on bytes rather than characters. It should be able to match Unicode byte sequences just as well as ASCII ones.

However, that relies on both languages.json and the source code using the same encoding. If they're both UTF-8 encoded with the same byte sequence, then in theory it should be able to match them. On the other hand, if one is UTF-8 and the other is UTF-16, the underlying binary representation won't match, so a byte sequence comparison will fail.

boyter · 2024-05-28T01:23:56Z

Yep you are correct. That's the exact reason for it.

What I am debating is whether it should be fixed in the state machine, or whether there should be the ability to add new state machines for individual languages. That is something I wanted anyway, both to resolve some annoying issues and as a potential performance improvement: common languages such as Java could get their own specific code path.

dbaggerman · 2024-05-29T10:35:41Z

I remember seeing comments about the idea of language specific state machines to support the case of having one language nested within another.

The current state machine is built around looking up each byte in a fixed-size, 256-element array, which wouldn't adapt well to Unicode runes. The trees are probably not dense enough to justify a binary search within a node, so doing a linear search within each node might be the most obvious solution.
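As a rough sketch of that trade-off (not the actual scc data structures; every type and method name below is invented for illustration), a byte-indexed node can use a 256-slot array, while a rune-keyed node could hold a short slice of edges and scan it linearly:

```go
package main

import "fmt"

// byteNode illustrates the fixed-table idea: one child slot per possible byte value.
type byteNode struct {
	next  [256]*byteNode
	match bool
}

// runeEdge and runeNode illustrate the rune-keyed alternative. A rune can be
// any Unicode code point, so a full array per node is not practical.
type runeEdge struct {
	r    rune
	node *runeNode
}

type runeNode struct {
	edges []runeEdge // expected to be short, so a linear scan is cheap
	match bool
}

// child does the linear search within a node.
func (n *runeNode) child(r rune) *runeNode {
	for _, e := range n.edges {
		if e.r == r {
			return e.node
		}
	}
	return nil
}

// insert adds a token to the tree one rune at a time.
func (n *runeNode) insert(token string) {
	cur := n
	for _, r := range token {
		next := cur.child(r)
		if next == nil {
			next = &runeNode{}
			cur.edges = append(cur.edges, runeEdge{r, next})
		}
		cur = next
	}
	cur.match = true
}

// matchLen returns the length in runes of the longest token that is a
// prefix of input, or 0 if no token matches.
func (n *runeNode) matchLen(input []rune) int {
	cur, best := n, 0
	for i, r := range input {
		cur = cur.child(r)
		if cur == nil {
			break
		}
		if cur.match {
			best = i + 1
		}
	}
	return best
}

func main() {
	root := &runeNode{}
	root.insert("若")
	root.insert("云云")

	fmt.Println(root.matchLen([]rune("若 x")))  // one-rune token at the start
	fmt.Println(root.matchLen([]rune("云云 x"))) // two-rune token at the start
	fmt.Println(root.matchLen([]rune("abc")))  // no token
}
```

The linear scan costs more per step than an array index, which is why measuring it against the current byte-based lookup would matter before committing to it.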

Another solution would be to use the current Trie implementation, but re-encode the language tokens into a Trie per encoding. It would probably require some refactoring to track a Trie or LanguageFeature per encoding, but might perform better than a linear search.
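A possible shape for the trie-per-encoding idea, again as a sketch under assumptions rather than scc's real Trie: the `trie` type, the `encoder` function type, the encoding names, and `buildTries` are all hypothetical. Each token is re-encoded once per supported encoding and inserted into a byte trie for that encoding, so lookups stay byte-indexed.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// trie is a simplified byte trie standing in for the existing implementation.
type trie struct {
	next  [256]*trie
	match bool
}

func (t *trie) insert(token []byte) {
	cur := t
	for _, b := range token {
		if cur.next[b] == nil {
			cur.next[b] = &trie{}
		}
		cur = cur.next[b]
	}
	cur.match = true
}

// matchLen returns the length in bytes of the longest token that is a
// prefix of input, or 0 if no token matches.
func (t *trie) matchLen(input []byte) int {
	cur, best := t, 0
	for i, b := range input {
		cur = cur.next[b]
		if cur == nil {
			break
		}
		if cur.match {
			best = i + 1
		}
	}
	return best
}

// encoder turns a token (a UTF-8 Go string) into the bytes of some encoding.
type encoder func(string) []byte

func toUTF8(s string) []byte { return []byte(s) }

func toUTF16LE(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 0, len(units)*2)
	for _, u := range units {
		out = append(out, byte(u), byte(u>>8)) // low byte first
	}
	return out
}

// buildTries re-encodes every token once per encoding and builds one trie each.
func buildTries(tokens []string, encoders map[string]encoder) map[string]*trie {
	tries := make(map[string]*trie, len(encoders))
	for name, enc := range encoders {
		t := &trie{}
		for _, tok := range tokens {
			t.insert(enc(tok))
		}
		tries[name] = t
	}
	return tries
}

func main() {
	tokens := []string{"若", "者"} // illustrative tokens only
	tries := buildTries(tokens, map[string]encoder{
		"utf-8":    toUTF8,
		"utf-16le": toUTF16LE,
	})

	source := "若 x"
	fmt.Println(tries["utf-8"].matchLen(toUTF8(source)))       // matched bytes in UTF-8
	fmt.Println(tries["utf-16le"].matchLen(toUTF16LE(source))) // matched bytes in UTF-16LE
	fmt.Println(tries["utf-8"].matchLen(toUTF16LE(source)))    // wrong trie for the encoding
}
```

This approach still needs something to decide which encoding a file uses before picking a trie, which the sketch leaves out.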

boyter · 2024-05-29T22:02:10Z

Yes, the idea of language-specific state machines was to support that case, as well as to skip a heap of logic that does not apply where possible. For example, Java does not support nested multiline comments, so a Java-specific machine could skip the bookkeeping for that and hopefully gain a performance improvement. I would only be looking to do this for the most common languages though, as the generic version is much better for dealing with ad-hoc additions. It would also allow for dealing with any really oddball languages that pop up.

My initial plan is to just see what effect some rune conversion has as a baseline and go from there.

boyter added the bug label on May 20, 2024
