Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrating to package:characters? #80

Open
domesticmouse opened this issue Aug 16, 2020 · 16 comments
Open

Migrating to package:characters? #80

domesticmouse opened this issue Aug 16, 2020 · 16 comments

Comments

@domesticmouse
Copy link

Hi Lukas,

I'm curious as to your feelings about the potential for porting Petit Parser to the Characters package?

I'm wondering if the CharacterRange iterator is sufficient for Petit Parser's backtracking requirements. The upside of migrating to the Character package is that Petit Parser would be parsing in terms of Unicode Grapheme clusters instead of characters.

brett

@renggli
Copy link
Member

renggli commented Aug 16, 2020

Hi Brett,

The characters package is great, I've already converted some of my other code to use it. For example, the more package uses it for its configurable printers. I've also tried to use it for the char_matcher in the same package (which is used in similar form in PetitParser), with less success. I found it challenging to implement character-based predicates efficiently, for example character ranges or character sets (whitespaces, letters, digits, lower-case, upper-case, ...).

PetitParser needs random access to characters. It currently does it mostly through codeUnitAt, however the original implementation in Smalltalk used something similar to CharacterRange to navigate forward and backward through the input. I expect this could easily be adopted in Dart.

What I am more concerned about is performance: I expect moveNext, current and copy to be the most frequently called methods. They all execute quite a bit of code, and most of them also allocate new objects. In version 2.2.0 I introduced a fast-parse mode that avoids memory allocations during parsing, which brought a huge speed improvement on some inputs for lexer-like matching. Not sure how this could be adopted with characters?

@domesticmouse
Copy link
Author

I'm sure we'd love some feedback at https://github.com/dart-lang/characters if you have a chance =)

@Hamza5
Copy link

Hamza5 commented Nov 9, 2021

Now after more than a year, is there any Unicode support available? We are in 2021 and Unicode support is really required.

@renggli
Copy link
Member

renggli commented Nov 9, 2021

This library is designed to parse over streams of bytes (or UTF-16 characters), and as such it is agnostic to the encoding of your input. For example, the xml library is successfully parsing unicode input without the need for either of the libraries to "understand" the underlying characters.

While I understand that out of the box decoding would be desirable, it comes at a hefty cost in performance. So far I haven't seen a compelling need to support it natively. A real-world use-case where the current infrastructure doesn't work would definitely help to motivate investing time into this issue.

@RandalSchwartz
Copy link

To only slightly hijack this thread, I'm wondering if it's possible to use the higher-level logic provided by the parser to extract values out of an arbitrary jsonDecode. It'd be nice if I could specify a pattern of a json object containing some strings and bools and a json list of further objects, and then the .map action could map that into a Dart object via a constructor. I suspect it's only a matter of coming up with a useful set of primitives, and then using the rest of the mechanics without change, but if there's been any thought about this, I'd be interested. I know the switch stuff makes some of this easier, but it still doesn't feel as satisfying as just writing a composable Parser rule.

@amake
Copy link

amake commented Dec 2, 2024

A real-world use-case where the current infrastructure doesn't work would definitely help to motivate investing time into this issue.

I'm not sure if this fits the bill, but I just discovered that I can't make a pattern parser with Unicode codepoints above U+FFFF because toCharCode only sees the first surrogate pair.

To make everything safe for characters above U+FFFF, I think you need to replace every instance of String.codeUnitAt with the equivalent of String.runes.first.

@renggli
Copy link
Member

renggli commented Dec 2, 2024

As mentioned in #147 (comment) you can create a parser that accepts surrogate pairs. That said, there should probably be helpers that would make this simpler and the pattern parser should likely support this out of the box.

@amake
Copy link

amake commented Dec 2, 2024

What I want is to say pattern('\u{20000}-\u{2FFFF}') (this is an actual pattern I wanted to make) but this blows up because codeunit-wise the ranges are invalid. This isn't the same as merely recognizing surrogate pairs; I need to recognize only certain combinations of surrogate pairs.

I suppose I can make a big OptionParser with string('\u{20000}') | string('\u20001') | ... but this doesn't seem reasonable to me.

@renggli
Copy link
Member

renggli commented Dec 2, 2024

Understood, unfortunately this is currently not possible. Switching to use runes or characters everywhere comes at a hefty performance penalty for everybody, while most parsers do not require it.

In the meantime, I suggest you work around with the proposed solution, there is no need for a big OptionParser:

final surrogatePair = seq2(
    pattern('\uD800-\uDBFF'),
    pattern('\uDC00-\uDFFF'),
);
final decodedSurrogatePair = surrogatePair.map2((hi, lo) =>
    0x400 * (hi.codeUnitAt(0) - 0xD800) +
    (lo.codeUnitAt(0) - 0xDC00) +
    0x10000);
final patternInQuestion = decodedSurrogatePair
    .where((value) => 0x20000 <= value && value <= 0x2FFFF);

@amake
Copy link

amake commented Dec 2, 2024

Thank you for the suggestion.

Switching to use runes or characters everywhere comes at a hefty performance penalty for everybody, while most parsers do not require it.

I agree with this. I had two possibilities in mind:

  • Only use runes when necessary (peek at the first code unit, and if it's a surrogate then use runes)
  • Introduce alternate parsers that use runes: in my experience it is not an uncommon API design (especially among languages that use UTF-16 as the internal string encoding) to have a unicode flag or even parallel set of classes/functions/etc. In this case it could be pattern('\u{20000}-\u{2FFFF}', unicode: true) or even unicodePattern('\u{20000}-\u{2FFFF}')

Do either of these seem compelling to you?

@renggli
Copy link
Member

renggli commented Dec 2, 2024

I like the unicode: true param, that matches the API of the official RegExp class.

@amake
Copy link

amake commented Dec 2, 2024

I started looking at this and it's fairly hairy.

  • Most parser classes make some assumption that 1 char = 1 code unit, and the conditionals and fiddling needed to handle the Unicode case make the code much harder to read.
  • The top-level functions like pattern, any, etc., frequently have optional positional parameters, which prevents the addition of a unicode named argument. A boolean positional argument is not very nice in my opinion.

It looks to me like a parallel API would be nicer. I wouldn't blame you for not wanting to deal with that.

@renggli
Copy link
Member

renggli commented Dec 2, 2024

Agreed, a boolean positional argument is not desired. In this case probably better to introduce a parallel API (patternUnicode, anyUnicode, etc.) and then deprecate it again when Dart adds support to Allow both optional positional and optional named arguments in the same function signature.

I can look into some options later today.

@amake
Copy link

amake commented Dec 2, 2024

Actually I think some of the difficulty is in going back and forth between ints and strings. Ultimately I am looking to do range comparisons and use e.g. UnicodeCharMatcher from the more package, so rather than an entire parallel API it might make more sense to add some parsers that return ints rather than strings.

I'll play around with this and see what I come up with.

@renggli
Copy link
Member

renggli commented Dec 2, 2024

Some initial prototype of how a unicode character parser could look like: #183. In the tests it seems to work well, not sure how to create a nice API yet.

@amake
Copy link

amake commented Dec 3, 2024

Thanks for that. I ended up with something similar but more specialized—my only use case allows me to get by with a Parser because I always discard the result.
amake/org_parser@5056bcf#diff-8f6acb6c09132272f801008ec6c7197d34f97ad78f3f37be4335d90618e4f5c4R20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants