Migrating to package:characters? #80

domesticmouse · 2020-08-16T13:14:39Z

Hi Lukas,

I'm curious as to your feelings about the potential for porting Petit Parser to the Characters package?

I'm wondering if the CharacterRange iterator is sufficient for Petit Parser's backtracking requirements. The upside of migrating to the Character package is that Petit Parser would be parsing in terms of Unicode Grapheme clusters instead of characters.

brett

renggli · 2020-08-16T15:47:48Z

Hi Brett,

The characters package is great, I've already converted some of my other code to use it. For example, the more package uses it for its configurable printers. I've also tried to use it for the char_matcher in the same package (which is used in similar form in PetitParser), with less success. I found it challenging to implement character-based predicates efficiently, for example character ranges or character sets (whitespaces, letters, digits, lower-case, upper-case, ...).

PetitParser needs random access to characters. It currently does it mostly through codeUnitAt, however the original implementation in Smalltalk used something similar to CharacterRange to navigate forward and backward through the input. I expect this could easily be adopted in Dart.

What I am more concerned about is performance: I expect moveNext, current and copy to be the most frequently called methods. They all execute quite a bit of code, and most of them also allocate new objects. In version 2.2.0 I introduced a fast-parse mode that avoids memory allocations during parsing, which brought a huge speed improvement on some inputs for lexer-like matching. Not sure how this could be adopted with characters?

domesticmouse · 2020-08-17T06:19:10Z

I'm sure we'd love some feedback at https://github.com/dart-lang/characters if you have a chance =)

Hamza5 · 2021-11-09T08:07:22Z

Now after more than a year, is there any Unicode support available? We are in 2021 and Unicode support is really required.

renggli · 2021-11-09T08:58:10Z

This library is designed to parse over streams of bytes (or UTF-16 characters), and as such it is agnostic to the encoding of your input. For example, the xml library is successfully parsing unicode input without the need for either of the libraries to "understand" the underlying characters.

While I understand that out of the box decoding would be desirable, it comes at a hefty cost in performance. So far I haven't seen a compelling need to support it natively. A real-world use-case where the current infrastructure doesn't work would definitely help to motivate investing time into this issue.

RandalSchwartz · 2023-09-09T21:30:59Z

To only slightly hijack this thread, I'm wondering if it's possible to use the higher-level logic provided by the parser to extract values out of an arbitrary jsonDecode. It'd be nice if I could specify a pattern of a json object containing some strings and bools and a json list of further objects, and then the .map action could map that into a Dart object via a constructor. I suspect it's only a matter of coming up with a useful set of primitives, and then using the rest of the mechanics without change, but if there's been any thought about this, I'd be interested. I know the switch stuff makes some of this easier, but it still doesn't feel as satisfying as just writing a composable Parser rule.

amake · 2024-12-02T01:20:45Z

A real-world use-case where the current infrastructure doesn't work would definitely help to motivate investing time into this issue.

I'm not sure if this fits the bill, but I just discovered that I can't make a pattern parser with Unicode codepoints above U+FFFF because toCharCode only sees the first surrogate pair.

To make everything safe for characters above U+FFFF, I think you need to replace every instance of String.codeUnitAt with the equivalent of String.runes.first.

renggli · 2024-12-02T08:09:50Z

As mentioned in #147 (comment) you can create a parser that accepts surrogate pairs. That said, there should probably be helpers that would make this simpler and the pattern parser should likely support this out of the box.

amake · 2024-12-02T08:15:00Z

What I want is to say pattern('\u{20000}-\u{2FFFF}') (this is an actual pattern I wanted to make) but this blows up because codeunit-wise the ranges are invalid. This isn't the same as merely recognizing surrogate pairs; I need to recognize only certain combinations of surrogate pairs.

I suppose I can make a big OptionParser with string('\u{20000}') | string('\u20001') | ... but this doesn't seem reasonable to me.

renggli · 2024-12-02T08:44:39Z

Understood, unfortunately this is currently not possible. Switching to use runes or characters everywhere comes at a hefty performance penalty for everybody, while most parsers do not require it.

In the meantime, I suggest you work around with the proposed solution, there is no need for a big OptionParser:

final surrogatePair = seq2(
    pattern('\uD800-\uDBFF'),
    pattern('\uDC00-\uDFFF'),
);
final decodedSurrogatePair = surrogatePair.map2((hi, lo) =>
    0x400 * (hi.codeUnitAt(0) - 0xD800) +
    (lo.codeUnitAt(0) - 0xDC00) +
    0x10000);
final patternInQuestion = decodedSurrogatePair
    .where((value) => 0x20000 <= value && value <= 0x2FFFF);

amake · 2024-12-02T10:23:05Z

Thank you for the suggestion.

Switching to use runes or characters everywhere comes at a hefty performance penalty for everybody, while most parsers do not require it.

I agree with this. I had two possibilities in mind:

Only use runes when necessary (peek at the first code unit, and if it's a surrogate then use runes)
Introduce alternate parsers that use runes: in my experience it is not an uncommon API design (especially among languages that use UTF-16 as the internal string encoding) to have a unicode flag or even parallel set of classes/functions/etc. In this case it could be pattern('\u{20000}-\u{2FFFF}', unicode: true) or even unicodePattern('\u{20000}-\u{2FFFF}')

Do either of these seem compelling to you?

renggli · 2024-12-02T10:34:02Z

I like the unicode: true param, that matches the API of the official RegExp class.

amake · 2024-12-02T11:36:09Z

I started looking at this and it's fairly hairy.

Most parser classes make some assumption that 1 char = 1 code unit, and the conditionals and fiddling needed to handle the Unicode case make the code much harder to read.
The top-level functions like pattern, any, etc., frequently have optional positional parameters, which prevents the addition of a unicode named argument. A boolean positional argument is not very nice in my opinion.

It looks to me like a parallel API would be nicer. I wouldn't blame you for not wanting to deal with that.

renggli · 2024-12-02T12:04:18Z

Agreed, a boolean positional argument is not desired. In this case probably better to introduce a parallel API (patternUnicode, anyUnicode, etc.) and then deprecate it again when Dart adds support to Allow both optional positional and optional named arguments in the same function signature.

I can look into some options later today.

amake · 2024-12-02T12:40:28Z

Actually I think some of the difficulty is in going back and forth between ints and strings. Ultimately I am looking to do range comparisons and use e.g. UnicodeCharMatcher from the more package, so rather than an entire parallel API it might make more sense to add some parsers that return ints rather than strings.

I'll play around with this and see what I come up with.

renggli · 2024-12-02T13:43:09Z

Some initial prototype of how a unicode character parser could look like: #183. In the tests it seems to work well, not sure how to create a nice API yet.

amake · 2024-12-03T23:34:37Z

Thanks for that. I ended up with something similar but more specialized—my only use case allows me to get by with a Parser because I always discard the result.
amake/org_parser@5056bcf#diff-8f6acb6c09132272f801008ec6c7197d34f97ad78f3f37be4335d90618e4f5c4R20

renggli added the enhancement label Aug 16, 2020

renggli added a commit that referenced this issue Feb 7, 2021

#80: Attempt at abstracting over String, to be able to use Characters.

8468d7e

renggli mentioned this issue Apr 2, 2023

Higher Unicode planes support? #147

Closed

renggli added a commit that referenced this issue Dec 2, 2024

Add test for workaround described in #80.

a48ffec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrating to package:characters? #80

Migrating to package:characters? #80

domesticmouse commented Aug 16, 2020

renggli commented Aug 16, 2020

domesticmouse commented Aug 17, 2020

Hamza5 commented Nov 9, 2021

renggli commented Nov 9, 2021 •

edited

Loading

RandalSchwartz commented Sep 9, 2023

amake commented Dec 2, 2024

renggli commented Dec 2, 2024

amake commented Dec 2, 2024 •

edited

Loading

renggli commented Dec 2, 2024 •

edited

Loading

amake commented Dec 2, 2024 •

edited

Loading

renggli commented Dec 2, 2024

amake commented Dec 2, 2024

renggli commented Dec 2, 2024

amake commented Dec 2, 2024 •

edited

Loading

renggli commented Dec 2, 2024 •

edited

Loading

amake commented Dec 3, 2024

Migrating to package:characters? #80

Migrating to package:characters? #80

Comments

domesticmouse commented Aug 16, 2020

renggli commented Aug 16, 2020

domesticmouse commented Aug 17, 2020

Hamza5 commented Nov 9, 2021

renggli commented Nov 9, 2021 • edited Loading

RandalSchwartz commented Sep 9, 2023

amake commented Dec 2, 2024

renggli commented Dec 2, 2024

amake commented Dec 2, 2024 • edited Loading

renggli commented Dec 2, 2024 • edited Loading

amake commented Dec 2, 2024 • edited Loading

renggli commented Dec 2, 2024

amake commented Dec 2, 2024

renggli commented Dec 2, 2024

amake commented Dec 2, 2024 • edited Loading

renggli commented Dec 2, 2024 • edited Loading

amake commented Dec 3, 2024

renggli commented Nov 9, 2021 •

edited

Loading

amake commented Dec 2, 2024 •

edited

Loading

renggli commented Dec 2, 2024 •

edited

Loading

amake commented Dec 2, 2024 •

edited

Loading

amake commented Dec 2, 2024 •

edited

Loading

renggli commented Dec 2, 2024 •

edited

Loading