Allow reverse to apply to a string #412
Comments
I like this, though implementation will be difficult because of Unicode's combining character sequences. Something like "\u0061\u0304" (Latin small letter a + combining macron) is technically two codepoints, but to reverse the string properly those two need to stay in that order in the reversal, while "\u0101" (Latin small letter a with macron) looks identical and would be simple to reverse. Does anyone have any thoughts on this? |
My thoughts exactly. This really requires more of a Unicode library. Or |
Is |
That depends on your definition of correctly. Most libraries I've used count the number of codepoints, not the number of graphemes, which is what jq's length builtin does. I would say this is correct behavior. |
My previous comment is misleading. I meant to say that jq, like the other libraries I've used (including the Objective-C and Java standard libraries), counts codepoints. So "\u0061\u0304" has a length of 2, while "\u0101" has a length of 1, even though both render as a single grapheme (looks like this: ā). |
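To make the codepoint-versus-grapheme point concrete, here is a small sketch in Python rather than jq, chosen only because Python's `len` on a string also counts codepoints. The variable names are illustrative and nothing here is jq's own implementation.

```python
import unicodedata

decomposed = "\u0061\u0304"  # 'a' followed by a combining macron (two codepoints)
precomposed = "\u0101"       # 'a with macron' as a single codepoint

# Both render as one grapheme, but the codepoint counts differ.
decomposed_length = len(decomposed)
precomposed_length = len(precomposed)
print(decomposed_length, precomposed_length)

# NFC normalization composes the pair into the precomposed form.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
print(same_after_nfc)
```

The two strings only compare equal after normalization, which is why counting or reversing "characters" needs more than a codepoint list.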
Sure. I was trying to make the point that if |
@wtlangford @miracle2k Most of the time counting codepoints is what you want, and anyways, it's the next cheapest operation after counting bytes. Counting characters is hard enough, and counting graphemes (if you include support for grapheme clusters) is more expensive still. For string reversal you really want to distinguish characters, not codepoints. IMO anyways. In the interim you can always do this:
and now you can reverse either strings or arrays without further ado. (And since we try to preserve object key order, we could even "reverse" objects, but let's not :) This approach lets us off the hook for now. In the longer term we might have a function that knows the combining codepoint ranges and deals with characters. In the longer longer term we might want a bit of a Unicode library: for normalization, normalization-insensitive string comparison, grapheme cluster detection, grapheme counting, and so on. I'd rather not think about it for now :) |
Why does anyone ever want to reverse a string? Reversing a list, sure. Programming assignment to implement string reversing, sure. But in an actual program? It's not even a well-defined operation on a general (Unicode) string. If for some reason someone does want to reverse a string codepoint by codepoint, then converting to a list of codepoints, reversing that, and converting back doesn't seem like too much work. |
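The codepoint-by-codepoint route described above (string to list of codepoints, reverse the list, convert back) might look roughly like this. It is a Python sketch of the idea, not jq code, and `reverse_codepoints` is a hypothetical name used only for illustration.

```python
def reverse_codepoints(s: str) -> str:
    """Reverse a string one codepoint at a time (no grapheme awareness)."""
    codepoints = [ord(ch) for ch in s]   # string -> list of integer codepoints
    codepoints.reverse()                 # plain list reversal
    return "".join(chr(cp) for cp in codepoints)  # list -> string


plain = reverse_codepoints("abc")
# With a combining sequence, the macron ends up in front of the 'a'
# instead of after it, so it no longer modifies the same letter.
broken = reverse_codepoints("xa\u0304")
print(plain)
print([hex(ord(ch)) for ch in broken])
```

This works fine for strings without combining sequences, and it shows exactly how a combining mark gets detached from its base letter otherwise.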
Interestingly, I believe this is how most standard libraries do it anyways. Some of them have ways to make sure you're reversing composed character sequences properly (Objective-C's Foundation gives you substrings that represent each composed character sequence). But most just assume you know what you're getting into when you start reversing strings. |
The more I think about this, the more I like just providing conversion to and from a list of codepoints and list reversal. Programmers who reverse strings should acknowledge that they're doing something horrible by converting to a list of codepoints, rather than calling a library function that hides their sins :) |
Yeah. Here there be demons. |
@stedolan We already have those converters: |
I'm thinking that it should be possible to write jq-coded functions to do this correctly by grouping codepoints that make up characters. It would require checking for all combining codepoint ranges, but that's not so bad. A filter on |
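The grouping idea can be sketched as follows, again in Python rather than jq. This is a simplification under a stated assumption: a codepoint is treated as "combining" when `unicodedata.combining` reports a nonzero canonical combining class, which does not cover every case (zero-width joiners, Hangul jamo, and full grapheme cluster rules are ignored). The names `clusters` and `reverse_characters` are hypothetical.

```python
import unicodedata


def clusters(s: str) -> list[str]:
    """Group each base codepoint with the combining marks that follow it."""
    groups: list[str] = []
    for ch in s:
        # Assumption: nonzero canonical combining class means "attach to
        # the previous group". A leading mark starts its own group.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def reverse_characters(s: str) -> str:
    """Reverse the order of the groups, keeping each group's internal order."""
    return "".join(reversed(clusters(s)))


result = reverse_characters("xa\u0304")
print([hex(ord(ch)) for ch in result])
```

Here the macron stays attached to its 'a' after reversal, unlike a plain codepoint reversal.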