Allow reverse to apply to a string #412

Open
miracle2k opened this issue Jun 14, 2014 · 13 comments

Comments

@miracle2k

No description provided.

@wtlangford
Contributor

I like this, though implementation will be difficult because of Unicode's combining character sequences. Something like "\u0061\u0304" (Latin small letter a + combining macron) is technically two codepoints, and to reverse the string properly those two need to stay in that order in the reversal, while "\u0101" (Latin small letter a with macron) looks identical and would be simple to reverse. Does anyone have any thoughts on this?
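
To make the problem concrete, here is a minimal sketch in Python (used here only because its strings are also sequences of codepoints; the sample strings and variable names are made up for illustration). It shows what a plain codepoint-by-codepoint reversal does to each spelling:

```python
# "a" + COMBINING MACRON + "b": three codepoints, rendered as "āb".
decomposed = "\u0061\u0304b"
# LATIN SMALL LETTER A WITH MACRON + "b": two codepoints, same rendering.
precomposed = "\u0101b"

# Naive reversal, one codepoint at a time.
naive_decomposed = decomposed[::-1]
naive_precomposed = precomposed[::-1]

# The precomposed form reverses cleanly: "b" followed by U+0101.
print([hex(ord(c)) for c in naive_precomposed])
# The decomposed form becomes "b", U+0304, "a": the macron now follows
# the "b" and so attaches to it instead of the "a".
print([hex(ord(c)) for c in naive_decomposed])
```

The two inputs look identical on screen, but only one of them survives the naive reversal.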

@nicowilliams
Contributor

My thoughts exactly. This really requires more of a Unicode library. Or
at least the ranges of combining codepoints (though that isn't quite
sufficient).

@miracle2k
Author

Is length handling Unicode correctly?

@wtlangford
Contributor

That depends on your definition of correctly. Most libraries I've used count the number of codepoints, not the number of graphemes, which is what jq's length builtin does. I would say this is correct behavior.

@wtlangford
Contributor

My previous comment is misleading. I meant to say that jq, like the other libraries I've used (including the Objective-C and Java standard libraries), counts codepoints. So "\u0061\u0304" has a length of 2, while "\u0101" has a length of 1, even though both render as a single grapheme (looks like this: ā).
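
The same counts can be reproduced outside jq. The following is a small Python sketch, not jq code; Python's `len` also counts codepoints, and `unicodedata.normalize` is used only to show that the two spellings are canonically equivalent:

```python
import unicodedata

decomposed = "\u0061\u0304"   # a + combining macron
precomposed = "\u0101"        # a with macron, a single codepoint

print(len(decomposed))    # 2
print(len(precomposed))   # 1

# The raw strings differ, but NFC normalization composes the pair into
# the single precomposed codepoint.
print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```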

@miracle2k
Author

Sure. I was trying to make the point that if length doesn't handle graphemes, like most programming languages, it might be ok if reverse doesn't either (also like most programming languages).

@nicowilliams
Contributor

@wtlangford @miracle2k Most of the time counting codepoints is what you want, and anyways, it's the next cheapest operation after counting bytes. Counting characters is hard enough, and counting graphemes (if you include support for grapheme clusters) is more expensive still.

For string reversal you really want to distinguish characters, not codepoints. IMO anyways.

In the interim you can always do this:

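# Keep a handle on the builtin before shadowing it.
def reverse_orig: reverse;
# Strings are reversed codepoint by codepoint; everything else uses the builtin.
def reverse: if type == "string" then explode | reverse_orig | implode else reverse_orig end;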

and now you can reverse either strings or arrays without further ado. (And since we try to preserve object key order, we could even "reverse" objects, but let's not :)

This approach lets us off the hook for now.

In the longer term we might have a function that knows the combining codepoint ranges and deals with characters.

In the longer longer term we might want a bit of a Unicode library: for normalization, normalization-insensitive string comparison, grapheme cluster detection, grapheme counting, and so on. I'd rather not think about it for now :)
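
The "function that knows the combining codepoints" idea above can be sketched as follows. This is a Python illustration under stated assumptions, not a proposed jq implementation: the name `reverse_characters` is hypothetical, and the test for "combining" is Python's `unicodedata.combining`, which reports the canonical combining class. Marks with class 0, joiner sequences, and other grapheme cluster rules are not handled, so this is only an approximation of reversing by character.

```python
import unicodedata

def reverse_characters(s):
    """Reverse s while keeping each base codepoint with its combining marks."""
    clusters = []
    for ch in s:
        # combining() returns a non-zero class for marks such as U+0304;
        # attach those to the cluster started by the preceding codepoint.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

# "a" + macron + "b" reverses to "b" + "a" + macron: the mark stays on the "a".
print([hex(ord(c)) for c in reverse_characters("\u0061\u0304b")])
```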

@stedolan
Contributor

Why does anyone ever want to reverse a string? Reversing a list, sure. Programming assignment to implement string reversing, sure. But in an actual program? It's not even a well-defined operation on a general (Unicode) string.

If for some reason someone does want to reverse a string codepoint by codepoint, then converting to a list of codepoints, reversing that, and converting back doesn't seem like too much work.
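
As a rough illustration of that three-step approach, here is a Python sketch. The helper names `explode` and `implode` are borrowed from the jq builtins mentioned in this thread, but these are hypothetical Python stand-ins, not the jq functions themselves:

```python
def explode(s):
    """String -> list of codepoint numbers."""
    return [ord(ch) for ch in s]

def implode(codepoints):
    """List of codepoint numbers -> string."""
    return "".join(chr(cp) for cp in codepoints)

def reverse_codepoints(s):
    """Reverse s one codepoint at a time, with no awareness of combining marks."""
    return implode(list(reversed(explode(s))))

print(explode("abc"))
print(reverse_codepoints("abc"))
```

Having the conversion spelled out at the call site makes it visible that the reversal operates on codepoints, not on characters as a reader sees them.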

@wtlangford
Contributor

Interestingly, I believe this is how most standard libraries do it anyways. Some of them have ways to make sure you're reversing composed character sequences properly (Objective-C's Foundation gives you substrings that represent each composed character sequence). But most just assume you know what you're getting into when you start reversing strings.

@stedolan
Contributor

The more I think about this, the more I like just providing conversion to and from a list of codepoints and list reversal. Programmers who reverse strings should acknowledge that they're doing something horrible by converting to a list of codepoints, rather than calling a library function that hides their sins :)

@wtlangford
Contributor

Yeah. Here there be demons.

@nicowilliams
Contributor

@stedolan We already have those converters: explode and implode.

@nicowilliams
Contributor

I'm thinking that it should be possible to write jq-coded functions to do this correctly by grouping codepoints that make up characters. It would require checking for all combining codepoint ranges, but that's not so bad. A filter on explode. I think a lot of advanced Unicode support, if we want it, could be jq-coded. (For the new import module facility, I've been thinking it'd be nice to have an import library data option, so that large Unicode tables could be stored as JSON instead of in .jq files.) I'd be much happier with that than with a dependency on some C Unicode library...
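
The "ranges stored as JSON, applied as a filter on the exploded codepoints" idea could look roughly like this. It is a Python sketch, not jq, and everything in it is an assumption made for illustration: the function names are hypothetical, and the table is deliberately incomplete, listing only the Unicode blocks dedicated to combining diacritical marks (U+0300..U+036F, U+1AB0..U+1AFF, U+1DC0..U+1DFF, U+20D0..U+20FF, U+FE20..U+FE2F). A real table would need many more ranges.

```python
import json

# Illustrative, incomplete table of [first, last] codepoint ranges.
# JSON has no hex literals, so the bounds are written in decimal.
COMBINING_RANGES_JSON = (
    "[[768, 879], [6832, 6911], [7616, 7679], [8400, 8447], [65056, 65071]]"
)
COMBINING_RANGES = json.loads(COMBINING_RANGES_JSON)

def is_combining(cp):
    """True if codepoint cp falls in one of the table's ranges."""
    return any(first <= cp <= last for first, last in COMBINING_RANGES)

def group_characters(codepoints):
    """Group codepoints into [base, mark, ...] lists."""
    groups = []
    for cp in codepoints:
        if groups and is_combining(cp):
            groups[-1].append(cp)
        else:
            groups.append([cp])
    return groups

def reverse_grouped(codepoints):
    """Reverse the groups, keeping the codepoints inside each group in order."""
    return [cp for group in reversed(group_characters(codepoints)) for cp in group]

# 97 = "a", 772 = U+0304 combining macron, 98 = "b"
print(group_characters([97, 772, 98]))
print(reverse_grouped([97, 772, 98]))
```

Keeping the table as data means the grouping logic stays small and the ranges can be updated without touching code.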
