Implement #qC string literals #1327
I hadn't settled because I wasn't that picky and wanted to hear other opinions, but I don't see how |
So the opening syntax would be something like |
Good point. So do opening like |
If we did the opening like |
Python does have a regex library on PyPI that can do recursive regexes like Perl. It has the same license as Python. I wonder if we can just drop that in without rewriting rply. |
A bold idea. I'll try it out. |
a621754 to 9e71c00
It works! (https://github.com/Kodiologist/hy/tree/hashstrings-recursive-regex) The catch is that |
Maybe I could achieve this instead with a |
Probably yes. This would make it so easy to quote code from any language with a balanced delimiter pair, which is most of them. (Even J balances parentheses! [Edit: no it doesn't.]). I was even considering gating the feature (e.g. like we do for Is a rule subclass like that part of rply's public interface? If not, this is a harder call, since it would make updating rply more difficult if its implementation ever changes. |
This is making me reconsider the choice of |
It isn't, but I'm not too worried about that, because I'll only be making assumptions about a small amount of the code, and the author of rply seems pretty responsive.
Sounds good to me. |
hy/lex/lexer.py
Outdated
# Unicode General Category "Pi" with the matching closing mark | ||
("«", "»"), ("‘", "’"), ("‛", "’"), ("“", "”"), ("‹", "›"), ("⸂", "⸃"), ("⸄", "⸅"), ("⸉", "⸊"), ("⸌", "⸍"), ("⸜", "⸝"), ("⸠", "⸡"), # noqa | ||
# BidiBrackets.txt | ||
("(", ")"), ("[", "]"), ("{", "}"), ("༺", "༻"), ("༼", "༽"), ("᚛", "᚜"), ("⁅", "⁆"), ("⁽", "⁾"), ("₍", "₎"), ("⌈", "⌉"), ("⌊", "⌋"), ("〈", "〉"), ("❨", "❩"), ("❪", "❫"), ("❬", "❭"), ("❮", "❯"), ("❰", "❱"), ("❲", "❳"), ("❴", "❵"), ("⟅", "⟆"), ("⟦", "⟧"), ("⟨", "⟩"), ("⟪", "⟫"), ("⟬", "⟭"), ("⟮", "⟯"), ("⦃", "⦄"), ("⦅", "⦆"), ("⦇", "⦈"), ("⦉", "⦊"), ("⦋", "⦌"), ("⦍", "⦐"), ("⦏", "⦎"), ("⦑", "⦒"), ("⦓", "⦔"), ("⦕", "⦖"), ("⦗", "⦘"), ("⧘", "⧙"), ("⧚", "⧛"), ("⧼", "⧽"), ("⸢", "⸣"), ("⸤", "⸥"), ("⸦", "⸧"), ("⸨", "⸩"), ("〈", "〉"), ("《", "》"), ("「", "」"), ("『", "』"), ("【", "】"), ("〔", "〕"), ("〖", "〗"), ("〘", "〙"), ("〚", "〛"), ("﹙", "﹚"), ("﹛", "﹜"), ("﹝", "﹞"), ("(", ")"), ("[", "]"), ("{", "}"), ("⦅", "⦆"), ("「", "」")) # noqa |
There's no need to list all these out. https://docs.python.org/3.6/library/unicodedata.html
unicodedata can tell which characters are Pi, but not identify the matching end-quote characters, nor does it seem to have BidiBrackets.txt.
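To make that limitation concrete, here is a minimal sketch using only the standard library. The helper name `initial_quotes` is hypothetical and not part of this PR. It enumerates the characters whose General Category is Pi; as far as I know, `unicodedata` has no call that returns the matching closing mark, so the pairing would still have to come from a hand-written table or from BidiBrackets.txt.

```python
import sys
import unicodedata


def initial_quotes():
    """Return every character whose Unicode General Category is "Pi".

    Hypothetical helper for illustration only: it finds the opening
    quotation marks, but nothing here says which closer pairs with each.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pi"
    ]
```

Calling `initial_quotes()` yields the opening marks (for example the left guillemet and the left curly quotes), which covers only half of each pair in the table under review.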
9e71c00 to ef325e2
Okay, |
I just realized, this will prevent us from using any other tag macro symbols that start with The obvious fix is to make them use the same rules as other tag macros, with whitespace when it would be ambiguous, so it would be But |
(Kek, I can't keep my own syntax straight.) |
Quite so.
I would rather not allow or require this whitespace. It looks weird and leads to a construct like |
Removing the plain style is fine with me. It seems kind of redundant given the balanced style. The balanced style works fine with The remaining styles are still a problem though. Naked symbols are currently allowed to contain Do we need to support so many bracket types for the balanced style? Users could be confused about which characters are allowed in tags. Are the three bracket types enough? If not, would the addition of Is requiring the separation so bad for the pointy style? It's clear what's in the delimiter from the |
Even if the only change from what I have now is to remove the plain style, the only way you could get your tag macro shadowed is if its first character is
ef325e2 to c55fc71
@kirbyfan64 @tuturto Can you guys weigh in on what you would accept? |
Out of all options, I like balanced style best. |
@tuturto Yeah, but see the back-and-forth between me and Matthew above. Would you accept dropping the plain style but still not requiring or allowing whitespace after the |
Plain style can go. I don't have a strong opinion about whitespace, but I'm actually leaning slightly towards requiring it. For me, either option (whitespace or no whitespace) is OK. |
Okay, thanks. @kirbyfan64? |
Looks like we're Ryanless for this PR. @gilch, would you accept a spaceless balanced style with all the delimiters allowed here? You're proposing a different PR that will prevent calling tag macros whose names begin with |
I did notice that, and originally thought about implementing it differently. But Clojure also does it that way. I checked. But I'd rather not add any more exceptions on what's allowed in tags. (I think it would be nice if Hy's parser could read EDN. This would make it easier to use a nice exchange format with the Clojure ecosystem. With the addition of the I've also become very worried about how any new string syntax will interact with our tooling. Hy is currently similar enough to Clojure that we can use Clojure editors on Hy with fairly good results. It's too bad Clojure doesn't have something like this already. If we add something new, it's very important that vim-hy and hy-mode (with Parinfer!) can support it. So I'd rather keep the new syntax simple. The Balanced Style would work in most cases, but for arbitrary strings, we need some kind of custom delimiters. (Can we escape a terminating bracket?) I think we need to discuss it more, and try out some of these proposals with the editors before we put it on master. |
Don't worry, I've gotten myself utterly confused, too. |
Yeah, this is a mess. I still have no clue what the hell "pointy style" is. I think we just need to start from the top. How about we start from the top... My understanding is that the current state of the code is that it looks like Here's my belief: Why is the Lua syntax awesome? For starters, it's dead-simple if you're not using custom delimiters (which is the case most of the time). e.g.
In addition, it plays nicely with Python's one way to do it philosophy: it allows just enough flexibility, but most of the time it'll end up being roughly the same thing. |
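For readers unfamiliar with the Lua syntax referred to here: Lua's long-bracket strings open with `[[` and close with `]]`, and a custom delimiter is made by putting the same number of `=` signs between the brackets at both ends. Below is a minimal sketch of recognizing that form with Python's `re`. It models Lua's own syntax, not necessarily the exact spelling adopted for Hy, and `read_lua_string` is a hypothetical helper name. The sketch also ignores Lua's rule of skipping a newline directly after the opening bracket.

```python
import re

# "[", N "=" signs, "[", body, "]", the same N "=" signs, "]".
# The backreference \1 forces the closing level to match the opening level.
LUA_LONG_STRING = re.compile(r"\[(=*)\[(.*?)\]\1\]", re.DOTALL)


def read_lua_string(text):
    """Return (body, end_index) for a long-bracket string at the start of text.

    Returns None when text does not start with a complete long-bracket string.
    """
    match = LUA_LONG_STRING.match(text)
    if match is None:
        return None
    return match.group(2), match.end()
```

With no custom delimiter, `read_lua_string("[[hello]] rest")` gives the body `hello`. With a level-2 delimiter, `read_lua_string("[==[a]]b]==]")` gives the body `a]]b`, because a bare `]]` no longer terminates the string.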
Also, I'm not sure what this has to do with list syntax. Isn't everything handled in the lexer anyway? |
It looks like
No. The current state of the code allows three styles, none of which allow
Matthew wants a syntax that does not restrict what tag macros you can call, compared to what's already the case. |
Actually, that's not quite true. The real reason |
What @Kodiologist said.
Like
That's only if we don't have the
So my current best proposal is both
is just like But unlike Lua, you have to start with And
This is like Proposal A from #1287. It works just like the normal double-quote style, but since it's paired, you can have It's really too bad double quotes aren't paired as a historical artifact from typewriters, or we'd already be doing this in Python. But given the |
I endorse this proposal and will implement it if it gets sufficient traction. |
Although, I think that allowing unescaped nested guillemets is probably a bad idea because (1) it complicates the lexer, (2) it will complicate syntax coloring (which will no longer be possible with non-recursive regexes alone), and (3) you're much less likely to need nested guillemets than you would nested parentheses or square brackets or whatever, and if you do, there's still the Lua style. But if we really need it, at least I've already gotten most of the code down. |
Same. (TBH I still don't really grasp the use case for proposal 2, but I don't feel like figuring out anything else...) |
@gilch It looks like at least the three of us (you, me, and Ryan) agree on your proposal. Can you just comment on the question of nested guillemets? Then I'll write it up and make a new PR for it. |
We seem to agree on the Lua strings, at least.
Not too much though. And it could get better in the future. I wonder what the chances are of that
A very important concern. We do not want to make tooling for Hy difficult to implement. Yes, we'd need the equivalent power of a pushdown automaton. But PDAs are old technology. This is a solved problem. Both Vimscript and Elisp are Turing Complete. They can handle it even if their regexes can't. Python has been called "executable pseudocode". We could publish the Python algorithm to match these strings in our docs under a public domain license for future tool writers to copy in whatever language they prefer. We could publish the equivalent PCRE recursive "regex" too. Both Perl and Ruby have balanced-style strings with the same issue. How much of a problem is it for them, really? How does their tooling handle it? Any general-purpose editor or syntax highlighting library that can handle those properly could also handle Hy with an appropriate script. If you're using an editor that can't, you could escape them anyway. (And consider getting a better editor.)
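As a rough idea of what such a published algorithm could look like, here is a minimal sketch of a depth-counting matcher for a balanced style. It assumes single-character delimiters and no escape mechanism; both are assumptions made for illustration, and `read_balanced` is a hypothetical name rather than anything in Hy or rply.

```python
def read_balanced(text, start, opener, closer):
    """Read a balanced-delimiter string that begins at text[start].

    Returns (body, end), where end is the index just past the closer that
    balances the opener at text[start]. Raises ValueError if text[start]
    is not the opener or if the string is never closed.
    """
    if text[start:start + 1] != opener:
        raise ValueError("expected opening delimiter")
    depth = 0  # the only state a pushdown matcher needs for one delimiter pair
    for i in range(start, len(text)):
        ch = text[i]
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    raise ValueError("unterminated balanced string")
```

A single counter is enough because only one delimiter pair is tracked at a time, so the same loop should port easily to Vimscript or Elisp.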
True, but I think paired string delimiters imply this feature. Not having it would be weird. There might be use for it in i18n. It does seem like a lot of problems for a feature we'd rarely use. I'd also be okay with only the Lua strings though. Guillemets are an interesting choice. If we're using Unicode anyway we could have used the English-style 66/99 quotes with I'd like to think about this some more. |
I think I've got a good alternative. Implementing #1117 would allow us to create |
I can't speak for Ruby, but the complexity of Perl's quoting forms tends to be behind the corner cases that Perl syntax highlighters have trouble with. That's probably the hardest Perl feature to highlight, except perhaps the magic variables with weird names like
Agreed, and hard to distinguish from ASCII double quotes, too.
Oh dear, that sounds like a recipe for insanity. If annoying or confusing people who use guillemets in the opposite direction in their native language is a concern, it's probably better to choose different characters, like My first objection to quoting with vertical bars would be, how does the parser tell whether |
This does mean we can no longer write |
FWIW is there a reason we can't just use PLY? It works much better with things like this. |
I'd rather not rename a bunch of operators when we could just use other syntax. So, if there are no objections, I'm going to implement Lua style and a guillemet style supporting nesting. We've rather drawn this out, so it will be nice to conclude it. |
We'd have to do that for #1117 anyway, which is important for a lot of other issues. I'd rather not have three kinds of string literals when two will do. A change in grammar is a much bigger deal than renaming some core operators. And it's confusing that bitwise-not is the same symbol as unquote, so I want to change that one anyway. For example,
Does the above immediately call Freeing up And if we free up [Edit: and We can leave the other bitwise operators alone. |
The grammar has to change in any case in order to implement a new form of symbol quoting or string literal or whatever. Wouldn't it be better to quote with some syntax that doesn't require changing an operator from the Python name, like And seeing as people are going to use funny characters in strings more often than in symbols, shouldn't we have the default for the syntax be a string rather than a symbol? |
Related to that last point, easier entry of symbol with weird names doesn't actually need a new quoting syntax—it would suffice to add another prefix to string literals that makes the result a symbol instead of a plain string. |
I don't know, not having used it, but rewriting the lexer and parser is out of the scope of this PR. |
@kirbyfan64 PLY has a compatible BSD license, so I don't know of a reason. In what way is it better though? Can it do this kind of nested parsing we've discussed for the balanced styles any better without adding a new regex engine dependency? If so, and if we settle on a balanced style, it might be worth it. But @Kodiologist pointed out that any balanced style would complicate our tooling. |
But it would be a less complex grammar with only the two string literals and the arbitrary symbol syntax, than with three string literal types and the arbitrary symbol syntax.
I actually think the I'd rather keep
We've already discussed A and B, but one of them is paired and the other isn't.
No, because we want to be able to use funny characters in tag macros. We've already got the super-short It feels like we're quibbling over saving one character. Compare the overhead of Lua We could perhaps have I'm still not getting the use case where neither So, compared to Lua Style, it's not easier to read (on Emacs). It's not easier to write (unless you have a foreign keyboard, and even they require AltGr). It's not easier to implement tooling for. Many terminals can't print Unicode, so it's not very good for the command line either, and Unicode is still hard to type. I could maybe see |
For what it's worth, I use the bitwise operators much more often for vectorized logic with NumPy or Pandas objects than for actual bitwise logic.
Yeah, we've been bikeshedding mercilessly about this since the beginning. Somebody's gotta give in order for this to end. So, I give. Would you accept a PR that adds the Lua style (without nesting of the customized delimiter) and doesn't do anything else? |
I didn't see the discussion as trivial. The grammar is something we should try to keep simple and understandable. Changes here should be carefully thought out in the context of the whole language instead of blindly accepting the first thing that seems like a good idea. I felt like the options have been improving because of the discussion. I saw good points that I hadn't thought of on my own.
A single change could probably get approved faster than two. Which version exactly are you proposing? I'd accept the version like This is not a balanced style, so a pure FSM regex engine could highlight it, e.g. |
Yeah, that one. |
Closes #1287.
There's no documentation or NEWS update yet.
Note that matching delimiters are supported, but not inner pairs of matching delimiters (e.g., #q(()) is parsed the same as "(" )). This is because the lexer is regex-based, and Python doesn't implement recursive regexes.
@gilch, I didn't reserve a character for indicating multi-character delimiters since you hadn't seemed to settle on [, =, or <.
Matching delimiters are supported. Multi-character delimiters are indicated with <.
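The `#q(())` behaviour mentioned in the description can be reproduced with a small sketch. The pattern below is a hypothetical stand-in for a non-recursive lexer rule, not the rule actually used in hy/lex/lexer.py, and `lex_q_paren` is an illustrative name.

```python
import re

# Hypothetical non-recursive rule: "#q(", the shortest possible body, ")".
Q_PAREN = re.compile(r"#q\((.*?)\)", re.DOTALL)


def lex_q_paren(source):
    """Return (string_body, remaining_source), or None if there is no match."""
    match = Q_PAREN.match(source)
    if match is None:
        return None
    return match.group(1), source[match.end():]
```

`lex_q_paren("#q(())")` returns the body `(` with a stray `)` left over, which is the "parsed the same as `"(" )`" result: without recursion or a depth counter, the pattern cannot tell the inner closing parenthesis from the outer one.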