Long single line literal string #827

Closed
teucer opened this issue Jun 10, 2021 · 21 comments

Comments

@teucer

teucer commented Jun 10, 2021

I have a long single line literal string. Is there a way to input it multi-line and trim it, as with multi-line basic strings?

@eksortso
Contributor

eksortso commented Jun 12, 2021

Sadly, as long as you keep it a literal string, there isn't. There are no escapes in MLL strings, so the line-ending backslash syntax available in multiline basic strings is not available.

You could convert your literal string to an MLB string by doubling up your backslashes and changing the quotes. Then you can split it into multiple lines with line-ending backslashes, making sure to keep significant whitespace before each backslash.

Simple example, inspired by dozens of my former coworkers over the years:

mypath = 'C:\src\RepoCity\bighunkymonorepo\Regional\JPJK\rpts\Custom Reporting Projects\eee\eee\Class1.cs'

mypath_readable = """
C:\\src\\RepoCity\\bighunkymonorepo\
\\Regional\\JPJK\\rpts\\Custom \
Reporting Projects\\eee\\eee\
\\Class1.cs"""

@teucer
Author

teucer commented Jun 13, 2021

The whole value add of literal strings is to avoid escaping. I believe it would be useful to handle this case. Maybe a special character after the triple quotes, e.g.

key = '''
blah 
bluh
'''-

@eksortso
Contributor

I believe it would be useful to handle this case.

@teucer By that, you mean you'd want it so that instead of preserving newlines, a MLL string with a special modifier would just concatenate all the lines into a single line? That sort of thing?

It's a compelling idea. I admit that I would find that very useful for regular expressions and Windows paths. Perhaps a solid case for it could be made here?

It would be better if a special character came before the triple quotes, so the parser (and other users!) could catch it right away. A symbol that suggests concatenation, like a + plus sign, would work better. The plus is commonly used by many languages to "add" strings to the ends of other strings.

Let's take my previous example and try this on. Would this look better to folks? Would it be more practical? Or could it be more confusing or just redundant?

mypath_perhaps = +'''
C:\src\RepoCity\bighunkymonorepo
\Regional\JPJK
\rpts\Custom Reporting Projects
\eee\eee\Class1.cs'''

@teucer
Author

teucer commented Jun 14, 2021

I was inspired by jinja2, where "-" is used to suppress whitespace. Your proposal works as well. I would really like to see something like this.

One issue is how to signal intentional whitespace. For example, consider:

key = +'''
blah 
bluh
'''

If my goal was to have 'blah bluh', how can I achieve that?

@eksortso
Contributor

One issue is how to signal intentional whitespace. For example, consider:

[...]

If my goal was to have 'blah bluh', how can I achieve that?

All whitespace other than newlines ought to be preserved. So your example would yield a value of blahbluh. We shouldn't add spaces, because these strings must remain as literal as possible, without doing any more violence to that word than we're already doing.

So an explicit space must be included in order to get blah bluh. This would work, and it's obvious what we're doing:

key = +'''
blah
 bluh
'''

Two things to note here. The first is the closing triple quote at the end, on a line of its own. In a normal MLL string, that would put a newline on the very end of the string. I bring this up because I see it too often in examples. In the case of a concatenated MLL, it would make no difference.

The second thing is subtle, so read carefully. I am tempted to change the proposal so that instead of just newlines, all the trailing whitespace up to and including the newlines would be removed. That would discourage text that looks like this:

# NOTE: single space after blah.
key = +'''
blah 
bluh
'''

But that resembles the line-ending backslash syntax in MLB strings without being consistent with it. And it's harder to explain. So let's just remove newlines, and remember that people who use significant trailing whitespace are only causing themselves trouble anyway.
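The two candidate rules discussed here can be compared with a small sketch. Nothing below is part of TOML: `concat_literal` and its `trim_trailing` flag are hypothetical names, and the function only models what a parser might do with the body of a proposed `+'''` string.

```python
def concat_literal(body: str, trim_trailing: bool = False) -> str:
    """Join the lines of a hypothetical +'''...''' body into a single line.

    body is the raw text between the opening and closing quotes.
    trim_trailing selects the stricter variant, which also drops spaces
    and tabs at the end of each line before joining.
    """
    body = body.replace("\r\n", "\n")
    # Assumption: as in existing multi-line strings, a newline directly
    # after the opening quotes is not part of the value.
    if body.startswith("\n"):
        body = body[1:]
    lines = body.split("\n")
    if trim_trailing:
        lines = [line.rstrip(" \t") for line in lines]
    # No escapes are processed and no separator is inserted.
    return "".join(lines)


explicit_space = concat_literal("\nblah\n bluh\n")
kept_trailing = concat_literal("\nblah \nbluh\n")
dropped_trailing = concat_literal("\nblah \nbluh\n", trim_trailing=True)
print(repr(explicit_space), repr(kept_trailing), repr(dropped_trailing))
```

Under the newline-only rule the trailing space after `blah` survives, while the stricter variant silently discards it, which is the difference being weighed in this comment.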

@teucer
Author

teucer commented Jun 14, 2021

Regarding the last bit, I agree that it is important to ignore all trailing whitespace:

  1. The spec should not become highly sensitive to whitespace
  2. I would not have to change my settings in VS Code to disable the trimming of whitespace 😄

I think we have a good proposal, what are the next steps?

@eksortso
Contributor

Personally, I would give the concept a brand new name, because it is different than anything seen so far. It's not single-line, based on form alone. It's not really multiline, because all trailing whitespace and newlines would be ignored when parsed. Lacking a better name, I'd call it a concatenated string, or concat string for short.

Then we'd need community feedback to determine if this is a viable concept or not. We already have four ways to write strings. Do we need this fifth way? We have to convince others that it's worth the additional overhead. @teucer and I are just two people, and although I've contributed time and code in the past, I still must vet every proposed change. Hoping more people will react, one way or another.

We must prepare to answer any and all questions that come our way, and adjust accordingly. Concat strings as we've so far defined them begin with +''', end with ''', and permit no escape sequences. Why shouldn't +""" and """, i.e. the same thing but with double quotes, allow for escape sequences or line ending backslashes? Why not "basic concat strings" and "literal concat strings"? Users would expect these things to be allowed, given the precedents.

Once there's some consensus, we'd compose a PR to match. This involves updates to spec text, examples, and the ABNF code. This, presumably, would be the easiest part, from my perspective at least. Perhaps just adding an optional concat-symbol token in the appropriate places would do the trick for the syntax. That can lead to further questions.

@teucer
Author

teucer commented Jun 15, 2021

Is there an official process (like PEPs) to propose changes and ask the community for feedback? I don't know where to start.

PS: I like the name concat string

@ChristianSi
Contributor

One quick note: I would consider -''' a better start marker than +'''. + suggests that something is added, - that something is subtracted. We subtract, or remove, something here, namely linebreaks and possibly trailing whitespace, so let's not confuse people by using the wrong kind of sign.

@eksortso
Contributor

We're doing the process right now! We don't often talk about a proposal with a million little facets, so a formal PEP-like process doesn't exist. But I will admit that when we do (#499, #553), I wish it did!

We arguably could use a PR's first comment as a makeshift TOML Enhancement Proposal document. For a great idea with some momentum behind it, it gets the ball rolling. Or if there's no great support for the idea, it just kills the momentum.

But feature requests must start here, in the issues. It's bad practice in the open-source world to just dump code updates in folks' laps and expect a favorable reception. We've got to talk it out first.

We could appeal for action from the project leads, but it's way too early for that now.

That's enough meta-process. Hope this was a useful tangent.

@arp242
Contributor

arp242 commented Jun 15, 2021

While I can certainly see how this is useful, the O ("obvious") part of TOML is becoming less and less with every addition like this. Especially when it comes to strings things are already somewhat non-obvious now (unfortunately, it's too late to change the semantics on that now, because compatibility).

key = +""" ... """ (or -""") is nice and concise, but a more obvious signal wouldn't be a bad idea IMHO. I can't think of something that I really like right now, but think in the direction of something like key = oneline|""" ... """. This can also be extended with e.g. key = trim|""" ... """.

Is this proposal a good trade-off of obviousness vs. usefulness? I don't know; I'm not so sure. Certainly for me one of the advantages of TOML is that you can send it to pretty much anyone vaguely computer-literate and they don't have too many problems with it (unlike, say, YAML).
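To make the named-prefix idea concrete, here is a minimal sketch of how a parser might dispatch on such a prefix. The modifier names come from the comment above, but their exact behaviour is not specified in the thread, so both functions below are assumptions: `oneline` removes newlines, and `trim` strips whitespace from both ends of the value.

```python
from typing import Callable, Dict


def oneline(text: str) -> str:
    # Assumed meaning: join all lines without inserting a separator.
    return "".join(text.split("\n"))


def trim(text: str) -> str:
    # Assumed meaning: drop leading and trailing whitespace of the whole value.
    return text.strip()


# Hypothetical registry; a real grammar would fix this set in the spec.
MODIFIERS: Dict[str, Callable[[str], str]] = {
    "oneline": oneline,
    "trim": trim,
}


def apply_modifier(name: str, text: str) -> str:
    """Apply the named string modifier, rejecting unknown names."""
    try:
        func = MODIFIERS[name]
    except KeyError:
        raise ValueError(f"unknown string modifier: {name!r}") from None
    return func(text)


joined = apply_modifier("oneline", "a\nb\nc")
trimmed = apply_modifier("trim", "\n  a b \n")
print(repr(joined), repr(trimmed))
```

A registry like this is easy to extend, which is the appeal of the prefix form, but each new entry is also one more behaviour for readers to learn.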

@teucer
Author

teucer commented Jun 16, 2021

I do not have any doubt about the usefulness. This is IMHO a limitation of the current spec and a discrepancy between """ and '''.

While I like -'''...''' (or +'''...'''), key = concat|''' ... ''' is more explicit and might be extended for other purposes.

@ChristianSi
Contributor

ChristianSi commented Jun 16, 2021

My judgment is: nice idea, but too rare a use case to warrant further complication of the spec. This would be a rare beast indeed: a string that's multiline in the file, but represents a single line. If it's a single line, why not write it as such? Editors these days are quite hardy: they won't crash when encountering a line with 100 or 150 characters.

Also, when you want to add linebreaks, why not use a multi-line basic string, which has a facility explicitly designed for this purpose ("line ending backslash")? Backslashes will need to be escaped in such cases, but few strings include so many backslashes that this will be really inconvenient. One possible exception, already given as example, are long filenames – but filenames, even under Windows, can be written with slashes instead of backslashes, so that case doesn't count.

All in all my impression is that TOML's four string types should be flexible enough to cover all usual use cases. There may be exceptions where a fifth type may come in handy, but these exceptions are too rare to justify yet another addition to the spec.

@eksortso
Contributor

Glad to see this activity. I had a very long response written up yesterday that I had to abandon. So please forgive me if I've missed an important conversation point.

@ChristianSi Complex regular expressions get long and hairy with an unwieldy number of backslashes. Since this directly affects me in non-TOML applications, I am inclined to keep pressing for something like this.

Also, string folding is a different but related thing which strips leading and trailing whitespace and replaces newlines with spaces. This is more like the examples that @teucer is using, and is addressed (very coarsely) by YAML.
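The difference between folding and concatenation can be shown with a simplified sketch. The `fold` function below follows only the one-sentence description above (strip each line, join with spaces); YAML's folded scalars have additional rules for blank and indented lines that are not modelled here. Both function names are illustrative.

```python
def fold(text: str) -> str:
    """Simplified folding: strip each line, then join non-empty lines with a space."""
    lines = [line.strip() for line in text.split("\n")]
    return " ".join(line for line in lines if line)


def concat(text: str) -> str:
    """Concatenation as proposed in this thread: remove newlines, add nothing."""
    return "".join(text.split("\n"))


sample = "blah\nbluh\n"
folded = fold(sample)
concatenated = concat(sample)
print(repr(folded), repr(concatenated))
```

Folding suits prose, where a line break stands for a word break. Concatenation suits paths and regular expressions, where an inserted space would change the value.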

So like it or not, there's a recognized user demand for concatenated string syntax that we can't easily dismiss.

@arp242
Contributor

arp242 commented Jun 16, 2021

So like it or not, there's a recognized user demand for concatenated string syntax that we can't easily dismiss.

I'd argue that almost every feature that has ever been added to any language, config file, etc. is useful and driven by user demand. I think "do people want this?" is kind of the wrong question to be asking, because it can almost always be answered with "yes".

Better questions are:

  • What problems does it solve that are hard to solve otherwise?
  • How will it impact the learning curve?
  • How will it impact readability? Especially for people not familiar with TOML?
  • How much potential for confusion is there (i.e. TOML being parsed to strings people weren't expecting)?
  • How much harder will it be to implement?

And probably a few more. Designing these sorts of things is an exercise in trade-offs where it's impossible to satisfy everyone and every use case.

A classic example is \-escapes. Useful? Clearly. But it also comes with downsides, such as people writing stuff like path = "C:\Users\martin" and then having trouble figuring out why it doesn't work. You can use path = 'C:\..', but overall it increases the learning curve and confusion. A good trade-off? Well, you can decide that for yourself; but it certainly is a trade-off. What's beneficial to one user can be detrimental to another.

I don't want to "dismiss" user demand; but I'm not so sure the advantages of +""" .. """ are greater than the disadvantages.

@teucer
Author

teucer commented Jun 17, 2021

Below is my attempt to answer the questions:


  • What problems does it solve that are hard to solve otherwise?

Currently it is not possible to concatenate the lines of a multi-line literal string, whereas it is possible with basic strings. This feature would be really useful if one wants to concatenate and yet avoid escaping everything. Irrespective of the usefulness (which I don't doubt), I think the lack of consistency is the bigger issue here.

  • How will it impact the learning curve?

This is mostly a documentation issue, and the feature we would add does not come with a big overhead IMHO.

  • How will it impact readability? Especially for people not familiar with TOML?

The syntax choice is crucial, key = concat|''' ... ''' might be preferable here.

  • How much potential for confusion is there (i.e. TOML being parsed to strings people weren't expecting)?

The syntax is clear enough that there would be no surprises IMHO.

  • How much harder will it be to implement?

Knowing a little about the internals of a Python library (tomlkit), I think it would be easy to implement.

@eksortso
Contributor

@arp242 Well put. I don't entirely agree, but you're right that there are more important questions to pose.

My time is limited, so I'll respond to your initial questions in a series of posts, and the trade-offs and their consequences will be front and center in each response.

What problems does it solve that are hard to solve otherwise?

The previous examples clearly indicate the increase in readability that line breaks offer without doubling up every backslash. With Windows paths, the advantage is certainly aesthetic. With regular expressions, this new syntax aids comprehension while enhancing efficiency. Imagine that users would never again have to write "\\\\" to express a single backslash, and this mechanism would allow for that clarity of expression. This would be a big win for everyone.

A new, obvious, expression now suggests itself: using a literal backslash as a string modifier! We could have \''' indicate automatic line-ending-backslash behavior. That single character would make a big difference to users dealing with complicated strings. Users would need to be aware of that character, but as already mentioned, those users already deal with these more technical string expressions. (I'll expand upon the "string modifier" concept in a later post, when we revisit the 'blah bluh' string.)

Which leads me to modify the proposal: let us trim both leading and trailing whitespace before concatenation. This allows for internal consistency, which would make learning the concept easier. As for implementation, we can reuse what we already have. Current parsers can handle line ending backslashes within triple-double quotes. Future parsers could use the same line-ending-backslash logic here, after identifying each line's trailing whitespace. Picking up subsequent leading whitespace then would already be accounted for.

I will need to write some examples to illustrate these points. More to come.

@arp242
Contributor

arp242 commented Jun 17, 2021

Oh yes, I absolutely agree that this feature as such has a lot of value. If I would design anything like TOML I would make sure it worked right from the first version as I hate dealing with \ escapes and general muckery with whitespace. That's not something I need to be convinced of, because we can quickly agree on that particular point.

The question is whether adding on this new behaviour/feature on top of the existing behaviour is a good trade-off. My main concern is that a whole bunch of subtly different behaviours is just confusing; TOML already has too many IMHO, and I find the way TOML deals with strings in general is somewhat unfortunate. But it is what it is.

Overall, I feel that "suboptimal but simple" is a better trade-off, but a reasonable case can be made for the other side as well.

Also: I'm not super against this or anything; if this were added tomorrow, I would have no strong issues with that. Even though I'm skeptical it would really make TOML better, I don't think it's a super important issue either. Just to clarify 😅

Imagine that users would never again have to write "\\" to express a single backslash

You can already use ' and ''' for that? I'm not sure if I follow how that relates to the +''' .. ''' proposal?

@ChristianSi
Contributor

@eksortso:

With regular expressions, this new syntax aids comprehension while enhancing efficiency.

I agree that regexes are a good use case for literal strings, but a regex that spans multiple lines? Anyone who frequently uses that kind of thing may be overdoing the regex magic, I feel. More to the point, modern regex syntaxes usually support an "extended mode" (/x) where linebreaks and other whitespace are ignored. So for cases where regexes get really long, a multi-line literal string parsed in extended mode may be the way to go.
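As an illustration of the extended-mode route, the sketch below uses Python's `re.VERBOSE` flag, which plays the role of `/x`. The pattern text stands in for what an application might read from a TOML multi-line literal string; the date pattern and variable names are made up for the example.

```python
import re

# Imagine this text came from a '''...''' value in a TOML file.
pattern_text = r"""
    ^(\d{4})   # year
    -(\d{2})   # month
    -(\d{2})$  # day
"""

# In verbose mode, newlines, indentation and # comments are ignored.
date_re = re.compile(pattern_text, re.VERBOSE)
match = date_re.match("2021-06-10")
parts = match.groups() if match else None
print(parts)

# Caveat: an unescaped space is ignored too, so it no longer matches a space.
words_plain = re.compile(r"blah bluh", re.VERBOSE)
words_escaped = re.compile(r"blah\ bluh", re.VERBOSE)
plain_matches_joined = words_plain.fullmatch("blahbluh") is not None
plain_matches_spaced = words_plain.fullmatch("blah bluh") is not None
escaped_matches_spaced = words_escaped.fullmatch("blah bluh") is not None
print(plain_matches_joined, plain_matches_spaced, escaped_matches_spaced)
```

This only helps when the consuming application controls the regex flags, and literal spaces then have to be escaped or put in a character class.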

So no, I'm still not convinced that we have a convincing use case for yet another string type. TOML has long ago stopped being minimal, but I feel that we must nevertheless resist the urge to re-invent the M as "Maximal".

@eksortso
Contributor

Although I said a few months ago that I would respond to the questions that @arp242 put forward, I'm afraid that I've run out of steam. And since nobody's said anything recently, the appropriate approach may be just to go back to the original question.

@teucer Upon reflection, I think that @ChristianSi has the only sensible approach here:

If it's a single line, why not write it as such? Editors these days are quite hardy: they won't crash when encountering a line with 100 or 150 characters.

If we saw a lot more use cases, or examples in the wild of unwieldy long literal strings, then maybe attitudes towards a special syntax for long single-line literal strings would change, and somebody would suggest a more obvious, and less clever, means of spreading a single long line of text across multiple lines than I've suggested.

@ChristianSi The "extended mode" (/x) isn't an option in the use cases that I had in mind. Most of the time, this regex mode ignores all whitespace, including the space between words. Explaining this to my relatively non-technical users would be more of a headache than it would be worth. They're already dealing with a regex engine imposed upon them. On top of all this, they'd said that they'd like to see new features that traditional regex substitution does not provide. But that was a long time ago, and the work issue is not in my hands any longer. So the one use case that I've got to apply to this issue is no longer any of my concern.

@pradyunsg
Member

Answering OP's question: No, there is not.

Regarding adding some sort of string prefixes that allow you to parse a string with different characteristics... I don't think this is common enough to justify making such a change to strings. I'm going to close this eagerly, but if we see a lot more of a similar concern being raised, we can revisit this then. :)
