Allow \n literals in strings #21

fredsted · 2020-01-28T21:55:03Z

Was looking for a way to make a string in Primi with a literal "\n". Since it was converting \n on string creation, some data is lost, so I've tried to keep the internal state as-is, but return the right state in the getInternalValue and getStringValue methods. However, it seems a lot of places is accessing the →value prop of StringValue directly.

Perhaps there's a better way? Mainly looking for feedback on better workarounds for this... I realize it sucks :)

- Results: Original mechanism: Score: 100.00 Proposed in the PR: Score: 52.75, 1.90x slower than original Regex with neg.lookbehind: Score: 91.62, 1.09x slower than original

smuuf

Hi, thank you for preparing another PR. Although an interesting and sensible proposal, I think this indeed is a wrong approach. Handling these special/escape characters should be done as early as possible (as it is done now in the StringValue::__construct method), because - as you yourself already pointed out - there are (and probably will be more) many places one would have to track and handle separately. (Which could potentially lead to otherwise unnecessary bugs, if not used with caution.)

I ran a few benchmarks this morning and found out that using preg_replace with negative lookbehind could to the same job as your original proposal.

php-cb benchmark here 3cd50e3 with these results:

1. Score: 100.00, 9.5176 ms
        foreach ($strs as $s) {
                $res = originalFn($s);
        }
3. Score: 91.62, 1.09x slower, 10.3871 ms
        foreach ($strs as $s) {
                $res = regexverFn($s);
        }
2. Score: 52.75, 1.90x slower, 18.0424 ms
        foreach ($strs as $s) {
                $res = proposedFn($s);
        }

The original mechanism being still the fastest, mechanism proposed in this PR is almost 1.90x slower. The negative lookbehind is slower only 1.09x - while yielding the same resulting string (if I'm seeing correctly).

I'm leaning towards supporting this kind of regex solution - while keeping the expanding itself in constructor.

Totally different approach would be having some sort of raw-string literals, as eg. Python does. Quite bad I already dedicated the most suitable literal in form of `r"..."; to regex literals.

That could be changed (for example re"..." for regexes and raw strings could be r"...") as Primi is still in 0.3 very-pre-major version, but such decision is not something I'm willing to make on the spot.

fredsted · 2020-01-29T20:05:37Z

Thanks for the response. I agree that the regex is much preferable (I just wanted a quick PoC without fiddling with regex.)

Edit: I had to add an extra replacement to make my tests pass:

preg_replace(['#(?<!\\\)\\\n#', '/\\\\\\\n/'], ["\n", '\n'], $string);

Handling these special/escape characters should be done as early as possible

When e.g. concatenating strings ("\\n"+"\\n"), the escaping is then done twice (since a new StringValue is constructed) and you lose the escaped characters. In my opinion it's also unfortunate to change user data in a non-reversible way. Anything I'm missing here?

Totally different approach would be having some sort of raw-string literals, as eg. Python does. Quite bad I already dedicated the most suitable literal in form of `r"..."; to regex literals.

I think this is also a good solution, but maybe confusing for new users? Depending on whether you want to introduce more escape codes, it might be overkill just for having the possibility to use a newline literal.

smuuf · 2020-01-29T21:02:16Z

When e.g. concatenating strings ("\n"+"\n"), the escaping is then done twice (since a new StringValue is constructed) and you lose the escaped characters. In my opinion it's also unfortunate to change user data in a non-reversible way. Anything I'm missing here?

Oh, you're absolutely right! 😄 It is me, who missed something here...

Edit: Although now I'm not sure if we're talking about the same thing. What I was talking about is that my regex-based solution changed the original behaviour from this:

>>> "\\n"
"\
"

to this:

>>> "\\n"
"\\n"

... which is clearly wrong.

Your new regex solution does this:

>>> "\\n"
"\n"

... which looks ok.

The "\\n"+"\\n" thing is something else entirely, I think. I may have to do some research of how to handle these situations. 🤔 Nice catch! :)

smuuf · 2020-02-03T20:22:17Z

I've given it some thought and - also inspired by how Python handles this - I realized that processing escape sequences doesn't belong even inside the StringValue logic. It shouldn't be the constructor nor in any of its methods.

Handling these special/escape characters should be done as early as possible

I think my hunch still stands. What's currently wrong is the point where Primi does this - it really is too late.

Escape sequences should be processed before the StringValue is constructed. Escape sequences are part of how string literal should be interpreted - what exact result does the string literal yield.

Therefore I think the best course of action is to move string escape-sequence logic into StringLiteral handler:

primi/src/handlers/StringLiteral.php

Lines 18 to 20 in 3b84734

    
           $value = \mb_substr($content, 1, \mb_strlen($content) - 2); 
        
           return new StringValue($value);

The solution I have in mind would - as a result - contain some backwards incompatible stuff.

For example current:

>>> a = "\n\a\\n" + "\n"
"
\a\

" (string f6b2df)

would in the new version ideally result in error:

>>> a = "\n\a\\n" + "\n"
ERR: Unrecognized string escape sequence '\a'

The implementation I'm trying also does these (~ is used to highlight start/end of what the final string value literally contains):

>>> a = "hello"
Literally:
~hello~
Primi representation: "hello" (string 155d72)

>>> a = "\nhello"
Literally:
~
hello~
Primi representation: "\nhello" (string 42bdf9)

>>> a = "\\nhello"
Literally:
~\nhello~
Primi representation: "\\nhello" (string f078a3)

>>> a = "\\nhello\"
ERR: Syntax error @ line 1, position 0
>>> a = "\\nhello\\"
Literally:
~\nhello\~
Primi representation: "\\nhello\\" (string 28fe52)

>>> a = "\\nhello\\\n"
Literally:
~\nhello\\n~
Primi representation: "\\nhello\\\\n" (string 541c2c)

>>> a = "\\nhello\\\"
ERR: Syntax error @ line 1, position 0

The "\\nhello\\\n" seems to have Primi representation built wrong (which is what StringValue::getStringValue should be doing - maybe better naming for this would be getReprValue() 🤔 ... ), because there seem to be some edgecases when escaping multiple consecutive double backslashes. This would need to be taken care of. Edit: I think I have this solved now.

So I was trying different approaches and because of that I already may have some working version of this. But since this is your suggestion and PR, I would gladly let you to implement some similar universal mechanism, if you want.

Any thoughts?

Edit: Oh, and I forgot. It also does this:

>>> a = "\n" + "\n"
Literally:
~

~
Primi representation: "\n\n" (string 96d650)

>>> a = "\n" + "\\" + "\n"
Literally:
~
\
~
Primi representation: "\n\\\n" (string affe7b)

>>> a = "\n" + "\\n" + "\n"
Literally:
~
\n
~
Primi representation: "\n\\n\n" (string 892c3a)

- Now these escape sequences are supported: \\, \", \', \n, \t - Previously "\"" would throw a syntax error. Now it is handled properly. - A literal "\n" (or others) can now be entered by double back-slashing the escape sequence like "\\n", which was previously not supported. - Unknown escape sequences, such as '\a' will now throw an error. - This is a BC break! - Solution for PR smuuf#21. - Handling escape sequences is now done outside of StringValue. It is merely a way of how a string literal is interpreted. This means that "\" + "n" will NOT result in a "\n" sequence and thus in a literal newline. This is a BC break, as far as I know. - New StringEscaping helper with methods: escapeString, unescapeString - Modified and added tests for this BC breaking behavior.

smuuf · 2020-02-18T11:50:57Z

Care to look at this, if it correctly resolves the original idea, before I merge it? @fredsted

fredsted · 2020-02-18T14:38:03Z

Thanks for this, I'll take a look in a couple hours. :)

fredsted · 2020-02-18T18:51:18Z

For me it works nicely, not much else to say, I like your solution!

smuuf · 2020-02-18T20:13:24Z

Ok, thanks :) I'll merge it into master then.

fredsted added 3 commits January 28, 2020 07:51

Allow newline literal in strings

12e0be5

Handle string concatenations

3f44dec

Make tests pass with right values

fc2cd55

fredsted requested a review from smuuf January 28, 2020 21:55

Revert get_val usage

87991c9

smuuf reviewed Jan 29, 2020

View reviewed changes

Use a much nicer regex

801b096

smuuf added 2 commits February 17, 2020 23:23

Chmod +x for executable files.

744115a

smuuf force-pushed the newline_literal_alt branch from 42c33ba to 0664605 Compare February 17, 2020 22:24

smuuf merged commit a245148 into smuuf:master Feb 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow \n literals in strings #21

Allow \n literals in strings #21

fredsted commented Jan 28, 2020

smuuf left a comment

fredsted commented Jan 29, 2020 •

edited

Loading

smuuf commented Jan 29, 2020 •

edited

Loading

smuuf commented Feb 3, 2020 •

edited

Loading

smuuf commented Feb 18, 2020

fredsted commented Feb 18, 2020

fredsted commented Feb 18, 2020

smuuf commented Feb 18, 2020

Allow \n literals in strings #21

Allow \n literals in strings #21

Conversation

fredsted commented Jan 28, 2020

smuuf left a comment

Choose a reason for hiding this comment

fredsted commented Jan 29, 2020 • edited Loading

smuuf commented Jan 29, 2020 • edited Loading

smuuf commented Feb 3, 2020 • edited Loading

smuuf commented Feb 18, 2020

fredsted commented Feb 18, 2020

fredsted commented Feb 18, 2020

smuuf commented Feb 18, 2020

fredsted commented Jan 29, 2020 •

edited

Loading

smuuf commented Jan 29, 2020 •

edited

Loading

smuuf commented Feb 3, 2020 •

edited

Loading