Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow \n literals in strings #21

Merged
merged 7 commits into from
Feb 18, 2020
Merged

Conversation

fredsted
Copy link
Contributor

Was looking for a way to make a string in Primi with a literal "\n". Since it was converting \n on string creation, some data is lost, so I've tried to keep the internal state as-is, but return the right state in the getInternalValue and getStringValue methods. However, it seems a lot of places is accessing the →value prop of StringValue directly.

Perhaps there's a better way? Mainly looking for feedback on better workarounds for this... I realize it sucks :)

@fredsted fredsted requested a review from smuuf January 28, 2020 21:55
smuuf added a commit that referenced this pull request Jan 29, 2020
- Results:
  Original mechanism:        Score: 100.00
  Proposed in the PR:        Score: 52.75, 1.90x slower than original
  Regex with neg.lookbehind: Score: 91.62, 1.09x slower than original
Copy link
Owner

@smuuf smuuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for preparing another PR. Although an interesting and sensible proposal, I think this indeed is a wrong approach. Handling these special/escape characters should be done as early as possible (as it is done now in the StringValue::__construct method), because - as you yourself already pointed out - there are (and probably will be more) many places one would have to track and handle separately. (Which could potentially lead to otherwise unnecessary bugs, if not used with caution.)

I ran a few benchmarks this morning and found out that using preg_replace with negative lookbehind could to the same job as your original proposal.

php-cb benchmark here 3cd50e3 with these results:

1. Score: 100.00, 9.5176 ms
        foreach ($strs as $s) {
                $res = originalFn($s);
        }
3. Score: 91.62, 1.09x slower, 10.3871 ms
        foreach ($strs as $s) {
                $res = regexverFn($s);
        }
2. Score: 52.75, 1.90x slower, 18.0424 ms
        foreach ($strs as $s) {
                $res = proposedFn($s);
        }

The original mechanism being still the fastest, mechanism proposed in this PR is almost 1.90x slower. The negative lookbehind is slower only 1.09x - while yielding the same resulting string (if I'm seeing correctly).

I'm leaning towards supporting this kind of regex solution - while keeping the expanding itself in constructor.


Totally different approach would be having some sort of raw-string literals, as eg. Python does. Quite bad I already dedicated the most suitable literal in form of `r"..."; to regex literals.

That could be changed (for example re"..." for regexes and raw strings could be r"...") as Primi is still in 0.3 very-pre-major version, but such decision is not something I'm willing to make on the spot.

@fredsted
Copy link
Contributor Author

fredsted commented Jan 29, 2020

Thanks for the response. I agree that the regex is much preferable (I just wanted a quick PoC without fiddling with regex.)

Edit: I had to add an extra replacement to make my tests pass:

preg_replace(['#(?<!\\\)\\\n#', '/\\\\\\\n/'], ["\n", '\n'], $string);

Handling these special/escape characters should be done as early as possible

When e.g. concatenating strings ("\\n"+"\\n"), the escaping is then done twice (since a new StringValue is constructed) and you lose the escaped characters. In my opinion it's also unfortunate to change user data in a non-reversible way. Anything I'm missing here?

Totally different approach would be having some sort of raw-string literals, as eg. Python does. Quite bad I already dedicated the most suitable literal in form of `r"..."; to regex literals.

I think this is also a good solution, but maybe confusing for new users? Depending on whether you want to introduce more escape codes, it might be overkill just for having the possibility to use a newline literal.

@smuuf
Copy link
Owner

smuuf commented Jan 29, 2020

When e.g. concatenating strings ("\n"+"\n"), the escaping is then done twice (since a new StringValue is constructed) and you lose the escaped characters. In my opinion it's also unfortunate to change user data in a non-reversible way. Anything I'm missing here?

Oh, you're absolutely right! 😄 It is me, who missed something here...


Edit: Although now I'm not sure if we're talking about the same thing. What I was talking about is that my regex-based solution changed the original behaviour from this:

>>> "\\n"
"\
"

to this:

>>> "\\n"
"\\n"

... which is clearly wrong.

Your new regex solution does this:

>>> "\\n"
"\n"

... which looks ok.


The "\\n"+"\\n" thing is something else entirely, I think. I may have to do some research of how to handle these situations. 🤔 Nice catch! :)

@smuuf
Copy link
Owner

smuuf commented Feb 3, 2020

I've given it some thought and - also inspired by how Python handles this - I realized that processing escape sequences doesn't belong even inside the StringValue logic. It shouldn't be the constructor nor in any of its methods.

Handling these special/escape characters should be done as early as possible

I think my hunch still stands. What's currently wrong is the point where Primi does this - it really is too late.

Escape sequences should be processed before the StringValue is constructed. Escape sequences are part of how string literal should be interpreted - what exact result does the string literal yield.

Therefore I think the best course of action is to move string escape-sequence logic into StringLiteral handler:

$value = \mb_substr($content, 1, \mb_strlen($content) - 2);
return new StringValue($value);

The solution I have in mind would - as a result - contain some backwards incompatible stuff.

For example current:

>>> a = "\n\a\\n" + "\n"
"
\a\

" (string f6b2df)

would in the new version ideally result in error:

>>> a = "\n\a\\n" + "\n"
ERR: Unrecognized string escape sequence '\a'

The implementation I'm trying also does these (~ is used to highlight start/end of what the final string value literally contains):

>>> a = "hello"
Literally:
~hello~
Primi representation: "hello" (string 155d72)

>>> a = "\nhello"
Literally:
~
hello~
Primi representation: "\nhello" (string 42bdf9)

>>> a = "\\nhello"
Literally:
~\nhello~
Primi representation: "\\nhello" (string f078a3)

>>> a = "\\nhello\"
ERR: Syntax error @ line 1, position 0
>>> a = "\\nhello\\"
Literally:
~\nhello\~
Primi representation: "\\nhello\\" (string 28fe52)

>>> a = "\\nhello\\\n"
Literally:
~\nhello\\n~
Primi representation: "\\nhello\\\\n" (string 541c2c)

>>> a = "\\nhello\\\"
ERR: Syntax error @ line 1, position 0

The "\\nhello\\\n" seems to have Primi representation built wrong (which is what StringValue::getStringValue should be doing - maybe better naming for this would be getReprValue() 🤔 ... ), because there seem to be some edgecases when escaping multiple consecutive double backslashes. This would need to be taken care of. Edit: I think I have this solved now.

So I was trying different approaches and because of that I already may have some working version of this. But since this is your suggestion and PR, I would gladly let you to implement some similar universal mechanism, if you want.

Any thoughts?



Edit: Oh, and I forgot. It also does this:

>>> a = "\n" + "\n"
Literally:
~

~
Primi representation: "\n\n" (string 96d650)

>>> a = "\n" + "\\" + "\n"
Literally:
~
\
~
Primi representation: "\n\\\n" (string affe7b)

>>> a = "\n" + "\\n" + "\n"
Literally:
~
\n
~
Primi representation: "\n\\n\n" (string 892c3a)

smuuf pushed a commit to fredsted/primi that referenced this pull request Feb 17, 2020
- Now these escape sequences are supported: \\, \", \', \n, \t
  - Previously "\"" would throw a syntax error. Now it is handled properly.
  - A literal "\n" (or others) can now be entered by double back-slashing
    the escape sequence like "\\n", which was previously not supported.
  - Unknown escape sequences, such as '\a' will now throw an error.
    - This is a BC break!
- Solution for PR smuuf#21.
- Handling escape sequences is now done outside of StringValue. It is
  merely a way of how a string literal is interpreted.
  This means that "\" + "n" will NOT result in a "\n" sequence and thus in
  a literal newline. This is a BC break, as far as I know.
- New StringEscaping helper with methods: escapeString, unescapeString
- Modified and added tests for this BC breaking behavior.
- Now these escape sequences are supported: \\, \", \', \n, \t
  - Previously "\"" would throw a syntax error. Now it is handled properly.
  - A literal "\n" (or others) can now be entered by double back-slashing
    the escape sequence like "\\n", which was previously not supported.
    - Unknown escape sequences, such as '\a' will now throw an error.
    - This is a BC break!
- Solution for PR smuuf#21.
- Handling escape sequences is now done outside of StringValue. It is
  merely a way of how a string literal is interpreted.
  This means that "\" + "n" will NOT result in a "\n" sequence and thus in
  a literal newline. This is a BC break, as far as I know.
- New StringEscaping helper with methods: escapeString, unescapeString
- Modified and added tests for this BC breaking behavior.
@smuuf
Copy link
Owner

smuuf commented Feb 18, 2020

Care to look at this, if it correctly resolves the original idea, before I merge it? @fredsted

@fredsted
Copy link
Contributor Author

Thanks for this, I'll take a look in a couple hours. :)

@fredsted
Copy link
Contributor Author

For me it works nicely, not much else to say, I like your solution!

@smuuf
Copy link
Owner

smuuf commented Feb 18, 2020

Ok, thanks :) I'll merge it into master then.

@smuuf smuuf merged commit a245148 into smuuf:master Feb 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants