Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC FS-1001 POC] String Interpolation [WIP] #6770

Closed
wants to merge 21 commits into from
Closed

[RFC FS-1001 POC] String Interpolation [WIP] #6770

wants to merge 21 commits into from

Conversation

yatli
Copy link

@yatli yatli commented May 18, 2019

Planning

This is the beginning of my take on FS-1001 (fsharp/fslang-design#6, fsharp/fslang-design#368) -- I have just started to code it, but I'd like to know what you think.

In this thread, I will interchangeably call the original formatting specifiers "deferred specifier", and inline interpolations "immediate specifier".

New Tokens and Types

We'll have to introduce new tokens & non-terminals in the parser to accommodate the interpolated string information. Given a string:

image

We want to tokenize the string into:

image

When we parse the token stream:

SIP_LEFT SynExpr SIP_MID SynExpr SIP_RIGHT

In the parser, we can deduce with the following rules:

STRING_INTEROP_HEAD:
| SIP_LEFT SynExpr
| STRING_INTEROP_HEAD SIP_MID SynExpr

STRING_INTEROP:
| STRING_INTEROP_HEAD SIP_RIGHT

Tokenizer Challenges

If we let the hierarchy of nested interpolation slip from the lexer to the parser, we will have to construct ad-hoc parser and start over again. This is not desirable. However, to get a clean token stream considering all the syntax/semantics is a bit challenging. Here's how it's done in Roslyn
https://github.com/dotnet/roslyn/blob/14e0068d9d529ade86c0ed1544477a221809d0e0/src/Compilers/CSharp/Portable/Parser/Lexer_StringLiteral.cs#L222

            // We have a string of the form
            //                $" ... "
            // or, if isVerbatim is true, of possible forms
            //                $@" ... "
            //                @$" ... "
            // Where the contents contains zero or more sequences
            //                { STUFF }
            // where these curly braces delimit STUFF in expression "holes".
            // In order to properly find the closing quote of the whole string,
            // we need to locate the closing brace of each hole, as strings
            // may appear in expressions in the holes. So we
            // need to match up any braces that appear between them.
            // But in order to do that, we also need to match up any
            // /**/ comments, ' characters quotes, () parens
            // [] brackets, and "" strings, including interpolated holes in the latter.

So basically they hoist a non-regular, mini parser that counts braces.

There are two approaches for us:

  1. Follow Roslyn
  2. Embed interpolation states into the main lexer, so that when an unmatched } is encountered, either yield RBRACE, or resume the string lexer to yield either SIP_MID or SIP_RIGHT.

Option 2 is perhaps more maintainable, because it does not drift away from the "real lexer".

Implementing FS-1001

Suppose we have the token stream properly parsed, and we have STRING_INTEROP nodes in the untyped AST. The string formatters are generated at a much later stage, in src/fsharp/CheckFormatStrings.fs, which is only called by the type checker

and TcConstStringExpr cenv overallTy env m tpenv s =

There are two situations here.

  1. The interpolated string does not have a "deferred" specifier.
  2. The interpolated string has one or more "deferred" specifier.
Case 1

No deferred specifier means that we don't have to worry about string formatters. Simply move away from the path stringOrKeywordString -> rawConstant -> constant -...-> later picked up by the type checker, and translate our node as appropriate.

After translation, we should get something like:

image

edit: for the completeness of the feature we have to go for case 2.

Case 2

The harder case. We will have to make a closure to capture the interpolated expressions, and return a string formatter for the "deferred" parts.

let ($) (x: Printf.StringFormat<'a>) = x
let bar = "baz"
// the format string should be StringFormat<string -> string>
let fmt = $ "foo %{bar} %s"

parseFormatString s typeof<'T> :?> _, (2 * count + 1) - optimizedArgCount // second component is used in SprintfEnv as value for internal buffer

The runtime support is here.
If we capture the immediate values in the format, we can translate the format string to indicate the index of the captured values -- for example, %{s:1} means to extract from format.Captures.[1]

Currently, in the front end (printf), I'm putting the captures in the Format objects, and I assume it will be populated by the backend-generated code.
FormatSpecifier now optionally holds the index of the capture target.
The capture arrays are then forwarded to the ctor of PrintfEnv's.
The builder now handles "Begin-with-capture" arguments on the builder stack. (wip)

The Printf.Specialization class is updated with captured value builders. To prevent combinatorial explosion, we restrict that a capture must be the first argument in the argument pack.
In the main builder, each time an immediate spec is encountered, the pack should be "flushed".
New builders in the families FinalCaptureX, FinalFastCaptureX, ... , ChainedCaptureX , ... , LittleAChainedCaptureX, ... are added.
Note that a format specifier can be both immediate capture and special, so we have to extend the main builder for the special form to consider captures.

Your input is much appreciated!

@yatli
Copy link
Author

yatli commented May 18, 2019

Status

  • Front-end:
    • ✔ printf.fs
    • Work out the mini-syntax to extend %O
    • And optionally, allow Format<_> to receive IFormatProvider.
  • Back-end
    • Lexer
      • Stateful lexing for nested interpolation segments
    • Parser
      • Add token types for interpolated strings
      • Add token types for interpolated verbatim strings, i.e. $@"...." and $"""....."""
    • Type inference engine tweaks
    • Codegen tweaks
  • Testing
    • printf tests
      • FormatSpecifier captures are properly forwarded to PrintfEnv
      • PrintfEnv captures are properly recalled into specialized formatters
      • Captured values are correctly formatted without format specifier (defaults to %O)
      • Captured values are correctly formatted, combined with format specifiers
      • Invalid format specifiers for captured value causes exceptions
      • Captured format specifiers work with deferred specifiers
    • Codegen tests
      • Codegen properly replaces the captured expression with its index
      • Codegen properly fills a Format<...> with Captures and Types
    • Lexer/parser tests
    • Type inference engine tests
    • E2E testing on code snippets

Scratchpad

typedExprWithStaticOptimizationsBlock:

Looks like this is the master entry for "r-values"?

In printf.fs, L1271, why isn't ContinuationOnStack maintained accordingly?
I figure ContinuationOnStack is just a constant to denote an imminent cont?

Looks like immediate specs will block the grouping and increase the number of reflections.
(no they won't.)

@dnfclas
Copy link

dnfclas commented May 20, 2019

CLA assistant check
All CLA requirements met.

@forki
Copy link
Contributor

forki commented May 20, 2019

@cartermp looks @yatli is Microsoft employee. Can you please resolve this?

@yatli
Copy link
Author

yatli commented May 20, 2019

Anything I should do at the moment?

@cartermp
Copy link
Contributor

@yatli you'll need to sign the CLA to allow any PR to be accepted to the repo. @dsyme can speak more towards this approach.

@yatli
Copy link
Author

yatli commented May 21, 2019

I've signed the CLA. Working on printf.fs.

@realvictorprm
Copy link
Contributor

cool stuff :)

@cartermp
Copy link
Contributor

Tagging @dsyme for design review

@yatli
Copy link
Author

yatli commented Jun 20, 2019

@realvictorprm @forki I'm wondering if we could team up on this?
Considering that we all tried to attack 1001 before 😀

@realvictorprm
Copy link
Contributor

@yatli convinced, are you on the F# Slack?

@yatli
Copy link
Author

yatli commented Jun 20, 2019

@realvictorprm yeah, I've just joined FSF FSSF :D

@forki
Copy link
Contributor

forki commented Jun 20, 2019 via email

@yatli
Copy link
Author

yatli commented Jun 20, 2019

@forki no worry, one question for now:
Reading through your PR, I see that you've added some tests.
I'm fairly new to the codebase, so I'm wondering if you could give me some pointers -- where to put unit tests for FSharp.Core (now it's relevant in the interpolation), where to put end-to-end test files, where compiler tests, etc.

@cartermp
Copy link
Contributor

cartermp commented Jul 2, 2019

There are a lot of design discussions going on here. Can we capture some of this in an RFC? I think I generally agree with the direction and @dsyme suggestions here. But there are some sub-points that would need a separate design discussion. For example, for verbatim interoplated strings, what about both $@ and @$ (as per C# here: dotnet/csharplang#1630)?

Additionally we'd have to adjust tooling to account for understanding when we're in an interoplated context (and complete the } with brace completion.

@yatli
Copy link
Author

yatli commented Jul 3, 2019

@cartermp yes I agree. There are details that are not covered in FS-1001, some of these points are already well-aligned, and some needs further convergence.

I'm listing some of the well aligned points here (may have missed something):

  • $"....{}..." as the string interpolation form
  • Support verbatim/multiline interpolation (we still have to fix down the exact syntax)
  • ``%O` by-default for interpolated expressions
  • Extend %O to incorporate C# formatting specifiers, for both interpolated/conventional strings
  • Take cultural info into consideration, which further implies supporting IFormatProvider
  • And after some more thoughts I agree with @dsyme about separating the interpolated form and the normal form.

The thing that still doesn't come to its fixed-point in my head is, whether the interpolated form should be a Format<_,_,_,_,_>, or FormattableString (net46+ only)/IFormattable.

The former incorporates well with the printf family, while the latter is more of the "embracing the C# way" style. I prefer the former because, it implies that we plant our support for IFormatProvider into the printf family, which supports both the interpolated form and the conventional form.
On the other hand, if we decide to check interpolated string into FormattableString/IFormattable, then there will be a separation between the localization and custom formatting, for conventional/interpolated forms.

Some complexities in extending the printf family:
There is overlap between IFormatProvider and F# %a specifier, if we plan to extend the printf family, it has to be adjusted accordingly so that the custom formatter receives the format provider, in addition to the contextual writer and the payload.

@dsyme
Copy link
Contributor

dsyme commented Jul 3, 2019

@cartermp I will rewrite FS-1001 today

@yatli
Copy link
Author

yatli commented Jul 3, 2019

@dsyme shall we split out the proposal to extend %O for conventional strings? (e.g. your %(N3)O example)

@dsyme
Copy link
Contributor

dsyme commented Jul 3, 2019

@dsyme shall we split out the proposal to extend %O for conventional strings?

It would be great to get it all in together, as a complete package, but I think it depends on whether you feel it fits into the implementation and testing reasonably.

@yatli
Copy link
Author

yatli commented Jul 3, 2019

I'd love to do it altogether -- just want to make sure that this is indeed the plan :)

@dsyme
Copy link
Contributor

dsyme commented Jul 3, 2019

The thing that still doesn't come to its fixed-point in my head is, whether the interpolated form should be a Format<_,_,_,_,_>, or FormattableString (net46+ only)/IFormattable.

Could we make it that both interpolated and non-interpolated strings are considered compatible with all of these, depending on the known type from type checking?

Existing string literals interpreted as PrintfFormat strings do not include any interpolation fills - the $ must be added to get this.

May I ask what's your concern on this? Indeed, I think, mixing captures and deferred specs is definitely bad when there're dozens of them. More chances of off-by-one (or more) errors.

I think it would just really surprise people if non-interoplated strings could ever have interpolation fills when used as printf-style format strings. Also not backwards compatible is it, if the format for interpolation is %printfFormat{<interpolationExpression>}? For example sprintf "%d{x}" 3 today produces 3{x}.

If possible, I'd very much want interpolated strings to incorporate with xprintf family for its stronger verbal functions (output to streams, print to the screen without first combining the pieces, etc)

Yes, I agree with this

Rule (2) would mean your example would still need $ to interpolate...

if fmt must be followed by a non-curried form, then csprintf becomes a shortcut for FormattableString.Format(some_culture_info), right?

My thinking was that

  • interpolated strings would support interpolation fills and be usable as any of Format<_,_,_,_,_>, or FormattableString or IFormattable.

  • regular strings would not support interpolation fills and are also usable as any of Format<_,_,_,_,_>, or FormattableString or IFormattable.

So you could have

let fmt = csprintf some_culture_info
fmt $"....%d{xyz}..."
fmt "....%d..." xyz

Additionally we'd have to adjust tooling to account for understanding when we're in an interoplated context (and complete the } with brace completion.

yes

There is overlap between IFormatProvider and F# %a specifier, if we plan to extend the printf family, it has to be adjusted accordingly so that the custom formatter receives the format provider, in addition to the contextual writer and the payload.

I think pretty much no one uses %a. If we do want to address this would it be possible to use %z or the like for an extension point also taking IFormatProvider? Would that need to be optional in case it hasn't been provided, or is there a default one that would be passed in.

BTW you can find me on F# Software Foundation Slack and we can chat there if you like

Copy link
Contributor

@dsyme dsyme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding a few comments

src/fsharp/FSharp.Core/.vim/coc-settings.json Outdated Show resolved Hide resolved
src/fsharp/FSharp.Core/printf.fs Outdated Show resolved Hide resolved
src/fsharp/FSharp.Core/printf.fs Outdated Show resolved Hide resolved
@dsyme
Copy link
Contributor

dsyme commented Jul 31, 2019

@yatli Have you been able to make more progress here? Is there anything we can help with?

@yatli
Copy link
Author

yatli commented Jul 31, 2019

@dsyme thanks for getting back :)
was interrupted by a minibatch of deadlines, so I have to put the codebase back into my cache again

@yatli
Copy link
Author

yatli commented Jul 31, 2019

it'll be great if we can do some coding together :)

@dsyme
Copy link
Contributor

dsyme commented Jul 31, 2019

it'll be great if we can do some coding together :)

Can you msg me on FSSF Slack? Maybe late August

@naughtyGitCat
Copy link

Now is 2020, Is there any new progress on string interpolation? It's quite a practical feature😀

@varon
Copy link

varon commented Mar 7, 2020

Please can we get this in. @dsyme / @yatli ?

@dsyme
Copy link
Contributor

dsyme commented Apr 7, 2020

Hi @yatli, I'm thinking of taking another look at this. Would you like to still work on it with me for the next iteration?

@dsyme
Copy link
Contributor

dsyme commented Apr 7, 2020

Hi @yatli, I'm thinking of taking another look at this. Would you like to still work on it with me for the next iteration?

(I'm starting with prototyping the lexer and parser changes)

@yatli
Copy link
Author

yatli commented Apr 8, 2020 via email

@dsyme
Copy link
Contributor

dsyme commented Apr 8, 2020

@yatli I'm going to close this PR and start a new one coming from a feature branch feature/string-interp in this repo (containing what you've done so far). You will be able to send work as PRs to that feature branch if you like.

@dsyme dsyme mentioned this pull request Apr 8, 2020
29 tasks
@dsyme
Copy link
Contributor

dsyme commented Apr 8, 2020

Replaced by #8907

@yatli
Copy link
Author

yatli commented Apr 8, 2020

sounds great!

@dsyme
Copy link
Contributor

dsyme commented Apr 8, 2020

Continued at #8907

@dsyme dsyme closed this Apr 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.