Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement format/interpolation strings #822

Closed
wants to merge 23 commits into from

Conversation

kwshi
Copy link

@kwshi kwshi commented May 7, 2021

Resolves #11.

This PR builds off of @casey's existing work on parsing interpolation strings:

  • Fixes parsing of interpolation fragments
  • Handles un-indenting & escape sequence processing
  • Implements evaluation of interpolation strings.
  • Adds some parser tests

While it appears to work correctly, the code I've written to do so is pretty patchy, and a few more things are needed for reliability & maintainability:

  • Add some evaluation tests?

  • Possibly refactor the un-indenting code.

    Because of the way interpolated strings are structured (Vec<Fragment>), it's not really possible to re-use the existing logic in unindent.rs, which acts on individual &strs. (It's not sufficient to run unindent on each one of the Text fragments individually, because the common indentation needs to be calculated across the entire string, i.e. all fragments, not simply all lines for a given fragment.)

    As such, a lot of the logic of unindent had to be reimplemented, but with a lot more consideration for edge-cases, etc. The resulting code is pretty messy and hard to read, and I have an idea about how to re-architect the FormatString type to be able to re-use the existing unindent code cleanly, but it requires some heavier refactoring, so I've held off on it for now to see if other contributors have any better thoughts/comments on this challenge.

@casey if it's not too much trouble, perhaps you can take a look at this & offer some suggestions on how to tidy up the code? Thanks!

@kwshi
Copy link
Author

kwshi commented May 7, 2021

Some of the code diffs were accidentally caused by my editor auto-running rustfmt on save & I accidentally committed them--I didn't bother picking them out afterward but if you want me to I can revert those!

@casey
Copy link
Owner

casey commented May 8, 2021

Nice! Today's a little busy, but I should definitely be able to check this out tomorrow.

UnterminatedBacktick => {
writeln!(f, "Unterminated backtick")?;
},
UnterminatedFormatString => {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should probably be an UnterminatedFormatBacktick variant as well.

Comment on lines +789 to +804
// TODO:
// - what token should format backticks be?
// - how are format strings dedented?
// - brag about recursive format strings
// - test empty format string interpolation error
// - handle backticks
// - test format string evaluation
// - format strings with text immediatle before and after inteprolation
// start/end
// - test format strings inside of recipe bodies
// - test multi-line format strings inside of recipe bodies
// - test open delimiters with multiple lines inside of recipe bodies
// - test multi-line strings inside of recipe bodies
// - combine backticks and string literals
// - must avoid parsing format string or backtick in shell setting

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO:
// - what token should format backticks be?
// - how are format strings dedented?
// - brag about recursive format strings
// - test empty format string interpolation error
// - handle backticks
// - test format string evaluation
// - format strings with text immediatle before and after inteprolation
// start/end
// - test format strings inside of recipe bodies
// - test multi-line format strings inside of recipe bodies
// - test open delimiters with multiple lines inside of recipe bodies
// - test multi-line strings inside of recipe bodies
// - combine backticks and string literals
// - must avoid parsing format string or backtick in shell setting

I'll comment with the ones in this list that are still to do.

src/assignment_resolver.rs Outdated Show resolved Hide resolved
@casey
Copy link
Owner

casey commented May 9, 2021

This looks great so far.

  • There were some errors from clippy on CI. I have clippy turned up to "diabolical", so sometimes it complains about silly things, in which case I just add an allow to main.rs. The things it's complaining about currently seem reasonable, but if any of them seem unreasonable or the code to fix them is worse, we can turn off those lints globally or on a case-by-case-basis.

  • I use nightly rustfmt for Just, which supports more formatting options. You can do rustup install nightly to install it and clippy +nightly fmt --all to run it. I pushed a commit that fixes the clippy warnings.

  • The bin/lint script that runs on CI will fail if things like 'todo' or dbg!(…) are found in source code. I use this to avoid committing notes and debug code.

  • I feel like the contents of the unindent module can be made generic over both &str and &[StringFragment]. I added an Unindent trait that is implemented for str. I claim (but cannot prove!) that this could be implemented for [StringFragment]. It's getting late, so I didn't give it a shot. It will be trick, because in Unindent::slice, start and end are character indices. I think that for the purpose of these indices, an interpolation would have length 1, and a text fragment would have length text.len().

  • This needs a bunch of integration tests, check out the tests directory. I added tests/format_strings.rs with an test that checks that unknown variables inside format strings produce a correctly formatted error. I like to have integration tests for all aspects of all features, as well as for all errors. For errors, one thing to check for is that the underlined token, if there is one, makes sense. Some test ideas off the top of my head:

    • Circular variable dependency inside format strings: foo := f"{{foo}}"
    • Empty format string foo := f"{{}}"
    • Untermined format string: foo := f"
    • Untermined format string interpolation: foo := f"{{
    • Backtick evaluation tests
    • Unindentation tests
    • Unindentation test that makes sure that escape-sequence produced whitespace is not unindented
    • Escape processing edge case: foo := f"\{{""}}n" # this probably shouldn't be a newline
    • Format strings inside of recipe bodies
    • Weird, multi-line, indented format strings inside recipe bodies
    • Recursive format strings
    • Check that set shell := [f"{{foo}}"] does not work. (The value of set shell must be evaluated without invoking the shell, so no format strings.

    Feel free to add some of these, and I can also add some.

@casey
Copy link
Owner

casey commented May 9, 2021

Just pushed a couple more commits removing the old unindent functions. I promise I'm done pushing to this branch for now.

@kwshi
Copy link
Author

kwshi commented May 10, 2021

Thanks for the tips! I'll take a look & work on this in the next couple days.

@kwshi
Copy link
Author

kwshi commented May 16, 2021

Running into a few uncertainties trying to implement Unindent for &[StringFragment]:

  • slice performance concerns. Since each chunk/fragment will have a variable length, indexing must be computed via a fold over the fragment lengths (e.g., for a chunk [Text("hello"), Interpolation(..), Text("hi")], trying to find the index i=7 requires checking fragment_start <= i && i < fragment_start + fragment.len() for each fragment with fragment_start accumulating over each visited fragment length).

    This operation takes O(n) (cf. O(1) for plain &strs) time. Since unindent calls slice once for each line in the string, unindent's runtime cost would be O(n^2) on format strings. Granted, the constants are small--n is estimated relative to the number of fragments and lines, typically under 100--so the practical impact of this difference may be negligible. But still, I think it could be improved.

    • One possible solution to this drawback is to pre-compute index positions (e.g., do a single, initial fold over the fragments to figure out which fragment each index maps to). The two ways I can think of doing so:

      1. Store a vector directly mapping each index in the range [0, len) (len is the total byte-length of the string, + 1 for each interpolation) to the corresponding position (fragment index, position within fragment for text fragments). Takes O(n) pre-computing, O(n) extra storage space; makes subsequent indexing/slicing operations O(1), and therefore unindent O(n), same as &str.

      2. Store a "boundaries" vector holding the starting index of each fragment ([0, 5, 6] in the [Text("hello"), Interpolation(..), Text("hi")] example). For each given index, binary-search the boundaries vector to determine which fragment to index into, and then index (subtracting offset) within the correct fragment. Takes O(n) pre-computing, O(n) extra storage (but much less--one usize for each fragment instead of for each byte); makes subsequent index/slice ops O(log n), and therefore unindent O(n log n).

      However, both of these options require introducing a wrapper struct to store the pre-computed information, so the trait would have to be implemented not on [StringFragment] directly but on the wrapper struct, making some of the function signatures a little awkward (e.g. slice and from_str returning &Self instead of Self).

    • An alternative solution I'd like to propose is to introduce something like

      enum InterpolationChar {
        Text(char), // or u8
        Interpolation(Expression),
      }

      and implement Unindent on [InterpolationChar] instead--this way, there is a more obvious/direct analogy between &str as a slice of chars (or bytes) and format-strings as a slice of InterpolationChars: each element corresponds to one character, indexing on the string matches indexing on the slice itself, etc.

      I believe this implicitly incurs an extra O(n) storage (for marking the enum tag on each character), but I think the implementation becomes much more elegant/simple/straightforward this way--no more fancy piecewise finagling to compute index locations. What do you think about this?

  • join ownership. The join method signature (and some trait lifetime restrictions) requires it to return an owned type (Self::Output). In particular, given (possibly discontiguous) strings, it has to clone the contents of those strings to a separate location in order to return an owned type (in the str implementation, this is done in the .collect() call). This means that, in order to implement the analogous logic for StringFragment, StringFragment::Interpolation { expression: Expression } also needs to be clone-able. In particular, we will need to add #[derive(Clone)] to StringFragment and Expression. Are we... cool with that? (I'm relatively new to Rust, so I'm not sure what common design patterns are, I just have a vague urge to avoid allowing large things to be cloned here and there.)

@casey
Copy link
Owner

casey commented May 16, 2021

Good points!

I think that the issue with slicing being O(n) isn't something we should worry about. As you mentioned, N is likely to be small enough that even if our algorithm is terrible, it will still be fast enough that nobody will notice. Also, in general, I think it's a good idea to worry about being correct before worrying about being fast. After we have a correct and slow implementation with a bunch of tests, we can replace the correct and slow implementation with a correct and fast implementation.

You propose some good optimizations, but optimizations should always be led by profiling that shows that an operation is slow, needs to be optimized, and that the optimizations actually improve things.

That being said, if there's a cleaner or simpler implementation, then that is worth considering up front.

I think that InterpolationChar would be fairly large. An enum is the same size as its largest variant, plus the enum tag, and Expression is pretty large. It could be made smaller by doing enum InterpolationChar { Text(char), Interpolation(Box<Expression>) }. I would do char instead of u8, since the slicing logic is more complicated for u8, since you would have to respect charter boundaries and panic if an &[InterpolationChar] was sliced mid-character.

I'm kind of ambivalent as to whether an Unindent implementation over &[InterpolationChar] would be simpler than over &[Fragment]. If the complexity is reasonable and confined to one or two functions, I'd be inclined to go with &[Fragment], since it avoids needing a new type. But since you have your hands on the code, I think you're in a better position to decide, I'm happy with both.

Regarding join ownership, adding #[derive(Clone)] and cloning things is fine. I think the same logic around optimization applies here too. The clones are likely to be so fast that nobody will ever notice they're there.

@kwshi
Copy link
Author

kwshi commented Jun 22, 2021

Sorry I've left this to stagnate for a bit-- a bunch of other obligations have popped up in my life, leaving me with little leisure to work on this. I'm not abandoning this PR, but it might be another month or so until I'll be able to come back to it--if anyone else comes across this and wants to pick up the work, feel free! Otherwise, see you in a month.

@casey
Copy link
Owner

casey commented Jun 22, 2021

No worries at all, life happens! And the original issue for this feature was opened in 2016, so what's another month or few.

@casey
Copy link
Owner

casey commented Jan 16, 2023

This has gone pretty stale. Feel free to reopen!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow interpolation inside of backticks and strings
2 participants