Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

forc-fmt: Consider parsing with comments #3789

Closed
kayagokalp opened this issue Jan 16, 2023 · 10 comments
Closed

forc-fmt: Consider parsing with comments #3789

kayagokalp opened this issue Jan 16, 2023 · 10 comments
Assignees
Labels
bikeshedding For bikeshedding trivialities code quality enhancement New feature or request formatter

Comments

@kayagokalp
Copy link
Member

We are facing lots of small issues regarding comment insertion. At the time of #2311 we were still exploring the issue and after maintaining this way of handling comments for couple of months, I think it seems like a better way of handling should be discussed. I am not linking all comment related hot-fix PRs but we had couple of them 😄

Proposal

Let's parse comments and produce a CommentedParseTree just like we are lexing comments (we have CommentedTokenStream vs TokenStream etc). This way re-constructing the comments section would not need:

  1. Search the unformatted code
  2. Search the formatted code

We would be simply formatting comments just like we are formatting other stuff like Tys, exprs etc. I believe this would be a little bit faster and most importantly it could lead to less problems. Currently once we have some edge case stuff happening related to the comments we tend to introduce rather "hacky" solutions as the interaction between formatter and comment handler is not simple enough.

With parsed comments, if we can also collect some sort of context we could still be sure that we are not altering the intentional comment styles (rust classifies comments like this)

Any thoughts on this? cc @FuelLabs/tooling

@kayagokalp kayagokalp added enhancement New feature or request bikeshedding For bikeshedding trivialities code quality formatter labels Jan 16, 2023
@kayagokalp kayagokalp changed the title forc-fmt: Consider parsing comments forc-fmt: Consider parsing with comments Jan 16, 2023
@eightfilms
Copy link
Contributor

+1 on this, swayfmt v2 seems mature enough to work well for most cases but it seems like the search-and-insert method is leading to a few weird behavior surrounding formatting with comments. Right now fixing these edge cases are just small changes but this could become hard to maintain over time.

@mitchmindtree
Copy link
Contributor

Sounds good to me!

This would also address #2357.

Originally, we avoided trying to produce a Commented alternative to the sway_ast::Module due to the sheer number of places that comments can appear within the AST (between every leaf node!) however as we're discovering, the unstructured alternative leaves a lot of edge cases that are difficult to reliably handle in a non-destructive way.

I just had a chat with @bingcicle and after checking rustfmt again for better ideas, he'll start looking into how feasible this is.

The gist of the idea is to mirror the existing AST of sway_ast under a new commented module and replace each field with a commented alternative along the lines of the following:

pub struct Commented<T> {
    pub node: T,
    pub comments: Vec<Comment>,
}

pub struct Comment {
    pub span: Span,
}

Where comments is a list of comments that occur between this node and the previous node.

With parsed comments, if we can also collect some sort of context we could still be sure that we are not altering the intentional comment styles (rust classifies comments like this)

I think inferring the intended "style" of comments like this can be tricky to do reliably. While that CommentStyle enum looks handy, I'm skeptical that it can reliably describe a lot of cases. E.g. while I was chatting with Bing, I found a few cases where rustfmt automatically deletes comments entirely 😅

We might be better off starting by simply collecting comments that precede the node and re-adding them verbatim. If a node has a commented child where the comment style is at all ambiguous, we can avoid formatting the item for now.


Alternative

Thinking on this more, I think there's still a chance we can find a simpler alternative that doesn't require constructing an entire alternative commented AST. I'm imagining something like this:

  1. Before formatting a module begins, construct a CommentMap, i.e. a BTreeMap from spans to comments (I think we already have a type for this somewhere).
  2. Provide a reference to the CommentMap to the Format::format trait method (potentially via the Formatter type).
  3. Provide a CommentMap method that makes it easy to check whether or not a node intersects a comment by checking the node's span against the btreemap.

This approach might be helped by splitting the Format trait into Visitor and Format traits, separating the logic for traversing the nodes from the logic for formatting them, e.g. https://refactoring.guru/design-patterns/visitor/rust/example. This way, before "visiting" a child node, the visitor could check whether or not the child directly follows a comment by checking it's span against the CommentMap, and if so, allow for formatting it as a Commented<ChildNode> rather than a ChildNode.

This might require a less of a significant refactor, but get close to the "structured"-ness of the full commented AST approach? @bingcicle if you're interested in this approach let me know and we can have another chat.

@eightfilms
Copy link
Contributor

eightfilms commented Jan 18, 2023

I'm starting to think the alternative approach might be preferred here for now, even though the commented AST approach would probably be a more structurally sound approach. I think the formatter as it is right now is good for most cases, and perhaps what we need to do is to rethink the heuristics when it comes to formatting commenting. The visitor pattern seems like a great idea - it seems to be what rustfmt is doing as well.

With that said, rustfmt seemingly has similar issues with comment formatting in the same way we do now, leading to a recurring issue of the formatter destructively removing comments without warning. In some cases, this destructive manner could severely impact the devEx of the formatter.

One additional thing we should probably consider here is to define clearly what is and what isn't a valid comment. I'm wary of trying to follow in rustfmt's footsteps here. In the rustfmt guide there is a recommendation to "avoid writing comments on the same line as the braces", and the general sense is that rustfmt is flexible about where you can place your comments, and it will do the rest for you. However, I think leaving options open make it much harder to have a good solution for a very minor improvement in user flexibility. This issue comes to mind and @Braqzen also made a good point there - perhaps we should also think about restricting what is valid in the context of formatting comments while tackling this issue. Doing this will probably help with reducing the destructive behavior we're seeing happen in rustfmt (and sometimes, already in swayfmt) while reducing the scope of work here as well, at the cost of slightly less flexibility when it comes to comment syntax.

@mitchmindtree
Copy link
Contributor

I'm certainly open to restricting certain kinds of comments if it does indeed simplify our implementation, however this restriction should be imposed during parsing, and not only through forc fmt mysteriously deleting invalid comments.

Perhaps for now we can forge ahead with the "Alternative" visitor approach above and get as far as we can handling as many types of comments as we can. During this, we collect examples of comments that we are unable to support in a practical manner, and we propose to the lang team that we disallow these during parsing?

@eightfilms
Copy link
Contributor

Yep, agreed that it should be imposed under parsing and not under formatting.

Perhaps for now we can forge ahead with the "Alternative" visitor approach above and get as far as we can handling as many types of comments as we can. During this, we collect examples of comments that we are unable to support in a practical manner, and we propose to the lang team that we disallow these during parsing?

This sounds like a good way to begin!

@eightfilms
Copy link
Contributor

Provide a CommentMap method that makes it easy to check whether or not a node intersects a comment by checking the node's span against the btreemap

By 'intersect', do you mean any node's span that is in range with a comment's span? How would this play well with for eg. trailing comments that don't intersect?

This way, before "visiting" a child node, the visitor could check whether or not the child directly follows a comment by checking it's span against the CommentMap, and if so, allow for formatting it as a Commented rather than a ChildNode.

If we are incrementally implementing this for each known AST node, it means that parts of comment formatting will still be handled via insertion while we are migrating to a Visitor pattern where we format both the node and the comment in the same pass, correct? In that case, our goal would be to handle whatever comment formatting we can within the initial formatting phase, remove the formatted comments from the CommentMap, then pass it over to handle_comments() to do the remainder. If this is the case, we would also need some ability to 'look back' so we know that there are no other tokens in between the end of a comment span and the start of a node span.

I think the first PR here to make these changes should just be the skeleton for the new pattern, aka introduce an impl FmtVisitor for all nodes, and simply replace all usages of format() with visit(), and within visit() we just call the current implementations - and from there we open new issues to tackle all the missing formatting for the commented versions of the nodes. This would probably help with reviewing big diffs, since we know that there are no logical changes there.

@kayagokalp
Copy link
Member Author

How would this play well with for eg. trailing comments that don't intersect?

I had a similar problem with initial implementation, since we are searching between spans for comments, a comment that is after (or before) all of the nodes in the tree was getting chopped off. I remember working around that by simply inserting two dummy spans one that points to the beginning of the file, and the other pointing to the end of the file.

@eightfilms
Copy link
Contributor

eightfilms commented Jan 20, 2023

I had a similar problem with initial implementation, since we are searching between spans for comments, a comment that is after (or before) all of the nodes in the tree was getting chopped off

This was also a problem I found while digging through old rustfmt issues - they also had problems with trying to format comments between rustc AST nodes...

I remember working around that by simply inserting two dummy spans one that points to the beginning of the file, and the other pointing to the end of the file.

This is good to know! Any other gotchas to be aware of here?

So far, this is what I'm doing:

  1. we used to initialize the CommentMap at the comment insertion step. Instead, lets do it right before formatting now, and have the Formatter hold it as a &mut CommentMap
  2. we then begin formatting. However, we do not implement FmtVisitor - I realized that it isn't really a useful abstraction at the moment. Instead I have a helper function that finds the comments within the comment map that are between a given start and end point - this can be between the end of a preceding span and a start of a following span. This has been particularly useful in certain cases that were hard to format before, like comments between } and else.
  3. we format the node there and then with the comment, and remove the entry from the CommentMap. This mutated CommentMap is later passed to handle_comments() which will format the remaining comments without the already formatted comment. This makes it so we can incrementally introduce the new pattern while still allowing us to use our old way of comment insertion.

A potential downside i'm concerned about here is if we end up with a lot more code, since we're no longer doing everything in handle_comments(), but rather at each impl Format of each node. But I think this might be a clearer way to reason about formatting comments, and we also get access to other properties of Formatter like the indentation when doing it at this stage.

@kayagokalp
Copy link
Member Author

kayagokalp commented Jan 20, 2023

If this would mean we would be looking for bugs in specific item's formatting function, it is still a big win imho :)

eightfilms added a commit that referenced this issue Jan 30, 2023
closes #3853
closes #3574


## Overview

Currently, comment formatting is done post-formatting of the source
code, which can make it difficult to reason about formatting comments in
certain scenarios. There have been a few cases
(#3574,
#2888,
#2649) where comments are often
wrongly indented.

In #3789, there is extensive discussion on an alternative way of
handling comment formatting. It seems like we're at a similar problem
that rustfmt originally encountered since comments are not part of the
AST.

## Approach

Instead of formatting comments _after_ formatting the source code as a
whole w/o comments, we now format comments locally within each `Format`
implementation of each AST node. Few advantages of doing this:

1) We now have the context of the formatter at the point of which the
comments are being added, e.g. we can use `formatter.shape.indent` to
indent comments, especially in odd locations.

2) Easier handling of `LineStyle::Multiline` <> `LineStyle::Inline`
conversions, when comments are involved. You can see this from how we
can check if comments are present and we can decide which `LineStyle` to
use in the `}` and `else` formatting of comments.

To make sure that we don't do too much extra work, we use
`comment_map_from_src` to init a `CommentMap` at the point where
formatting begins instead of during `handle_comments` - we then reuse
this `CommentMap` for the entire pass of the formatting. This is
temporary and the end goal is to deprecate comment insertion
post-formatting the source code, and to do it all in one pass.

To ensure backward compatibility, we _still_ call `handle_comments` for
now until we refactor everything to use the new method. Thankfully we
have extensive tests that will prevent regression while this refactor is
happening.

This will be a gradual refactor and the above 2 issues were selected
since they were difficult to fix with the old model of comment
insertion, but much easier with this new model.
@eightfilms
Copy link
Contributor

Closing this in favor of #3938

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bikeshedding For bikeshedding trivialities code quality enhancement New feature or request formatter
Projects
None yet
Development

No branches or pull requests

3 participants