
hclsyntax: Introduce token-based parse methods #383

Closed
wants to merge 1 commit

Conversation

radeksimko
Member

@radeksimko radeksimko commented May 28, 2020

This change introduces new methods to allow a two-phase approach where tokenization is done prior to parsing.

This is specifically useful in the context of a language server, so it can pass around a single interpretation of HCL blocks (tokens) instead of passing around both block and its tokens.

Related: hashicorp/terraform-ls#125

New methods

func ParseBodyFromTokens(tokens Tokens, end TokenType) (*Body, hcl.Diagnostics)
func ParseBodyItemFromTokens(tokens Tokens) (Node, hcl.Diagnostics)
func ParseBlockFromTokens(tokens Tokens) (*Block, hcl.Diagnostics)
func ParseAttributeFromTokens(tokens Tokens) (*Attribute, hcl.Diagnostics)
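Under the proposed API, a caller could tokenize once and then parse the same tokens. A minimal sketch (ParseBodyFromTokens is the method proposed in this PR and was never merged; hclsyntax.LexConfig is the existing tokenizer):

```go
package main

import (
	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclsyntax"
)

func parseTwoPhase(src []byte, filename string) (*hclsyntax.Body, hcl.Diagnostics) {
	// Phase 1: tokenize (part of the existing hclsyntax API)
	tokens, diags := hclsyntax.LexConfig(src, filename, hcl.Pos{Line: 1, Column: 1})
	if diags.HasErrors() {
		return nil, diags
	}

	// Phase 2: parse the already-produced tokens (proposed in this PR),
	// so the caller can keep the tokens around for later lookups
	body, moreDiags := hclsyntax.ParseBodyFromTokens(tokens, hclsyntax.TokenEOF)
	return body, append(diags, moreDiags...)
}
```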

@radeksimko
Member Author

radeksimko commented May 28, 2020

After testing it a bit more, I think it would be sensible to add some checks to ParseBodyItemFromTokens. I was specifically tripped up by not knowing that the token sequence:

  • must start with TokenIdent, e.g. newline must be stripped if there is one
  • must end with either TokenEOF or TokenNewline, i.e. one has to be inserted if sequence ends with TokenCBrace
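The normalization implied by the two rules above could look like this (a hypothetical helper, not part of the PR; Token and TokenNewline are from the existing hclsyntax API):

```go
// prepareTokens normalizes a token slice before passing it to the
// proposed ParseBodyItemFromTokens.
func prepareTokens(tokens hclsyntax.Tokens) hclsyntax.Tokens {
	// Strip a leading newline so the sequence starts with TokenIdent
	if len(tokens) > 0 && tokens[0].Type == hclsyntax.TokenNewline {
		tokens = tokens[1:]
	}
	// Ensure the sequence ends with TokenNewline or TokenEOF,
	// e.g. when it currently ends with TokenCBrace
	if last := tokens[len(tokens)-1]; last.Type != hclsyntax.TokenNewline &&
		last.Type != hclsyntax.TokenEOF {
		tokens = append(tokens, hclsyntax.Token{
			Type:  hclsyntax.TokenNewline,
			Bytes: []byte("\n"),
		})
	}
	return tokens
}
```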

@radeksimko radeksimko marked this pull request as ready for review May 28, 2020 18:39
@radeksimko
Member Author

I added some checks as mentioned above.

@radeksimko radeksimko requested a review from apparentlymart May 28, 2020 19:31
Contributor

@apparentlymart apparentlymart left a comment

Sorry for the delay in looking at this. This looks good to me! I left a few minor things inline, but overall this looks great.

One thing I was wondering about: the terraform-ls PR you linked to only seems to be using ParseBodyFromTokens so far, and the other methods surprised me a little, because we don't really have any API surface here for extracting an isolated set of tokens from a slice of a file's token stream. I don't mind including them if you have a use-case for them, but I'd like to understand more about how they will be used, since I think this is the first time we've introduced the idea of parsing "fragments" of input as a public API, and the ability to do that may constrain what refactorings are possible in the internals of the parser in the future.

```go
rng := bi.Range()
diags = append(diags, &hcl.Diagnostic{
	Severity: hcl.DiagError,
	Summary:  fmt.Sprintf("Unexpected definition (%T)", bi),
```

Contributor

Usually we've written errors like this with an end-user target audience in mind, so that a caller can use a call to this function to represent the assertion "the user should have written an attribute" and automatically get a good error message if the user didn't provide one.

With that said, I'm not totally sure about that framing for these new functions. It could well be that we consider it a programming error on the part of the caller to pass in tokens representing a block here, in which case I suppose this could be okay although in cases like that HCL has typically used panic rather than error diagnostics so far. 🤔

As a compromise, what do you think about taking the text of the message HCL would normally return if the schema calls for attribute syntax but the user wrote a block, and reducing it to fit what we can determine here without a schema? For example:

Error: Unsupported block type

Blocks of type "example" are not expected here.

(To do this would, I realize, require type-asserting bi to (*Block) first, so I guess there would still need to be a generic fallback for the should-never-happen case of it not being a block, but that could presumably be a panic, because it would only ever happen if there were a bug inside Parser.ParseBodyItem.)

(I have similar feedback for the opposite case of ParseBlockFromTokens above, but I won't write it all out again. 😄 )
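The suggestion above could be sketched roughly like this (hypothetical code, assuming it lives inside the proposed ParseAttributeFromTokens; the diagnostic text mirrors HCL's existing "Unsupported block type" message):

```go
// After parsing the body item, report a user-oriented error if the
// caller's tokens described a block, and panic on anything else,
// since that would indicate a bug rather than bad user input.
switch item := bi.(type) {
case *hclsyntax.Attribute:
	return item, diags
case *hclsyntax.Block:
	diags = append(diags, &hcl.Diagnostic{
		Severity: hcl.DiagError,
		Summary:  "Unsupported block type",
		Detail:   fmt.Sprintf("Blocks of type %q are not expected here.", item.Type),
		Subject:  item.DefRange().Ptr(),
	})
	return nil, diags
default:
	// Should never happen unless Parser.ParseBodyItem has a bug
	panic(fmt.Sprintf("unexpected body item type %T", bi))
}
```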


```go
peeker := newPeeker(tokens, false)

// Sanity checks to avoid surprises
```

Contributor

Let's say "Initial checks" instead here, for inclusiveness. ❤️

@radeksimko
Member Author

radeksimko commented Sep 7, 2020

Regarding ParseBodyItemFromTokens and the two other similar methods:

TL;DR: the language server currently uses these methods, mostly as a form of optimization.

I will try to share some more context below.

The main reason for doing tokenization in a separate step is that the language server needs to know exactly which parsed token an hcl.Pos falls within, so it can provide correct completion data (e.g. complete just ource when the user requested completion at res<HERE>, rather than the whole resource, which would result in resresource).

This can be implemented in a few ways:

  • maintain a single long array of tokens for the whole configuration and perform lookups there
  • maintain tokens for each block/attribute and perform a much more scoped lookup in a much shorter array of tokens representing just that single block/attribute (this is what the language server does today)

With that said, I understand your concerns, and I admit this may be premature optimization which doesn't come cheap from an API perspective. Perhaps a single array is just fine.
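The scoped lookup described in the second bullet could be sketched like this (a hypothetical helper, assuming the caller has kept the short per-block token slice around; hcl.Range.ContainsPos is part of the existing API):

```go
// tokenAtPos finds the token containing pos within the (short) token
// slice kept for a single block or attribute.
func tokenAtPos(tokens hclsyntax.Tokens, pos hcl.Pos) (hclsyntax.Token, bool) {
	for _, t := range tokens {
		if t.Range.ContainsPos(pos) {
			return t, true
		}
	}
	return hclsyntax.Token{}, false
}
```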


I'm currently working on a proposal for a new parser for the language server, which also involves looking into this problem more deeply. I have not figured out yet how to provide accurate completion data within hcl.Expression where the expression is made of more than a single token. This may affect the API a bit too, so I'll chew on that a bit more and come back to this PR.

@radeksimko
Member Author

I'm going to close this as I found a different (more elegant?) way of resolving the same problem. The new API, which was just published in the form of hashicorp/hcl-lang, accepts *hcl.File, which itself holds the raw bytes of each file. That allows me to re-tokenize those bytes on demand, only when really necessary (e.g. when looking up completion candidates) and only for the relevant file.

@radeksimko radeksimko closed this Nov 2, 2020
@radeksimko radeksimko deleted the f-token-parser branch November 2, 2020 10:00