
hclsyntax: Introduce token-based parse methods #383

Closed
wants to merge 1 commit

Conversation

radeksimko
Member

@radeksimko radeksimko commented May 28, 2020

This change introduces new methods to allow a two-phase approach where tokenization is done prior to parsing.

This is specifically useful in the context of a language server, so it can pass around a single interpretation of HCL blocks (tokens) instead of passing around both block and its tokens.

Related: hashicorp/terraform-ls#125

New methods

func ParseBodyFromTokens(tokens Tokens, end TokenType) (*Body, hcl.Diagnostics)
func ParseBodyItemFromTokens(tokens Tokens) (Node, hcl.Diagnostics)
func ParseBlockFromTokens(tokens Tokens) (*Block, hcl.Diagnostics)
func ParseAttributeFromTokens(tokens Tokens) (*Attribute, hcl.Diagnostics)
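Under the proposed API, a caller could tokenize once and then parse the same tokens. A minimal sketch (ParseBodyFromTokens is the method proposed in this PR and was never merged; hclsyntax.LexConfig is the existing tokenizer):

```go
package main

import (
	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclsyntax"
)

func parseTwoPhase(src []byte, filename string) (*hclsyntax.Body, hcl.Diagnostics) {
	// Phase 1: tokenize (part of the existing hclsyntax API)
	tokens, diags := hclsyntax.LexConfig(src, filename, hcl.Pos{Line: 1, Column: 1})
	if diags.HasErrors() {
		return nil, diags
	}

	// Phase 2: parse the already-produced tokens (proposed in this PR),
	// so the caller can keep the tokens around for later lookups
	body, moreDiags := hclsyntax.ParseBodyFromTokens(tokens, hclsyntax.TokenEOF)
	return body, append(diags, moreDiags...)
}
```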

@radeksimko
Member Author

radeksimko commented May 28, 2020

After testing it a bit more, I think it would be sensible to add some checks to ParseBodyItemFromTokens. I was specifically tripped up by not knowing that the token sequence:

  • must start with TokenIdent, e.g. newline must be stripped if there is one
  • must end with either TokenEOF or TokenNewline, i.e. one has to be inserted if sequence ends with TokenCBrace
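The normalization implied by the two rules above could look like this (a hypothetical helper, not part of the PR; Token and TokenNewline are from the existing hclsyntax API):

```go
// prepareTokens normalizes a token slice before passing it to the
// proposed ParseBodyItemFromTokens.
func prepareTokens(tokens hclsyntax.Tokens) hclsyntax.Tokens {
	// Strip a leading newline so the sequence starts with TokenIdent
	if len(tokens) > 0 && tokens[0].Type == hclsyntax.TokenNewline {
		tokens = tokens[1:]
	}
	// Ensure the sequence ends with TokenNewline or TokenEOF,
	// e.g. when it currently ends with TokenCBrace
	if last := tokens[len(tokens)-1]; last.Type != hclsyntax.TokenNewline &&
		last.Type != hclsyntax.TokenEOF {
		tokens = append(tokens, hclsyntax.Token{
			Type:  hclsyntax.TokenNewline,
			Bytes: []byte("\n"),
		})
	}
	return tokens
}
```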

@radeksimko radeksimko marked this pull request as ready for review May 28, 2020 18:39
@radeksimko
Member Author

I added some checks as mentioned above.

@radeksimko radeksimko requested a review from apparentlymart May 28, 2020 19:31
Contributor

@apparentlymart apparentlymart left a comment

Sorry for the delay in looking at this. This looks good to me! I left a few minor things inline, but overall this looks great.

One thing I was wondering about: the terraform-ls PR you linked to only seems to be using ParseBodyFromTokens so far, and the other methods surprised me a little, because we don't really have any API surface here for extracting an isolated set of tokens from a slice of a file's token stream. I don't mind including them if you have a use-case for them, but I'd like to understand more about how they will be used, since I think this is the first time we've introduced the idea of parsing "fragments" of input as a public API, and the ability to do that may constrain what refactorings are possible in the internals of the parser in the future.

```go
rng := bi.Range()
diags = append(diags, &hcl.Diagnostic{
	Severity: hcl.DiagError,
	Summary:  fmt.Sprintf("Unexpected definition (%T)", bi),
```

Contributor

Usually we've written errors like this with an end-user target audience in mind, so that a caller can use a call to this function to represent the assertion "the user should have written an attribute" and automatically get a good error message if the user didn't provide one.

With that said, I'm not totally sure about that framing for these new functions. It could well be that we consider it a programming error on the part of the caller to pass in tokens representing a block here, in which case I suppose this could be okay although in cases like that HCL has typically used panic rather than error diagnostics so far. 🤔

As a compromise, what do you think about taking the text of the message HCL would normally return if the schema calls for attribute syntax but the user wrote a block, and reducing it to fit what we can determine here without a schema? For example:

Error: Unsupported block type

Blocks of type "example" are not expected here.

(To do this would, I realize, require type-asserting bi to (*Block) first, so I guess there would still need to be a generic fallback for the should-never-happen case of it not being a block, but that could presumably be a panic, because it would only ever happen if there were a bug inside Parser.ParseBodyItem.)

(I have similar feedback for the opposite case of ParseBlockFromTokens above, but I won't write it all out again. 😄 )
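The suggestion above could be sketched roughly like this (hypothetical code, assuming it lives inside the proposed ParseAttributeFromTokens; the diagnostic text mirrors HCL's existing "Unsupported block type" message):

```go
// After parsing the body item, report a user-oriented error if the
// caller's tokens described a block, and panic on anything else,
// since that would indicate a bug rather than bad user input.
switch item := bi.(type) {
case *hclsyntax.Attribute:
	return item, diags
case *hclsyntax.Block:
	diags = append(diags, &hcl.Diagnostic{
		Severity: hcl.DiagError,
		Summary:  "Unsupported block type",
		Detail:   fmt.Sprintf("Blocks of type %q are not expected here.", item.Type),
		Subject:  item.DefRange().Ptr(),
	})
	return nil, diags
default:
	// Should never happen unless Parser.ParseBodyItem has a bug
	panic(fmt.Sprintf("unexpected body item type %T", bi))
}
```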


```go
peeker := newPeeker(tokens, false)

// Sanity checks to avoid surprises
```

Contributor

Let's say "Initial checks" instead here, for inclusiveness. ❤️

@radeksimko
Member Author

radeksimko commented Sep 7, 2020

Regarding ParseBodyItemFromTokens and the two other similar methods:

TL;DR: the language server currently uses these methods, mostly as a form of optimization.

I will try to share some more context below.

The main reason for doing tokenization in a separate step is that the language server needs to know exactly which parsed token an hcl.Pos falls within, so it can provide correct completion data (e.g. complete just ource when the user requested completion at res<HERE>, rather than the whole resource, which would result in resresource).

This can be implemented in a few ways:

  • maintain a single long array of tokens for the whole configuration and perform lookups there
  • maintain tokens for each block/attribute and perform a much more scoped lookup in a much shorter array of tokens representing just that single block/attribute (this is what the language server does today)

With that said, I understand your concerns, and I admit this may be premature optimization which doesn't come cheap from an API perspective. Perhaps a single array is just fine.
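The scoped lookup described in the second bullet could be sketched like this (a hypothetical helper, assuming the caller has kept the short per-block token slice around; hcl.Range.ContainsPos is part of the existing API):

```go
// tokenAtPos finds the token containing pos within the (short) token
// slice kept for a single block or attribute.
func tokenAtPos(tokens hclsyntax.Tokens, pos hcl.Pos) (hclsyntax.Token, bool) {
	for _, t := range tokens {
		if t.Range.ContainsPos(pos) {
			return t, true
		}
	}
	return hclsyntax.Token{}, false
}
```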


I'm currently working on a proposal for a new parser for the language server, which also involves looking into this problem more deeply. I have not figured out yet how to provide accurate completion data within hcl.Expression where the expression is made of more than a single token. This may affect the API a bit too, so I'll chew on that a bit more and come back to this PR.

@radeksimko
Member Author

I'm going to close this as I found a different (more elegant?) way of resolving the same problem. The new API, which was just published in the form of hashicorp/hcl-lang, accepts *hcl.File, which itself holds the raw bytes of each file. That allows me to re-tokenize those bytes on demand, only when really necessary (e.g. when looking up completion candidates) and only for the relevant file.

@radeksimko radeksimko closed this Nov 2, 2020
@radeksimko radeksimko deleted the f-token-parser branch November 2, 2020 10:00