Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: parser instead of regex for preprocessor #6611

Closed

Conversation

tanhauhau
Copy link
Member

@tanhauhau tanhauhau commented Aug 3, 2021

Fixes #5900, fixes #5292, fixes #4701

Replace the regex to a simple parser, allow us to fix #5292 and support #4701

Before submitting the PR, please make sure you do the following

  • Prefix your PR title with [feat], [fix], [chore], or [docs].
  • This message body should clearly illustrate what problems it solves.
  • Ideally, include a test that fails without this PR but passes with it.

Tests

  • Run the tests with npm test and lint the project with npm run lint

@tanhauhau tanhauhau force-pushed the tanhauhau/preproces-parse branch from 7d10174 to 24b8d45 Compare August 3, 2021 03:15
@Conduitry
Copy link
Member

I'm worried about using a parser for this, because we have no idea what sort of syntax people will be using in the original (unpreprocessed) code. For example, we have no idea whether the language that they're trying to preprocess in expressions has any of the same features of matching braces or strings or comments as JS does.

@tanhauhau
Copy link
Member Author

tanhauhau commented Aug 4, 2021

hmm.. fair enough, would it make sense to have an option on the preprocessors to decide whether to use regex or parser?

for example:

const preprocessor = {
  style() {},
  markup() {},
  script() {},
  mode: 'parser' | 'regex' = 'regex',
};

with default as regex, which will make preprocess() backwards compatible, and preprocessors author can update on their own pace

though, i'm not sure to use the word "regex" and "parser", it feels like leaking our implementation out

and alternative maybe to specify as version number:

const preprocessor = {
  style() {},
  markup() {},
  script() {},
  version: 1,
};

@dummdidumm
Copy link
Member

dummdidumm commented Aug 16, 2021

I like this parser very very much, because it could help us with all sorts of tooling. Preprocessing would get easier, but the ESLint and Prettier plugins could also benefit from this, and language-tools could possibly get rid of some custom top level script extractor logic. So it would be great if the parser could be available for direct use.

Regarding the preprocessor enhancement: I propose to not enhance the existing API and instead try to make it part of the new API which is discussed in this RFC: sveltejs/rfcs#56

I therefore propose to split this PR up into two pieces: One which simply makes this new parser available publicly (whether that's under svelte/preprocess or not is TBD), and once the new preprocessor API is finalized another PR where the parser is usable as part of that new API.

Another thing that would help tooling: Add a isTopLevel property to script / style entries to distinguish

<script>
   console.log('top level')
</script>
<div>
  <script>console.log('not top')</script>
</div>

@dummdidumm
Copy link
Member

@pngwn might have some thoughts about this, he wrote a language-agnostic parser, too, albeit with more sophisticated parsing mechanism for the contents of example each etc and producing a complete AST in the end (https://github.com/pngwn/MDsveX/tree/master/packages/svelte-parse).

@Conduitry
Copy link
Member

Regarding the ESLint plugin benefiting from this: I'm not sure what that would look like exactly, but unless it were a synchronous API (which the existing preprocessor API is not), it couldn't be used in ESLint, where all plugins need to run synchronously.

@dummdidumm
Copy link
Member

The parser itself is synchronous and could help ESLint etc to for example find out where the top level script/style tags are - which is why I asked if we could split this PR into two, one for the parser itself because it would be immediately useful in itself, and the preprocessing changes independent of that.

@baseballyama
Copy link
Member

I may not understand something regarding this PR.
However, I have a small idea for finding top-level <script> / <style> tag.
So I'm posting it.
I'm sorry if I'm saying fundamentally wrong or any considerations were omitted.

I think this process could be made easier by adding a rule that <script>/ <style> tags must always be at the beginning of a line.
This is breaking changes but I don't think it will affect most users.
And I wrote about how to notice it to affected users at the bottom.

This way, perhaps only regular expressions, or a very simple parser that is completely language-independent, will be able to extract each part correctly.

The below (maybe only) case doesn't handle properly, but I think it's enough to put it in the documentation because this is a super edge case.

<sctipt>.
  const str = `
<script>
  const hoge = "hoge";
</script>
`;
</script>

It would be nice to have some kind of warning and a way to disable it (like <!-- svelte-ignore invalid-script-tag -->) in case a user unintentionally writes an invalid <script> / <style> tag.

And to do this, we need to traverse an entire code before separating it for finding <script> / <style> tags that are not at the beginning of a line.
Is there currently a mechanism for this?


By the way,
For the part that extracts expressions, for example, if user use pug, the notation will be as follows.
(Just I copied from web.)
if statements doesn't have {, }.
Therefore current parser may doesn't run as expected if my understand is correct.

 ul
   +if('posts && posts.length > 1')
     +each('posts as post')
       li
         a(rel="prefetch" href="blog/{post.slug}") {post.title}
     +else()
       span No posts :boom:

@bluwy
Copy link
Member

bluwy commented Dec 4, 2021

@baseballyama I'm not sure if checking for <script> at the first of line would be a robust solution. The format of the code shouldn't change its semantic meaning, so I think a simple parser would work better here.

Re pug, it shouldn't be an issue since the parser would always run on pure Svelte code. Pug would be preprocessed into html before Svelte parses it.

@baseballyama
Copy link
Member

baseballyama commented Dec 4, 2021

@bluwy

Thank you for your comment!

I'm not sure if checking for <script> at the first of line would be a robust solution. The format of the code shouldn't change its semantic meaning, so I think a simple parser would work better here.

I add an additional explanation of my idea.

Point1

This is just IMO, but the parser doesn't know anything other than <script> / </script> and <style> / </style> strings before splitting them into script and style and markup parts.
This is because it doesn't know what language is used.
Thus, the parser needs to rely only on the <script> / </script> string and the <style> / </style> string for parsing.

According to GitHub issues, Svelte users sometimes write <script> or <style> strings inside <script> or <style> or markup part.
However, there doesn't seem to be a case for writing them at the beginning of a line.
This is because there is a context that it is a real <script> / <style> tag.

Point2

From another perspective, I think we can recognize <script> and <style> not as HTML tags, but just separators that represent sections of a .svelte file.
Then it is ergonomically natural to write the separator at the beginning of the line, like # in markdown.

For these two reasons, it may make sense to introduce the rule of my idea,
But anyway, if it is true that this rule can simplify the parsing process, I thought it would be a good idea to consider it.
And if the maintainers think that a simple parser is better, of course, I will follow it!

Re pug, it shouldn't be an issue since the parser would always run on pure Svelte code. Pug would be preprocessed into html before Svelte parses it.

For instance, I had understood that there is a use case where language-tools want to use this before pre-processing.
Sorry if I'm wrong. (I need to learn more about it.😅)
And I totally agree with you about the Svelte compiler, I don't think it will be a problem since it is already pre-processed.

@bluwy
Copy link
Member

bluwy commented Dec 5, 2021

@baseballyama I'm still not sure if it's possible with the idea you suggested. I agree that we can find the start tag easily with this assumption, but the closing tag is still hard to match. Re point2, determining whether they are a separator is another tricky one to solve. Markdown's # works because it must start on a new line, Svelte however doesn't mind where the tag is at as long as it's proper HTML, for example a Svelte file written only in one line. But if I'm still misunderstanding your point, feel free to try out your idea 😅

Re language tools, I'm not familiar with the code, but it looks like it's for stripping out the tags before svelte.parse so it doesn't error out. @dummdidumm may know more about that.

If you have anymore questions (that might be off-topic), It'd be great to have them in the Svelte discord contributing channel. Would be nice to see you there too!

@tanhauhau tanhauhau added this to the one day milestone Mar 1, 2023
@benmccann benmccann changed the title [feat] parser instead of regex for preprocessor feat: parser instead of regex for preprocessor Mar 14, 2023
@Rich-Harris
Copy link
Member

Closing Svelte 4 PRs as stale — thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants