html: apostrophe in JavaScript comment breaks guest parser #3581

Closed
polyscone opened this issue Dec 6, 2022 · 9 comments · Fixed by #3598

Comments

@polyscone

The name of the parser:

JavaScript as a guest parser in HTML.

The command line you used to run ctags:

$ ctags --options=NONE --extras=+g foo.html

The content of input file:

<h1>Foo</h1>

<script>
	const bar = 123

	// I don't know why, but an apostrophe breaks
	// the JavaScript guest language
	function baz () {
		return 'abc'
	}
</script>

The tags output you are not satisfied with:

!_TAG_FILE_FORMAT       2       /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED       1       /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_EXCMD      mixed   /number, pattern, mixed, or combineV2/
!_TAG_OUTPUT_FILESEP    slash   /slash or backslash/
!_TAG_OUTPUT_MODE       u-ctags /u-ctags or e-ctags/
!_TAG_PATTERN_LENGTH_LIMIT      96      /0 for no limit/
!_TAG_PROGRAM_AUTHOR    Universal Ctags Team    //
!_TAG_PROGRAM_NAME      Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL       https://ctags.io/       /official site/
!_TAG_PROGRAM_VERSION   5.9.0   /4099472/
Foo     ./foo.html      /^<h1>Foo<\/h1>$/;"     h

The tags output you expect:

!_TAG_FILE_FORMAT       2       /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED       1       /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_EXCMD      mixed   /number, pattern, mixed, or combineV2/
!_TAG_OUTPUT_FILESEP    slash   /slash or backslash/
!_TAG_OUTPUT_MODE       u-ctags /u-ctags or e-ctags/
!_TAG_PATTERN_LENGTH_LIMIT      96      /0 for no limit/
!_TAG_PROGRAM_AUTHOR    Universal Ctags Team    //
!_TAG_PROGRAM_NAME      Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL       https://ctags.io/       /official site/
!_TAG_PROGRAM_VERSION   5.9.0   /4099472/
Foo     ./foo.html      /^<h1>Foo<\/h1>$/;"     h
bar     ./foo.html      /^      const bar = 123$/;"     C
baz     ./foo.html      /^      function baz () {$/;"   f

The version of ctags:

Universal Ctags 5.9.0(4099472), Copyright (C) 2015-2022 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Nov 29 2022, 02:36:15
  URL: https://ctags.io/
  Optional compiled features: +win32, +wildcards, +regex, +gnulib_regex, +internal-sort, +unix-path-separator, +iconv, +option-directory, +xpath, +json, +interactive, +yaml, +case-insensitive-filenames, +packcc, +optscript, +pcre2

How do you get ctags binary:

Win32 binary taken from Universal-ctags/ctags-win32 project.

Extra details:

The tags are correctly generated when using the following input in a .js file:

const bar = 123

// I don't know why, but an apostrophe breaks
// the JavaScript guest language
function baz () {
	return 'abc'
}

The tags are incomplete only when the JavaScript parser runs as a guest inside the HTML parser.

If I remove the apostrophe from the word "don't" in the JavaScript comment then tags are generated in full as expected.

masatake changed the title from "Apostrophe in JavaScript comment breaks guest parser" to "html: apostrophe in JavaScript comment breaks guest parser" on Dec 6, 2022
@masatake
Member

masatake commented Dec 6, 2022

Thank you. Reproduced. I found some variants:

<h1>Foo</h1>
<script>
  // '
  var x
</script>

<h1>Foo</h1>
<script>
  // "
  var x
</script>

<h1>Foo</h1>
<script>
  // </script
  var x
</script>

@masatake
Member

masatake commented Dec 6, 2022

Do you know HTML well?
I wonder whether a newline inside '...' and "..." is allowed in an HTML file.

If it is not allowed, the following patch may mitigate this issue:

diff --git a/parsers/html.c b/parsers/html.c
index 069c667ac..017ab24c4 100644
--- a/parsers/html.c
+++ b/parsers/html.c
@@ -326,7 +326,7 @@ getNextChar:
 		{
 			const int delimiter = c;
 			c = getcFromInputFile ();
-			while (c != EOF && c != delimiter)
+			while (c != EOF && c != delimiter && c != '\n')
 			{
 				vStringPut (token->string, c);
 				c = getcFromInputFile ();
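
To make the effect of the one-line change above concrete, here is a minimal sketch of the same loop working on an in-memory string. It is an illustration only: `scan_quoted` and its parameters are hypothetical names, not functions from `parsers/html.c`, and the real code reads characters with `getcFromInputFile ()` rather than from a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the quoted-token loop shown in the diff.
 * src[start] must be the opening quote character. The characters that
 * follow it are copied into out until the matching delimiter, the end of
 * the input, or (only when stop_at_newline is non-zero) the first newline.
 * Returns the index of the character that stopped the scan. */
static size_t scan_quoted(const char *src, size_t start, int stop_at_newline,
                          char *out, size_t out_size)
{
    const char delimiter = src[start];
    size_t i = start + 1;
    size_t n = 0;

    assert(out != NULL && out_size > 0);

    while (src[i] != '\0' && src[i] != delimiter
           && !(stop_at_newline && src[i] == '\n'))
    {
        if (n + 1 < out_size)   /* keep room for the terminating NUL */
            out[n++] = src[i];
        i++;
    }
    out[n] = '\0';
    return i;
}
```

With `stop_at_newline` set to 0 (the original condition), a scan started at the apostrophe in `// '` runs to the end of the input, so the closing `</script>` ends up inside the token. With `stop_at_newline` set to 1 (the patched condition), the scan stops at the end of the comment line. The same check would also cut a quoted HTML attribute value at its first newline if that value spans several lines.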

masatake added a commit to masatake/ctags that referenced this issue Dec 6, 2022
Close universal-ctags#3581.

MORE DESCRIPTIONS ARE NEEDED.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
@jafl
Contributor

jafl commented Dec 6, 2022

Newlines are not permitted in JavaScript strings.

@polyscone
Author

As jafl said, newlines aren't permitted in '' and "" strings, but are permitted in backtick (`) strings, such as:

var hello = `
  world
`

I know you didn't ask specifically about backticks, but I built and tested #3585 with some multiline backtick strings as well and it seemed to work just fine for me.

@masatake
Member

masatake commented Dec 7, 2022

I'm talking about HTML.
The apostrophe makes it hard for our HTML parser to find the area (<script> ~ </script>).
This is not a bug in our JavaScript parser.
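
One way to picture the problem described here: inside a script area, quote characters belong to JavaScript, so an HTML-aware scan that pairs them up can run past the closing tag. The sketch below takes the opposite approach and looks only for the closing tag text, ignoring quotes entirely. It is a rough illustration under stated assumptions, not the ctags implementation; `find_script_end` is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a ctags function. Returns a pointer to the first
 * "</script" (matched case-insensitively) at or after p, or NULL if there is
 * none. Quotes, "//" comments and "<!--" are deliberately not interpreted. */
static const char *find_script_end(const char *p)
{
    static const char closer[] = "</script";
    const size_t len = sizeof closer - 1;

    assert(p != NULL);

    for (; *p != '\0'; p++)
    {
        size_t i = 0;
        /* The comparison fails at the terminating NUL, so p[i] never reads
         * past the end of the string. */
        while (i < len && tolower((unsigned char)p[i]) == closer[i])
            i++;
        if (i == len)
            return p;
    }
    return NULL;
}
```

Given the text that follows `<script>` in the reported input, this returns the position of `</script>` regardless of the apostrophe in the comment. For the `// </script` variant it would end the area at the text inside the comment; the thread does not say how that case should be treated.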

@polyscone
Author

> I'm talking about HTML.

Ah ok, it seemed like you were asking about strings in JS in HTML.

In HTML you can do things like this and the browser will accept it:

<h1 class="
    bar
    baz
    qux:quxx
">Foo</h1>

Is this what you wanted to know?

@masatake
Member

masatake commented Dec 7, 2022

@polyscone, yes, thank you. That is what I wanted to know.

@masatake
Member

I found another variant:

<h1>Foo</h1>
<script>
  // <!--
  var x
</script>

@masatake
Member

masatake commented Dec 17, 2022

Although I found a variant of this issue, I will focus on apostrophes here.
I will close this via #3585. I will make a new pull request fixing both this one and #3597.

masatake added a commit to masatake/ctags that referenced this issue Dec 17, 2022
The original code used an HTML-aware tokenizer for reading
tokens in <script>...</script> areas.

As reported in universal-ctags#3581 and universal-ctags#3597, the original code could
not recognize <script>...</script> areas in some cases.

This change introduces a tokenizer specialized for script
areas in addition to the original HTML-aware tokenizer.

Close universal-ctags#3581.
Close universal-ctags#3597.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit that referenced this issue Dec 18, 2022
HTML: introduce a specialized tokenizer for script areas

Close #3581.
Close #3597.