Skip to content
This repository has been archived by the owner on Feb 28, 2022. It is now read-only.

html-to-hast doesn't respect self-closing esi tags #317

Closed
tripodsan opened this issue May 8, 2019 · 12 comments
Closed

html-to-hast doesn't respect self-closing esi tags #317

tripodsan opened this issue May 8, 2019 · 12 comments
Labels
bug Something isn't working docmeplz invalid This doesn't seem right

Comments

@tripodsan
Copy link
Contributor

html with self closing esi tags are wrongly parsed by html-to-hast.

eg:

  <div>
    <esi:include src="foo" />
    <esi:include src="bar" />
  </div>

is turned into

<div>
    <esi:include src="foo">
    <esi:include src="bar">
</esi:include></esi:include></div>
@tripodsan
Copy link
Contributor Author

@tripodsan
Copy link
Contributor Author

@davidnuescheler for now, don't use self-closing esi tags outside attributes.
eg: <img src='<esi:include src="foo" />>' would work.

@tripodsan tripodsan added the bug Something isn't working label May 8, 2019
@trieloff
Copy link
Contributor

trieloff commented May 8, 2019

@tripodsan do you know if this is a parsing problem or a serialization problem?

trieloff added a commit that referenced this issue May 8, 2019
@trieloff
Copy link
Contributor

trieloff commented May 8, 2019

After looking into this a bit, I'd say it is behaving as specified:

In HTML5, there are no self-closing tags (<tag />) per se, but there are a number of void elements (https://github.com/inikulin/parse5/blob/f550fbac86ac58c2dbc754efc7378771a387cbed/packages/parse5/lib/serializer/index.js#L70-L89 – the HTML parser we use behind the scenes has a good list). All other elements need to have an opening and closing tag. If there is a closing tag right behind, HTML parsers are typically able to close tags in the right way. If there isn't, you get funky results.

So, I'm going to mark this as wontfix, as it is behaving according to spec and the cost of deviating from it would mean hacking the HTML parser (even switching to JSDOM won't do, as it is using the same HTML parser).

In short: don't use self-closing ESI tags. I'm also tagging this docmeplz, so that we remember removing all examples of self-closing ESI tags from the documentation.

@trieloff trieloff added invalid This doesn't seem right docmeplz labels May 8, 2019
@trieloff trieloff closed this as completed May 8, 2019
@tripodsan
Copy link
Contributor Author

we could also ensure that all self-closing esi:include tags are expanded before invoking html-to-mdast. I'm sure many developers might fall in the same trap.

@trieloff
Copy link
Contributor

trieloff commented May 9, 2019

nodesi does not expose the parser function, so we would have to re-create it.

@tripodsan
Copy link
Contributor Author

tripodsan commented May 9, 2019

nodesi does not expose the parser function, so we would have to re-create it.

?? nodesi has no play in this. I mean, we could just do:

// (draft regexp, for illustration :-)
html = html.replace(/<esi:include(.*)/>/g, '<esi:include$1></esi:include>');

@trieloff
Copy link
Contributor

trieloff commented May 9, 2019

I know that we don't use nodesi for anything in pipeline yet, but I'd rather use the same ESI parser we have elsewhere in the stack over some regex.

trieloff added a commit to trieloff/nodesi that referenced this issue May 9, 2019
We need to pre-process the ESI tags so that they play nicely with a standard HTML parser, which does not recognize self-closing tags. Instead of rolling our own regex-based ESI parser, we'd like to use `nodesi`. Therefore this change renames `findESIInclueTags` to `findESIIncludeTags` and exposes it as part of the `ESI` API.

see adobe/helix-pipeline#317
@tripodsan
Copy link
Contributor Author

tripodsan commented May 9, 2019

same ESI parser we have elsewhere in the stack over some regex.

I wouldn't call it a parser...more like a good finder:

https://github.com/Schibsted-Tech-Polska/nodesi/blob/94360d0cc9fa407bc51773c42396834dfdd55a91/lib/esi.js#L68-L85

besides, fastly certainly doesn't use nodeesi....

@trieloff
Copy link
Contributor

trieloff commented May 9, 2019

A good finder is better than a great regex, especially when it offers some consistency with the rest of the stack. (we can't get all of the stack, of course – unless we compile Varnish's ESI module to webm)

@tripodsan
Copy link
Contributor Author

In short: don't use self-closing ESI tags. I'm also tagging this docmeplz, so that we remember removing all examples of self-closing ESI tags from the documentation.

Just for the record: I don't think this classified as invalid, since the esi-lang especially states that the tag is empty:

https://www.w3.org/TR/esi-lang

The include element specifies a fragment for assembly and allows for two optionally specified behaviors. include is an empty element; it does not have a closing tag.

<esi:include src="URI" alt="URI" onerror="continue" />

koraa pushed a commit that referenced this issue May 10, 2019
filipgolonka pushed a commit to Schibsted-Tech-Polska/nodesi that referenced this issue May 10, 2019
We need to pre-process the ESI tags so that they play nicely with a standard HTML parser, which does not recognize self-closing tags. Instead of rolling our own regex-based ESI parser, we'd like to use `nodesi`. Therefore this change renames `findESIInclueTags` to `findESIIncludeTags` and exposes it as part of the `ESI` API.

see adobe/helix-pipeline#317
@tripodsan
Copy link
Contributor Author

this is no longer relevant, as it is now supported with the new jsdom/htlengine #337

the tags are now automatically closed:

<div>
  <esi:include src="foo"></esi:include>
  <esi:include src="bar"></esi:include>
</div>

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working docmeplz invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

2 participants