Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline HTML handling is sloppy #9

Open
rexxars opened this issue Jan 29, 2016 · 18 comments
Open

Inline HTML handling is sloppy #9

rexxars opened this issue Jan 29, 2016 · 18 comments

Comments

@rexxars
Copy link
Owner

rexxars commented Jan 29, 2016

The fact that foo <strong>bar</strong> gets turned into foo <span><strong></span>bar<span></strong></span> is obviously horribly broken. It's tied to the fact that the walker gives <strong> as an inline HTML element, then gives bar as a text node, followed by </strong> as another inline html element. Not 100% how to best handle this yet.

@rexxars rexxars added the bug label Jan 29, 2016
@kadishmal
Copy link

Stumbled upon this exact issue in remarkjs/react-markdown#17.

@rexxars
Copy link
Owner Author

rexxars commented Feb 29, 2016

Because of the way the commonmark parser is built, this one is pretty tricky to solve, unfortunately. Would love some help on this if someone has time.

@kadishmal
Copy link

I guessed that this comes from the parser. Any hint where I should look for this issue?

@rexxars
Copy link
Owner Author

rexxars commented Feb 29, 2016

Basically, when encountering foo <strong>bar</strong>, the parser yields four tokens:

  1. Text with the literal foo
  2. InlineHtml with the literal <strong>
  3. Text with the literal bar
  4. InlineHtml with the literal </strong>

Because of how React works, we can't just inject "half a tag" somewhere. If the parser gave us <strong>bar</strong>, things would be a lot simpler, but this doesn't work because of things like the following: foo <strong>*bar*</strong>. In other words, markdown inside of inline HTML.

@kadishmal
Copy link

Yeah, I see. Seems like have to fall back to the plain Markdown processor and inject the output dangerously.

@tauren
Copy link
Contributor

tauren commented Mar 23, 2017

I also ran into this issue in remarkjs/react-markdown#67

Is this an issue in the commonmark parser, or the way it is used by commonmark-react-renderer? Has anyone looked into alternative commonmark parsers?

@rexxars
Copy link
Owner Author

rexxars commented Mar 28, 2017

The issue is basically that you'd need to pull in an HTML parser in order to properly handle this.
Let's imagine you have this HTML:

<span id=unquoted class="foo">Some *bold* <span class='outlined'>text</span></span>

In an ideal world, it would have to convert this to the following tree:

{
  tag: 'span',
  attrs: {id: 'unquoted', className: 'foo'},
  children: [
    'Some ',
    {tag: 'strong', children: ['bold']},
    {tag: 'span', attrs: {className: 'outlined'}, children: ['text']}
  ]
}

Which is non-trivial to implement with a parser that only emits tokens without having any notion of "depth". You'd have to keep track of the structure you're inside of, so that for the opening <span>, you'd note that we're now entering a node and need to record every new token as a child, until you run into a new opening tag or a closing tag for the span. Then you'd also have to consider cases like optional closing tags (good old <p>, for instance), incorrect HTML structures (<span>Foo<em></span>Muh</em>) and the likes. It really is a nightmare.

@tauren
Copy link
Contributor

tauren commented Mar 28, 2017

What are the objections to pulling in an HTML parser? Is it the size? Does the commonmark parser not help with this? Would changing parsers solve this?

The way things are now, inline html doesn't work, even for very simple and properly formed html:

<span>Hello</span>

Here is what gets rendered:

<p><span><span></span></span><!-- react-text: 89 -->Hello<!-- /react-text --><span></span></p>

I suggest that the project README reflect this situation until it can be solved and it should also be mentioned in the react-markdown readme. I suspect this might save others a bunch of time figuring out why things aren't rendering right.

@rexxars
Copy link
Owner Author

rexxars commented Mar 28, 2017

Well, even if you pull in an HTML parser (which will add significant weight), you'll still have to figure out how to deal with all the edge cases, as I outlined above (broken HTML, optional closing tags etc).

It's not a task I want to venture out on - my time is already stretched way too thin as it is. I'd be more than happy to accept a pull request that would improve the situation. Last time I thought about this problem, I came across html-to-react, which would solve much of this problem if only the commonmark parser returned blocks of HTML instead of just tokens for individual HTML fragments (start and end tags as individual tokens). So you'd still be left with trying to match up start and end tags manually.

I'll try to update the readmes to reflect this state.

@tauren
Copy link
Contributor

tauren commented Mar 28, 2017

Thanks for the explanation, and I understand being stretched too thin. I was hoping to get a clear picture of how you would solve this so that someone else could take on the work and create a PR you would potentially accept. It would also help to know what you don't want to see (adding significant weight to project, etc.).

I am stretched thin and don't have the time to solve this problem either. I'm working around the issue in my current project, but it keeps rearing it's ugly head and I may end up needing to work on a real solution.

@kentcdodds
Copy link

Hi there! I'm evaluating this project right now. Honestly, I wouldn't expect the contents of HTML to be parsed as markdown. Couldn't things be drastically simplified if we say that once you get into HTML-land we no longer parse it as markdown? Kinda like switching context from JSX to expressions and back again using { and }?

@rexxars
Copy link
Owner Author

rexxars commented May 10, 2017

Hi Kent!

For sure, things would be simpler if inline HTML was just emitted as one big chunk. I don't think this is consistent with other markdown parsers though, and might not necessarily be what you want. It also doesn't really solve the problem unless we switch to a different parser.

As outlined earlier, the parser itself outputs individual tokens. If it emitted a single token, it would be a different story.

@Venryx
Copy link

Venryx commented Aug 13, 2017

Note that while <span>*Something*</span> is, as you say, parsed by Remarkable as three tokens (so Markdown can be used inside), this is not true for <div>*Something*</div>.

In that case, Remarkable just takes everything inside as pure html -- which is why that is output as a single htmlblock token, rather than htmltag, text, and htmltag.

So it's kind of odd to me why the authors of Remarkable did this -- it seems kind of inconsistent. They have span allowing Markdown inside it, whereas div does not.

EDIT: Just realized this project uses the Commonmark parser instead of the Remarkable one; however, I suspect they may have the same difference in handling of <div> and <span>.

@localjo
Copy link

localjo commented Oct 27, 2017

I'm going to try to look into this more in the next two weeks. One question; if we could detect whether the node was an opening/closing tag of inline HTML with something like if (!props.isBlock && props.literal.match(/<([A-Z]+)[^>]+>/i)) how would we then skip creating a react element for that node? Or would it be better to group an opening tag, text node and closing tag together and then create a react element for that?

@rexxars
Copy link
Owner Author

rexxars commented Oct 28, 2017

First of all, thanks a ton for showing interest in helping with this! <3

Been a while since I looked at this, so I don't have a clear picture of the issue in my head right now. Having said that, here's the first few thoughts that pops into my head when we're dealing with this stuff:

  • If you are using react-markdown, beware that I've been wanting to switch to a different parser for some time. I've got a semi-working prototype going that uses remark-parse as the parser. I'd focus my efforts on that branch. If you're using react-commonmark, or just this parser in and of itself, this is the right place to address it.

  • Simply matching the opening and closing tags might be a good start, but keep in mind that to properly create React elements from the HTML, you'll want to also extract the attributes in the opening tag and apply them to the created element.

  • Things I'm worried about being difficult to handle properly:

    • Invalid HTML (What happens if you put a <p> in there somewhere, and never close it? What if you open a <p>, then a <span> and close the tags in the wrong order?)
    • Properly parsing HTML without a real HTML parser
  • Having said that, my first suggestion (without having taken a stab at it) would be to:

    1. Keep an array of open tags/nodes that you append to every time you hit an InlineHtml node
      • If you hit an empty element, parse it and produce a react element right away
    2. Keep an array of all the other nodes produced since the last open html tag
    3. When you encounter an InlineHtml node, parse it, see if it's a closing tag. If so, check if you have a corresponding open tag. If so, close all the open nodes that have been encountered after that tag, and add all those elements along with the array of produced nodes previously recorded as children of the current tag.

As far as I know, once a React element has been created, there isn't a way to add children to it, so you'll have no choice but to group the children together as you suggested.

Good luck!

@localjo
Copy link

localjo commented Oct 30, 2017

Oh, very interesting. I'm familiar with remark-parse. I used a lot of remark plugins for https://github.com/sparkartgroup/quality-docs

That md-ast branch on react-markdown is 17 ahead and 54 behind master. What are your thoughts on rebasing that and then opening a PR with the current progress that we could use to track discussion?

@rexxars
Copy link
Owner Author

rexxars commented Oct 30, 2017

Sure, I'm super busy at work these days, but I'll see if I can find the time to get it up to speed sometime soon.

@rexxars
Copy link
Owner Author

rexxars commented Nov 4, 2017

Huge divergence from master, so I decided to just do a clean branch from master.
Anyway, check out https://github.com/rexxars/react-markdown/tree/remark if you're interested.

dbatten5 pushed a commit to appearhere/bloom that referenced this issue Jan 23, 2020
Since the current markdown parsing package has a known issue with
rendering inline html
(rexxars/commonmark-react-renderer#9) which
has been an issue for 3 years+, we need a new markdown package. The
react-markdown package has an option to render links with target="_blank"
so this commit allows the component to pass that option through
dbatten5 pushed a commit to appearhere/bloom that referenced this issue Jan 27, 2020
Since the current markdown parsing package has a known issue with
rendering inline html
(rexxars/commonmark-react-renderer#9) which
has been an issue for 3 years+, we need a new markdown package. The
react-markdown package has an option to render links with target="_blank"
so this commit allows the component to pass that option through
dbatten5 pushed a commit to appearhere/bloom that referenced this issue Jan 27, 2020
Since the current markdown parsing package has a known issue with
rendering inline html
(rexxars/commonmark-react-renderer#9) which
has been an issue for 3 years+, we need a new markdown package. The
react-markdown package has an option to render links with target="_blank"
so this commit allows the component to pass that option through
dbatten5 pushed a commit to appearhere/bloom that referenced this issue Jan 27, 2020
Since the current markdown parsing package has a known issue with
rendering inline html
(rexxars/commonmark-react-renderer#9) which
has been an issue for 3 years+, we need a new markdown package. The
react-markdown package has an option to render links with target="_blank"
so this commit allows the component to pass that option through
dbatten5 pushed a commit to appearhere/bloom that referenced this issue Jan 27, 2020
Since the current markdown parsing package has a known issue with
rendering inline html
(rexxars/commonmark-react-renderer#9) which
has been an issue for 3 years+, we need a new markdown package. The
react-markdown package has an option to render links with target="_blank"
so this commit allows the component to pass that option through
dbatten5 pushed a commit to appearhere/bloom that referenced this issue Jan 27, 2020
Since the current markdown parsing package has a known issue with
rendering inline html
(rexxars/commonmark-react-renderer#9) which
has been an issue for 3 years+, we need a new markdown package. The
react-markdown package has an option to render links with target="_blank"
so this commit allows the component to pass that option through
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants