Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: HTML to markdown parser #381

Merged

Conversation

pranshuchittora
Copy link
Contributor

@pranshuchittora pranshuchittora commented May 29, 2021

@roryabraham @jasperhuangg @marcaaron will you please review this?

Context: Expensify/App#2847

Tests

What unit/integration tests cover your change? What autoQA tests cover your change?
Yes I have added tests

What tests did you perform that validates your changed worked?
Yes

QA

What does QA need to do to validate your changes?
Try various HTML strings with <br> tags in various combinations

HTML -> MD

  1. Hello<br/>There -> Hello\nThere
  2. Hello<br></br>There -> Hello\nThere

Tests -> https://github.com/Expensify/expensify-common/pull/381/files#diff-7031bf36a0d060dfeb1742c45b8fa6853fb5790ef4da366b0fb96ceaafd342ea

What areas to they need to test for regressions?
Parser

@pranshuchittora pranshuchittora requested a review from a team as a code owner May 29, 2021 14:49
@github-actions
Copy link

github-actions bot commented May 29, 2021

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@MelvinBot MelvinBot requested review from Luke9389 and removed request for a team May 29, 2021 14:50
@pranshuchittora
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@roryabraham
Copy link
Contributor

@pranshuchittora I've run the test check suite on this PR and found that there are lint errors that need to be resolved

@pranshuchittora
Copy link
Contributor Author

@roryabraham I have fixed the linting issues :)

Copy link
Contributor

@Luke9389 Luke9389 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @pranshuchittora. Nice work so far. I've got a few small comments and one larger one that I'll post below.

lib/ExpensiMark.js Outdated Show resolved Hide resolved
lib/ExpensiMark.js Outdated Show resolved Hide resolved
lib/ExpensiMark.js Outdated Show resolved Hide resolved
lib/ExpensiMark.js Show resolved Hide resolved
@Luke9389
Copy link
Contributor

Luke9389 commented Jun 2, 2021

Do we want to match breaking space tags that are inside of code blocks?

If you sent a message like the following, what would we want our code to do?
"Hey fellow engineer, I noticed we are missing a <br/> closing tag on line X."

Currently, I think we'd end up matching and replacing that tag, which is not desired behavior. Perhaps we should establish a method for preventing matches inside of code blocks. Thoughts?

@Luke9389
Copy link
Contributor

Luke9389 commented Jun 2, 2021

As a side note, I think you can put the changes from this PR here as well. I like the organization of having them separate, but I think it's best for our workflow to just have one PR to track.

@pranshuchittora
Copy link
Contributor Author

"Hey fellow engineer, I noticed we are missing a
closing tag on line X."

As this happens on runtime (conversion), therefore this becomes irrelevant.
My approach / philosophy is inspired by the way HTML works instead of breaking it assumes and fixes the tree on runtime

@Luke9389
Copy link
Contributor

Luke9389 commented Jun 7, 2021

As this happens on runtime (conversion), therefore this becomes irrelevant. My approach / philosophy is inspired by the way HTML works instead of breaking it assumes and fixes the tree on runtime

Interesting. Would you mind elaborating a bit further, for my sake? Also, is this something we can test?

@pranshuchittora
Copy link
Contributor Author

pranshuchittora commented Jun 7, 2021

As this happens on runtime (conversion), therefore this becomes irrelevant. My approach / philosophy is inspired by the way HTML works instead of breaking it assumes and fixes the tree on runtime

Interesting. Would you mind elaborating a bit further, for my sake? Also, is this something we can test?

Sure, what I understand is that we should show some warning for unclosed <br> in dev mode. This kinda becomes as messages are going to be generated by the users on runtime.

The only way this becomes relevant is when input to HTML conversion happens. Instead of handling it here, I recommend writing robust test cases for the HTML conversion parser.

Correct me if I am missing something 🤔 @Luke9389

@Luke9389
Copy link
Contributor

Luke9389 commented Jun 7, 2021

Sure, what I understand is that we should show some warning for unclosed <br> in dev mode.

Ok it's possible we're talking about two different things right now. My main concern is that if someone puts a br tag inside backticks, that we won't make a newline. I'm not sure what you're referring to with showing a warning.

@pranshuchittora
Copy link
Contributor Author

Sure, what I understand is that we should show some warning for unclosed <br> in dev mode.

Ok it's possible we're talking about two different things right now. My main concern is that if someone puts a br tag inside backticks, that we won't make a newline. I'm not sure what you're referring to with showing a warning.

Yeah, I agree that's an edge case. I try finding out how other parsers are doing this unfortunately one of the famous parser has buggy behaviour with this as well.

http://domchristie.github.io/turndown/

Paste this HTML 👇🏼

<code>
Hello There <br/>
</code>

@Luke9389
Copy link
Contributor

Luke9389 commented Jun 9, 2021

Couldn't we just use regex to accomplish this? I think negative lookahead will allow us to avoid matching a breaking tag wrapped in backticks. Something like this:

<br\s*[/]?>(?!`)

That way we won't match <br/>

We already use negative lookahead in some of our other regex patterns in this file, so we should be safe to use it here.

@pranshuchittora
Copy link
Contributor Author

pranshuchittora commented Jun 9, 2021

I think negative lookahead will allow us to avoid matching a breaking tag wrapped in backticks

@Luke9389 backticks are from the markdown world.
Backticks are converted to <code> in HTML. The input to this will look like this

<code>
Hello There <br/>
</code>

Therfore having backticks in the input HTML is irrelevant and can be ignored in HTML -> MD parsing

@pranshuchittora
Copy link
Contributor Author

Hi, @marcaaron can you help merging this

@marcaaron
Copy link
Contributor

Sorry, @pranshuchittora please give the assigned reviewers time to handle this and if > 3 days ask in the Slack channel for help

@Luke9389
Copy link
Contributor

Hey @pranshuchittora, sorry for the late response here. I've been on vacation the past week.

Ok, I see what you're saying about the backticks, but we aren't out of the weeds yet.

First, I want to be sure we're on the same page. When someone sends a message with this, <br/>, they're talking about a br tag, and so we shouldn't parse that to a new line. That gets sent, turned into HTML (< code >< br/ >< /code >), and now when we are parsing that HTML back to markdown, we want to be sure we don't create a new line (we want it to be turned into <br/>).

So in my view, we should have some way of preventing the content inside of code blocks from being parsed. Maybe that should be saved for a different PR? What do you think @pranshuchittora?

@jasperhuangg jasperhuangg self-requested a review June 15, 2021 07:05
Copy link
Contributor

@jasperhuangg jasperhuangg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Luke's comments about ensuring that any code inside of code blocks isn't parsed.

@pranshuchittora
Copy link
Contributor Author

Hey @pranshuchittora, sorry for the late response here. I've been on vacation the past week.

Ok, I see what you're saying about the backticks, but we aren't out of the weeds yet.

First, I want to be sure we're on the same page. When someone sends a message with this, <br/>, they're talking about a br tag, and so we shouldn't parse that to a new line. That gets sent, turned into HTML (< code >< br/ >< /code >), and now when we are parsing that HTML back to markdown, we want to be sure we don't create a new line (we want it to be turned into <br/>).

So in my view, we should have some way of preventing the content inside of code blocks from being parsed. Maybe that should be saved for a different PR? What do you think @pranshuchittora?

Yup we are on the same page 👍🏼
Solving this is very tricky as we can’t find <code> directly as it can have some attribute like <code some-attribute="xyz" >. I checked few parsers out there as discussed in this comment #381 (comment)

IMO let’s solve that with another issue and PR

@Luke9389
Copy link
Contributor

Yea, I think solving that in another PR is reasonable. It'll be its own issue with its own solution. I'm ready to merge as soon as this gets resolved: https://github.com/Expensify/expensify-common/pull/381/files#r644268511

@pranshuchittora
Copy link
Contributor Author

Yea, I think solving that in another PR is reasonable. It'll be its own issue with its own solution. I'm ready to merge as soon as this gets resolved: https://github.com/Expensify/expensify-common/pull/381/files#r644268511

https://github.com/Expensify/expensify-common/pull/381/files#r652000032

Copy link
Contributor

@Luke9389 Luke9389 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing, can you add specific tests (numbered and detailed) for what Web QA should do to test this? Try and make it as easy for them as possible to understand what they need to do. Right now it's too vague.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants