feat: HTML to markdown parser #381

pranshuchittora · 2021-05-29T14:49:55Z

@roryabraham @jasperhuangg @marcaaron will you please review this?

Context: Expensify/App#2847

Tests

What unit/integration tests cover your change? What autoQA tests cover your change?
Yes, I have added tests.

What tests did you perform to validate that your changes worked?
Yes

QA

What does QA need to do to validate your changes?
Try various HTML strings with <br> tags in various combinations

HTML -> MD

Hello<br>There -> Hello\nThere
Hello<br/>There -> Hello\nThere

Tests -> https://github.com/Expensify/expensify-common/pull/381/files#diff-7031bf36a0d060dfeb1742c45b8fa6853fb5790ef4da366b0fb96ceaafd342ea
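For QA reference, here is a minimal sketch of the kind of conversion being tested. It is not the code from this PR: `replaceBreakTags` and `BR_TAG` are hypothetical names, and the pattern is an assumption based on the break-tag variants discussed in this thread.

```javascript
// Hypothetical helper for illustration only, not the ExpensiMark implementation.
// Matches <br>, <br/> and <br /> in any letter case.
const BR_TAG = /<br\s*\/?>/gi;

// Replace every break tag in an HTML string with a newline character.
function replaceBreakTags(html) {
    return html.replace(BR_TAG, '\n');
}

// JSON.stringify makes the inserted newline visible as \n when printed.
console.log(JSON.stringify(replaceBreakTags('Hello<br>There')));
console.log(JSON.stringify(replaceBreakTags('Hello<br />There')));
```

Strings without a break tag pass through unchanged, which is a cheap regression check for the rest of the parser.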

What areas do they need to test for regressions?
Parser

github-actions · 2021-05-29T14:50:14Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

pranshuchittora · 2021-05-29T16:06:45Z

I have read the CLA Document and I hereby sign the CLA

roryabraham · 2021-06-01T14:21:15Z

@pranshuchittora I've run the test check suite on this PR and found that there are lint errors that need to be resolved

pranshuchittora · 2021-06-02T16:42:06Z

@roryabraham I have fixed the linting issues :)

Luke9389

Hello @pranshuchittora. Nice work so far. I've got a few small comments and one larger one that I'll post below.

lib/ExpensiMark.js

Luke9389 · 2021-06-02T19:49:14Z

Do we want to match breaking space tags that are inside of code blocks?

If you sent a message like the following, what would we want our code to do?
"Hey fellow engineer, I noticed we are missing a `<br>` closing tag on line X."

Currently, I think we'd end up matching and replacing that tag, which is not desired behavior. Perhaps we should establish a method for preventing matches inside of code blocks. Thoughts?

Luke9389 · 2021-06-02T20:03:30Z

As a side note, I think you can put the changes from this PR here as well. I like the organization of having them separate, but I think it's best for our workflow to just have one PR to track.

pranshuchittora · 2021-06-04T15:01:18Z

"Hey fellow engineer, I noticed we are missing a
closing tag on line X."

Since this happens at runtime (during conversion), it becomes irrelevant.
My approach / philosophy is inspired by the way HTML works: instead of breaking, it makes assumptions and fixes the tree at runtime.

Luke9389 · 2021-06-07T18:43:41Z

> Since this happens at runtime (during conversion), it becomes irrelevant. My approach / philosophy is inspired by the way HTML works: instead of breaking, it makes assumptions and fixes the tree at runtime.

Interesting. Would you mind elaborating a bit further, for my sake? Also, is this something we can test?

pranshuchittora · 2021-06-07T18:51:28Z

> Since this happens at runtime (during conversion), it becomes irrelevant. My approach / philosophy is inspired by the way HTML works: instead of breaking, it makes assumptions and fixes the tree at runtime.
>
> Interesting. Would you mind elaborating a bit further, for my sake? Also, is this something we can test?

Sure. What I understand is that we should show some warning for an unclosed <br> in dev mode. This kinda becomes irrelevant, as messages are going to be generated by users at runtime.

The only way this becomes relevant is when input to HTML conversion happens. Instead of handling it here, I recommend writing robust test cases for the HTML conversion parser.

Correct me if I am missing something 🤔 @Luke9389

Luke9389 · 2021-06-07T21:11:33Z

> Sure. What I understand is that we should show some warning for an unclosed <br> in dev mode.

Ok, it's possible we're talking about two different things right now. My main concern is making sure that if someone puts a br tag inside backticks, we don't create a newline. I'm not sure what you're referring to with showing a warning.

pranshuchittora · 2021-06-08T17:16:07Z

> Sure. What I understand is that we should show some warning for an unclosed <br> in dev mode.
>
> Ok, it's possible we're talking about two different things right now. My main concern is making sure that if someone puts a br tag inside backticks, we don't create a newline. I'm not sure what you're referring to with showing a warning.

Yeah, I agree that's an edge case. I tried finding out how other parsers handle this. Unfortunately, one of the well-known parsers has buggy behaviour with this as well.

http://domchristie.github.io/turndown/

Paste this HTML 👇🏼

<code>
Hello There <br/>
</code>

Luke9389 · 2021-06-09T15:39:03Z

Couldn't we just use regex to accomplish this? I think negative lookahead will allow us to avoid matching a breaking tag wrapped in backticks. Something like this:

<br\s*[/]?>(?!`)

That way we won't match `<br>`

We already use negative lookahead in some of our other regex patterns in this file, so we should be safe to use it here.
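To make the proposal concrete, here is a small sketch of how that pattern behaves on a few strings. The global flag and the names `BR_NOT_BEFORE_BACKTICK` and `replaceBreaks` are additions for this demo and are not in the PR.

```javascript
// The pattern suggested above, with the g flag added so every match is replaced.
// (?!`) is a negative lookahead: the tag only matches when the very next
// character is not a backtick.
const BR_NOT_BEFORE_BACKTICK = /<br\s*[/]?>(?!`)/g;

// Hypothetical helper name, used only for this illustration.
function replaceBreaks(text) {
    return text.replace(BR_NOT_BEFORE_BACKTICK, '\n');
}

// A plain tag is replaced with a newline.
console.log(JSON.stringify(replaceBreaks('Hello<br>There')));

// A tag immediately followed by a closing backtick is left alone.
console.log(JSON.stringify(replaceBreaks('missing a `<br>` closing tag')));

// The lookahead only inspects one character, so a tag that is inside
// backticks but not directly before the closing one is still replaced.
console.log(JSON.stringify(replaceBreaks('`<br> and more`')));
```

The last case suggests the lookahead alone only covers a tag that sits right before the closing backtick.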

pranshuchittora · 2021-06-09T17:50:07Z

> I think negative lookahead will allow us to avoid matching a breaking tag wrapped in backticks

@Luke9389 backticks are from the markdown world.
Backticks are converted to <code> in HTML. The input to this will look like this

<code>
Hello There <br/>
</code>

Therefore having backticks in the input HTML is irrelevant and can be ignored in HTML -> MD parsing
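A quick sketch of that point, assuming the input really arrives as `<code>` markup rather than backticks. The variable names are hypothetical and the g flag is added for the demo.

```javascript
// The lookahead pattern from the earlier suggestion, with the g flag added.
const PROPOSED_PATTERN = /<br\s*[/]?>(?!`)/g;

// HTML input as described above: the code span is a <code> element, so there
// are no backticks for the lookahead to find.
const htmlInput = '<code>Hello There <br/></code>';

// The character after <br/> is "<", not a backtick, so the tag still matches
// and is replaced even though it sits inside <code>.
const converted = htmlInput.replace(PROPOSED_PATTERN, '\n');

console.log(JSON.stringify(converted));
```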

pranshuchittora · 2021-06-11T20:21:38Z

Hi @marcaaron, can you help merge this?

marcaaron · 2021-06-11T23:49:06Z

Sorry, @pranshuchittora please give the assigned reviewers time to handle this and if > 3 days ask in the Slack channel for help

Luke9389 · 2021-06-14T19:54:24Z

Hey @pranshuchittora, sorry for the late response here. I've been on vacation the past week.

Ok, I see what you're saying about the backticks, but we aren't out of the weeds yet.

First, I want to be sure we're on the same page. When someone sends a message with this, `<br>`, they're talking about a br tag, and so we shouldn't parse that to a new line. That gets sent, turned into HTML (<code><br></code>), and now when we are parsing that HTML back to markdown, we want to be sure we don't create a new line (we want it to be turned into `<br>`).

So in my view, we should have some way of preventing the content inside of code blocks from being parsed. Maybe that should be saved for a different PR? What do you think @pranshuchittora?

jasperhuangg

I agree with Luke's comments about ensuring that any code inside of code blocks isn't parsed.

pranshuchittora · 2021-06-15T08:58:19Z

> Hey @pranshuchittora, sorry for the late response here. I've been on vacation the past week.
>
> Ok, I see what you're saying about the backticks, but we aren't out of the weeds yet.
>
> First, I want to be sure we're on the same page. When someone sends a message with this, `<br>`, they're talking about a br tag, and so we shouldn't parse that to a new line. That gets sent, turned into HTML (<code><br></code>), and now when we are parsing that HTML back to markdown, we want to be sure we don't create a new line (we want it to be turned into `<br>`).
>
> So in my view, we should have some way of preventing the content inside of code blocks from being parsed. Maybe that should be saved for a different PR? What do you think @pranshuchittora?

Yup we are on the same page 👍🏼
Solving this is very tricky, as we can't match <code> directly: it can have attributes, like <code some-attribute="xyz" >. I checked a few parsers out there, as discussed in this comment #381 (comment)

IMO let’s solve that with another issue and PR
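For whoever picks up that follow-up, one possible shape is sketched below. It is only an illustration under stated assumptions: `replaceBreaksOutsideCode`, `CODE_BLOCK`, and `BR_TAG` are hypothetical names, and the regex assumes `<code>` elements are not nested and that attribute values contain no `>` character.

```javascript
// Matches a whole <code ...>...</code> element, including opening tags that
// carry attributes. The capturing group keeps the element in split() output.
const CODE_BLOCK = /(<code\b[^>]*>[\s\S]*?<\/code>)/gi;

// Matches <br>, <br/> and <br /> in any letter case.
const BR_TAG = /<br\s*\/?>/gi;

// Hypothetical helper: convert break tags to newlines everywhere except
// inside <code> elements.
function replaceBreaksOutsideCode(html) {
    return html
        .split(CODE_BLOCK)
        // With one capturing group, odd indices hold the captured <code>
        // elements and even indices hold the text between them.
        .map((part, index) => (index % 2 === 1 ? part : part.replace(BR_TAG, '\n')))
        .join('');
}

const sample = 'Hi<br>there <code some-attribute="xyz">a <br/> b</code> end<br/>';
console.log(JSON.stringify(replaceBreaksOutsideCode(sample)));
```

Splitting on the code element keeps its contents byte-for-byte intact, so any later rule added for non-code text would not need its own lookahead.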

Luke9389 · 2021-06-15T17:03:28Z

Yea, I think solving that in another PR is reasonable. It'll be its own issue with its own solution. I'm ready to merge as soon as this gets resolved: https://github.com/Expensify/expensify-common/pull/381/files#r644268511

pranshuchittora · 2021-06-15T17:24:26Z

> Yea, I think solving that in another PR is reasonable. It'll be its own issue with its own solution. I'm ready to merge as soon as this gets resolved: https://github.com/Expensify/expensify-common/pull/381/files#r644268511

https://github.com/Expensify/expensify-common/pull/381/files#r652000032

Luke9389

One more thing, can you add specific tests (numbered and detailed) for what Web QA should do to test this? Try and make it as easy for them as possible to understand what they need to do. Right now it's too vague.

feat: HTML to markdown parser

6494817

pranshuchittora requested a review from a team as a code owner May 29, 2021 14:49

MelvinBot requested review from Luke9389 and removed request for a team May 29, 2021 14:50

pranshuchittora mentioned this pull request May 29, 2021

feat: HTML to Markdown parser integration Expensify/App#3229

Merged

5 tasks

fix: Eslint fixes

265e00e

Luke9389 suggested changes Jun 2, 2021

fix: PR review fixes

e7e70d0

jasperhuangg approved these changes Jun 9, 2021

jasperhuangg self-requested a review June 15, 2021 07:05

jasperhuangg requested changes Jun 15, 2021

pranshuchittora requested a review from Luke9389 June 15, 2021 17:25

Luke9389 approved these changes Jun 15, 2021

jasperhuangg approved these changes Jun 17, 2021

Luke9389 merged commit c3465bf into Expensify:master Jun 17, 2021

roryabraham mentioned this pull request Jun 21, 2021

[Tracking Issue] Implement HTML -> Markdown conversions in Expensimark Expensify/App#3047

Closed

10 tasks

Luke9389 mentioned this pull request Jun 23, 2021

LHN - HTML encoding is visible in chat switcher Expensify/App#3673

Closed

Jag96 mentioned this pull request Jul 8, 2021

[HOLD for payment July 20] Markdown - Copied text (with markdown) does not show the formatting when pasted in e.cash Expensify/App#3790

Closed
