feat: HTML to markdown parser #381
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅ |
I have read the CLA Document and I hereby sign the CLA |
@pranshuchittora I've run the test check suite on this PR and found that there are lint errors that need to be resolved |
@roryabraham I have fixed the linting issues :) |
Hello @pranshuchittora. Nice work so far. I've got a few small comments and one larger one that I'll post below.
Do we want to match breaking space tags that are inside of code blocks? If you sent a message like the following, what would we want our code to do? Currently, I think we'd end up matching and replacing that tag, which is not desired behavior. Perhaps we should establish a method for preventing matches inside of code blocks. Thoughts? |
As a side note, I think you can put the changes from this PR here as well. I like the organization of having them separate, but I think it's best for our workflow to just have one PR to track. |
Since this happens at runtime (during conversion), it becomes irrelevant. |
Interesting. Would you mind elaborating a bit further, for my sake? Also, is this something we can test? |
Sure, what I understand is that we should show some warning for unclosed The only way this becomes relevant is when input to HTML conversion happens. Instead of handling it here, I recommend writing robust test cases for the HTML conversion parser. Correct me if I am missing something 🤔 @Luke9389 |
Ok it's possible we're talking about two different things right now. My main concern is that if someone puts a br tag inside backticks, that we won't make a newline. I'm not sure what you're referring to with showing a warning. |
Yeah, I agree that's an edge case. I tried finding out how other parsers handle this; unfortunately, one of the popular parsers has buggy behaviour with this as well. http://domchristie.github.io/turndown/ Paste this HTML 👇🏼
|
Couldn't we just use regex to accomplish this? I think negative lookahead will allow us to avoid matching a breaking tag wrapped in backticks. Something like this: <br\s*[/]?>(?!`) That way we won't match We already use negative lookahead in some of our other regex patterns in this file, so we should be safe to use it here. |
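For reference, here is a minimal sketch of how the negative-lookahead pattern suggested above behaves in JavaScript. The function name `replaceBreaks` is hypothetical and this is not the regex that ships in the parser. It only illustrates that the lookahead checks the single character after the tag, so it protects a tag that is immediately followed by a backtick but not one that sits elsewhere inside a backtick span.

```javascript
// Pattern copied from the comment above; an illustration only,
// not the rule used by the actual parser.
const BR_NOT_BEFORE_BACKTICK = /<br\s*[/]?>(?!`)/g;

// Hypothetical helper: replace every matching <br> tag with a newline.
function replaceBreaks(text) {
    return text.replace(BR_NOT_BEFORE_BACKTICK, '\n');
}

// A plain tag is replaced.
console.log(JSON.stringify(replaceBreaks('Hello<br/>There')));  // "Hello\nThere"

// A tag immediately followed by a backtick is left alone.
console.log(JSON.stringify(replaceBreaks('Use `<br/>` here'))); // "Use `<br/>` here"

// Limitation: the lookahead only inspects the next character, so a tag
// inside backticks but followed by other text is still replaced.
console.log(JSON.stringify(replaceBreaks('`<br/> text`')));     // "`\n text`"
```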
@Luke9389 backticks are from the markdown world. <code>
Hello There <br/>
</code> Therefore, having backticks in the input HTML is irrelevant and can be ignored in HTML -> MD parsing. |
Hi @marcaaron, can you help merge this? |
Sorry, @pranshuchittora please give the assigned reviewers time to handle this and if > 3 days ask in the Slack channel for help |
Hey @pranshuchittora, sorry for the late response here. I've been on vacation the past week. Ok, I see what you're saying about the backticks, but we aren't out of the weeds yet. First, I want to be sure we're on the same page. When someone sends a message with this, So in my view, we should have some way of preventing the content inside of code blocks from being parsed. Maybe that should be saved for a different PR? What do you think @pranshuchittora? |
I agree with Luke's comments about ensuring that any code inside of code blocks isn't parsed.
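One possible way to keep code content from being parsed, sketched here only to make the concern concrete: match code spans and break tags in a single pass, and return the code spans unchanged. The helper name `replaceBreaksOutsideCode` and the choice of `<code>`/`<pre>` as the protected tags are assumptions for illustration, not the approach this repository adopted, and the sketch does not handle nested code tags.

```javascript
// Hypothetical pattern: the first alternative captures a whole <code> or <pre>
// span, the second matches a standalone <br>, <br/> or <br /> tag.
const CODE_SPAN_OR_BR = /(<(?:code|pre)\b[^>]*>[\s\S]*?<\/(?:code|pre)>)|<br\s*\/?>/gi;

// Hypothetical helper: convert <br> tags to newlines only outside code spans.
function replaceBreaksOutsideCode(html) {
    return html.replace(CODE_SPAN_OR_BR, (match, codeSpan) => (
        // When the capture group matched, the match is a code span: keep it as is.
        codeSpan !== undefined ? codeSpan : '\n'
    ));
}

console.log(JSON.stringify(
    replaceBreaksOutsideCode('Hello<br/>There <code>a<br/>b</code>'),
)); // "Hello\nThere <code>a<br/>b</code>"
```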
Yup we are on the same page 👍🏼 IMO let’s solve that with another issue and PR |
Yea, I think solving that in another PR is reasonable. It'll be its own issue with its own solution. I'm ready to merge as soon as this gets resolved: https://github.com/Expensify/expensify-common/pull/381/files#r644268511 |
https://github.com/Expensify/expensify-common/pull/381/files#r652000032 |
One more thing, can you add specific tests (numbered and detailed) for what Web QA should do to test this? Try and make it as easy for them as possible to understand what they need to do. Right now it's too vague.
@roryabraham @jasperhuangg @marcaaron will you please review this?
Context: Expensify/App#2847
Tests
What unit/integration tests cover your change? What autoQA tests cover your change?
Yes I have added tests
What tests did you perform that validate your changes worked?
Yes
QA
What does QA need to do to validate your changes?
Try various HTML strings with <br> tags in various combinations.
HTML -> MD
Hello<br/>There -> Hello\nThere
Hello<br></br>There -> Hello\nThere
Tests -> https://github.com/Expensify/expensify-common/pull/381/files#diff-7031bf36a0d060dfeb1742c45b8fa6853fb5790ef4da366b0fb96ceaafd342ea
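The two QA cases above can be checked with a small script such as the following. This is a sketch under stated assumptions: `htmlBreaksToNewlines` is a hypothetical stand-in for the parser's break rule, written only so the expected outputs can be exercised outside the real test suite linked above.

```javascript
// Hypothetical stand-in for the parser's <br> rule: treats <br>, <br/> and
// <br></br> as a single line break.
function htmlBreaksToNewlines(html) {
    return html.replace(/<br\s*\/?>(?:<\/br>)?/gi, '\n');
}

// Input/expected pairs taken from the QA steps above.
const qaCases = [
    ['Hello<br/>There', 'Hello\nThere'],
    ['Hello<br></br>There', 'Hello\nThere'],
];

for (const [input, expected] of qaCases) {
    const actual = htmlBreaksToNewlines(input);
    console.log(actual === expected ? 'PASS' : 'FAIL', JSON.stringify(input));
}
```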
What areas do they need to test for regressions?
Parser