Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleRTF2HTMLConverter removes some valid tags during conversion #9

Closed
fadeyev opened this issue Sep 29, 2019 · 10 comments
Closed

SimpleRTF2HTMLConverter removes some valid tags during conversion #9

fadeyev opened this issue Sep 29, 2019 · 10 comments
Labels
Milestone

Comments

@fadeyev
Copy link

fadeyev commented Sep 29, 2019

The following line in Outlook msg RTF file:
{\*\htmltag241 -->}
should turn into --> during conversion to HTML.
However it's replace with empty string by line 139 of SimpleRTF2HTMLConverter
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");
Thus resulting HTML file is broken and has invalid structure

@bbottema
Copy link
Owner

This library is a pain to debug (I didn't build it from the ground up). I'll gladly take PR's if you have time for it :)

Meanwhile, I'll see if there are some quick wins from your bug reports (thanks for them btw!)

@bbottema
Copy link
Owner

bbottema commented Sep 29, 2019

Regarding this specific case, I can easily include an exception for -->, but are there other types like this? Right now the regex generically (and aggressively) clean up the content.

//filtering embedded tags like {\*\htmltag64 <tr>}                                          }
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+(<.+>)\\}", "$1");
//filtering embedded tags like {\*\htmltag84 &#43;}
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");

Basically what happens is that any HTML that is not in tags is removed.

@fadeyev
Copy link
Author

fadeyev commented Sep 29, 2019

I don't quite understand what we are removing here. For example, in lines {\*\htmltag64 <tr>} and {\*\htmltag84 &#43;}, aren't <tr> and + sign parts of the original HTML email content?
I tried to just remove the second replacement completely and resulting HTMLs look fine to me, tried on several emails.

@bbottema
Copy link
Owner

bbottema commented Sep 29, 2019

I don't quite understand what we are removing here.

Everything that is a HTML tag encoded as {htmltag <thetag>} is replaced by only <thetag>. Then the remaining encoded tags {htmltag whatever} are removed. You won't see the difference if the second type of encoded tags don't exist in your email. So this removes -->, but keeps for example <span>--></span>.

This is because RTF can be quite polluted with crap htmltags. Discerning which ones to keep is nigh impossible, except for specific exceptions like your - ->

@fadeyev
Copy link
Author

fadeyev commented Sep 29, 2019

Ok, I wrote a simple test - tried to parse 10 really complex emails with and without these two lines. Now I see what you mean - HTML entities are present twice in RTF, once as {htmltag &#43;} and then goes \htmlrtf +
Anyway, here's the summary of what got removed as a result of the test on 10 really complex files:

  1. <title> tag contents - not sure if it's a bug or not - I guess I'd like it to be kept
  2. part of CSS style definition looking like html, body {...} - seems like a bug as well
  3. Piece of HTML that was initially reported: -->
  4. Hundreds of &nbsp; - that's fine
  5. HTML entities like &#43;, &amp;, &gt;, etc. which duplicated by \htmlrtf - actually it's a bug as well, for &gt; it will produce invalid html

I suggest replacing two lines above with he following:
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+ (&[#\\w]+;)}\\\\htmlrtf.*\\\\htmlrtf0 ", "$1");
This works only for HTML entities and leaves out only HTML entity in place instead of the actual character.
I tested it on 10 complex emails - seems to work fine

@fadeyev
Copy link
Author

fadeyev commented Sep 29, 2019

Btw, the same code is in simple-java-mail library. Are you planning to make simple-java-mail library depend on this one and remove parsing classes from there?

P.S. both library are very cool, convenient and high-quality

@fadeyev
Copy link
Author

fadeyev commented Sep 29, 2019

I compared resulting HTML with the one produced by Outlook (when right clicking on an email and selecting View Source) - the two still have many small differences, many small bugs. My thinking that the class SimpleRTF2HTMLConverter needs to be completely rewritten and should use ANTLR. Probably I can do it in a month or so - too busy to do it right now.

@bbottema
Copy link
Owner

bbottema commented Sep 30, 2019

Btw, the same code is in simple-java-mail library. Are you planning to make simple-java-mail library depend on this one and remove parsing classes from there?

Simple Java Mail already depends on this library.

My thinking that the class SimpleRTF2HTMLConverter needs to be completely rewritten and should use ANTLR.

Oh sure, I'm just happy to get a workable result at all at this point. Having an exact parser would be great though.

Long time ago, I searched for (lightweight) libraries dedicated to converting RTF to HTML, but at the time they didn't exist for Java. Maybe there's something out there now, though. I'll have a gander.

I suggest replacing two lines above with he following:
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+ (&[#\\w]+;)}\\\\htmlrtf.*\\\\htmlrtf0 ", "$1");
This works only for HTML entities and leaves out only HTML entity in place instead of the actual character.
I tested it on 10 complex emails - seems to work fine

Awesome, I'll have a look.

Repository owner deleted a comment from fadeyev Sep 30, 2019
@bbottema bbottema added this to the 1.2.2 milestone Sep 30, 2019
@bbottema bbottema closed this as completed Oct 4, 2019
@bbottema
Copy link
Owner

bbottema commented Oct 4, 2019

Released v1.3.0

fadeyev pushed a commit to fadeyev/outlook-message-parser that referenced this issue Oct 12, 2019
fadeyev pushed a commit to fadeyev/outlook-message-parser that referenced this issue Oct 12, 2019
fadeyev pushed a commit to fadeyev/outlook-message-parser that referenced this issue Oct 12, 2019
@bbottema
Copy link
Owner

Released v1.4.0, which now uses the new and improved spec-compliant RFC converter!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants