SimpleRTF2HTMLConverter removes some valid tags during conversion #9
Comments
This library is a pain to debug (I didn't build it from the ground up). I'll gladly take PRs if you have time for it :) Meanwhile, I'll see if there are some quick wins from your bug reports (thanks for them, btw!)
Regarding this specific case, I can easily include an exception for -->, but are there other types like this? Right now the regex generically (and aggressively) cleans up the content.
//filtering embedded tags like {\*\htmltag64 <tr>} }
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+(<.+>)\\}", "$1");
//filtering embedded tags like {\*\htmltag84 +}
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");
Basically what happens is that any HTML that is not in tags is removed.
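To make the effect of these two replacements concrete, here is a minimal, self-contained sketch that applies them to single htmltag groups. The class and method names (`HtmlTagFilterDemo`, `filter`) are hypothetical and exist only for illustration; the two `replaceAll` calls are copied from the comment above.

```java
// Hypothetical demo class; only the two replaceAll calls come from the thread.
final class HtmlTagFilterDemo {

    // Applies the two clean-up replacements quoted above, in the same order.
    static String filter(String rtfFragment) {
        String replacedText = rtfFragment;
        // keeps the embedded tag of groups like {\*\htmltag64 <tr>}
        replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+(<.+>)\\}", "$1");
        // drops every remaining group that contains no '<', e.g. {\*\htmltag84 +}
        replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");
        return replacedText;
    }

    public static void main(String[] args) {
        // Each group is filtered on its own to keep the demo simple.
        System.out.println("[" + filter("{\\*\\htmltag64 <tr>}") + "]");
        System.out.println("[" + filter("{\\*\\htmltag84 +}") + "]");
        System.out.println("[" + filter("{\\*\\htmltag241 -->}") + "]");
    }
}
```

The first group keeps its tag, while the other two are replaced with an empty string because they contain no `<`. That is why the closing `-->` of an HTML comment is lost.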
I don't quite understand what we are removing here. For example, in lines
Everything that is an HTML tag encoded as This is because RTF can be quite polluted with crap htmltags. Discerning which ones to keep is nigh impossible, except for specific exceptions like your -->
Ok, I wrote a simple test - tried to parse 10 really complex emails with and without these two lines. Now I see what you mean - HTML entities are present twice in RTF, once as
I suggest replacing the two lines above with the following:
Btw, the same code is in the simple-java-mail library. Are you planning to make the simple-java-mail library depend on this one and remove the parsing classes from there? P.S. Both libraries are very cool, convenient and high-quality.
I compared the resulting HTML with the one produced by Outlook (when right-clicking on an email and selecting View Source): the two still have many small differences, many small bugs. My thinking is that the class SimpleRTF2HTMLConverter needs to be completely rewritten and should use ANTLR. I can probably do it in a month or so; I'm too busy to do it right now.
Simple Java Mail already depends on this library.
Oh sure, I'm just happy to get a workable result at all at this point. Having an exact parser would be great though. A long time ago, I searched for (lightweight) libraries dedicated to converting RTF to HTML, but at the time they didn't exist for Java. Maybe there's something out there now, though. I'll have a gander.
Awesome, I'll have a look.
Released v1.3.0
Released v1.4.0, which now uses the new and improved spec-compliant RTF converter!
The following line in Outlook msg RTF file:
{\*\htmltag241 -->}
should turn into
-->
during conversion to HTML. However, it is replaced with an empty string by line 139 of SimpleRTF2HTMLConverter:
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");
Thus the resulting HTML file is broken and has an invalid structure.
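One possible way to special-case the comment terminator is sketched below. This is an assumption for illustration, not the fix that was actually released: the class name `CommentAwareFilter`, the method `filter`, and the extra pattern for `-->` are all hypothetical. The idea is to rewrite `{\*\htmltagN -->}` groups to `-->` before the generic clean-up runs.

```java
// Hypothetical sketch; not the library's actual implementation.
final class CommentAwareFilter {

    static String filter(String rtfFragment) {
        String replacedText = rtfFragment;
        // Assumed exception: keep the HTML comment terminator from groups like {\*\htmltag241 -->}
        replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+\\s*-->\\}", "-->");
        // Original rule: keep the embedded tag of groups like {\*\htmltag64 <tr>}
        replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+(<.+>)\\}", "$1");
        // Original rule: drop every remaining htmltag group that contains no '<'
        replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");
        return replacedText;
    }

    public static void main(String[] args) {
        System.out.println(filter("{\\*\\htmltag241 -->}"));
    }
}
```

This only covers the one case reported here. Other content that sits outside of tags would still be removed by the last rule.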