Missing spaces between tags when using translate_html #71
Comments
Thanks for reaching out! I could indeed reproduce your problem. I just checked the code, and we do not seem to remove the spaces intentionally. This might be done by any of the external translators. Because we don't expect every translator to support HTML translation, we have to split the HTML into its components, translate each one separately, and reassemble everything at the end. As a side effect, each component is treated as an independent string, so any cleaning (stripping the spaces, for example) is applied to every component.
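To make the side effect concrete, here is a minimal sketch of the split / translate / reassemble idea. It is not the library's actual implementation: `fake_translate`, `TAG_RE` and `translate_html_naive` are hypothetical names, and the "translator" only strips whitespace, which is an assumption about how an external service might clean its input.

```python
import re

# Hypothetical stand-in for an external translator: it returns the text
# unchanged except for stripped edges, as a cleaning step might do.
def fake_translate(text: str) -> str:
    return text.strip()

# Very rough tag matcher, good enough for this illustration only.
TAG_RE = re.compile(r"(<[^>]+>)")

def translate_html_naive(html: str, translate=fake_translate) -> str:
    """Translate each text component on its own, then glue everything back."""
    out = []
    for part in TAG_RE.split(html):
        if TAG_RE.fullmatch(part) or not part.strip():
            out.append(part)             # tags and empty pieces are kept as-is
        else:
            out.append(translate(part))  # each text node is cleaned separately
    return "".join(out)

result = translate_html_naive(
    "<p>I am a student <strong>and you are a teacher</strong></p>"
)
print(result)
```

Because "I am a student " is handed to the translator as a standalone string, its trailing space is stripped and the two text nodes end up glued together after reassembly.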
Now, the problem is that we don't know what kind of cleaning the translators do, and different translators might even end up translating the different components. For languages with a different structure, the translator might add or remove specific symbols that carry meaning in the target language. The order of the symbols within a single phrase might also need to change. Now, if we introduce a basic check before translating to see whether we need to re-add spaces after the translation ...
if tail_space_before_translation and not result.endswith(" "):
    result += " "
... it might work for translations into Latin-based languages, but the translator might have deleted the spaces for a reason. For example (I will use my native languages for simplicity):
<p>Je suis un étudiant <strong>et vous êtes un professeur</strong></p>
should be translated into Japanese as
<p>僕は生徒で<strong>あなたは先生です</strong></p>
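For illustration, the space check could be sketched as below. This is only a sketch under assumptions: `translate_keeping_edges` and its `restore_spaces` flag are hypothetical, and deciding when to turn the flag off (for a target language like Japanese, for instance) is left to the caller, which is exactly the hard part.

```python
from typing import Callable

def translate_keeping_edges(
    text: str,
    translate: Callable[[str], str],
    restore_spaces: bool = True,
) -> str:
    """Translate `text` and optionally put back its leading/trailing whitespace."""
    if not text.strip():
        return text  # whitespace-only component: nothing to translate

    # Remember the whitespace around the component before translating.
    leading = text[: len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]

    result = translate(text.strip())
    if not restore_spaces:
        return result  # assumption: the caller knows spaces are unwanted here

    if leading and not result[:1].isspace():
        result = leading + result
    if trailing and not result[-1:].isspace():
        result += trailing
    return result

# Hypothetical translators, used only to exercise the helper.
kept = translate_keeping_edges("I am a student ", lambda s: s.strip())
japanese = translate_keeping_edges(
    "Je suis un étudiant ", lambda s: "僕は生徒で", restore_spaces=False
)
```

With `restore_spaces=True` the Japanese output would get an unwanted space before the `<strong>` tag, which is why a blanket check like this is not enough on its own.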
We see that this behavior is also found when translating with translate_html:
>>> from translatepy import Translate
>>> t = Translate()
>>> r = t.translate_html("<p>Je suis un étudiant <strong>et vous êtes un professeur</strong></p>", "Japanese")
>>> r
'<p>私は学生です<strong>そして、あなたは先生です</strong></p>'
I would need to come up with a better algorithm to translate HTML content without losing the context (both language-wise and HTML-wise), but I guess that would require complex NLP. If you have any ideas, I would welcome them.
If you have any questions or issues, feel free to ask! Oh, and sorry for being a bit inactive lately, but school work is way busier than what I had before...
Closing this for now, since it has been a while since there was any activity here. I partly continued this discussion in #93, if you are interested. Feel free to reply if you want to reopen it!