Extra space after a closing emphasis mark #405

Open
ropery opened this issue Dec 24, 2023 · 2 comments

ropery commented Dec 24, 2023

$ echo '<em>hello</em>'{\,,\",:,\[,.,\!,\?}'<br>' | html2text
_hello_ ,  
_hello_ "  
_hello_ :  
_hello_[  
_hello_.  
_hello_!  
_hello_?  

Note that in the first three lines of the output, there is an extra space after the closing _ emphasis mark.

This is a bug, because Markdown has no problem with punctuation immediately following the closing emphasis mark:

$ echo _hello_{\,,\",:,\[,.,\!,\?} | markdown
<p><em>hello</em>, <em>hello</em>&ldquo; <em>hello</em>: <em>hello</em>[ <em>hello</em>. <em>hello</em>! <em>hello</em>?</p>

The same rendered by GitHub: hello, hello" hello: hello[ hello. hello! hello?

I guess the extra space is added here:

# space is only allowed after *all* emphasis marks
if (bold or italic) and not self.emphasis:
    self.o(" ")

Or here, which explains why the bottom four results don't have the extra space:

elif self.preceding_stressed:
    if (
        re.match(r"[^][(){}\s.!?]", data[0])
        and not hn(self.current_tag)
        and self.current_tag not in ["a", "code", "pre"]
    ):
        # should match a letter or common punctuation
        data = " " + data
    self.preceding_stressed = False
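
To check the second guess in isolation, here is a minimal sketch that applies only the character class from that condition to the seven characters tested at the top. The helper name gets_leading_space is made up for illustration, and the tag checks (hn, current_tag) are deliberately left out, so this is not html2text's actual code path, just the regex on its own.

```python
import re

# Character class copied verbatim from the condition quoted above.
NEEDS_SPACE = re.compile(r"[^][(){}\s.!?]")


def gets_leading_space(first_char: str) -> bool:
    """Hypothetical helper: True if the regex alone would cause " " to be prepended."""
    return NEEDS_SPACE.match(first_char) is not None


# The same seven characters used in the shell reproduction.
for ch in [",", '"', ":", "[", ".", "!", "?"]:
    print(repr(ch), gets_leading_space(ch))
```

The first three characters are not in the excluded set and so match, while [ . ! ? are excluded, which is consistent with the seven output lines shown earlier.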


ropery commented Dec 24, 2023

I would like to add that maybe we should simply not add extra spaces around stressed text:

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}"; done
_foo_bar_baz_
*foo*bar*baz*
__foo__bar__baz__
**foo**bar**baz**

My markdown produces:

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}" | markdown; done
<p><em>foo_bar_baz</em></p>
<p><em>foo</em>bar<em>baz</em></p>
<p><strong>foo</strong>bar<strong>baz</strong></p>
<p><strong>foo</strong>bar<strong>baz</strong></p>

But GitHub's rendering disagrees for the third __foo__bar__baz__:
foo_bar_baz
foobarbaz
foo__bar__baz
foobarbaz

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}" | markdown | html2text; done
_foo_bar_baz_

_foo_ bar _baz_

**foo** bar**baz**

**foo** bar**baz**

So it seems that if we want to add extra spaces, it should be only when the stress mark is _ or __. The * and ** marks don't require extra spaces for Markdown to apply the stress, e.g., ***a**b* -> ab = ok.

This leads to the question: should -e be the default, or should html2text automatically use * wherever _ would require extra spaces (which irreversibly distort the text)?
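
The rule suggested by the experiments above could be sketched roughly as follows. This is a simplified reading of the CommonMark flanking rules, which is what GitHub's rendering appears to follow. Other Markdown implementations (such as the markdown command used above) may treat __ differently. Both function names are hypothetical and nothing here is taken from html2text.

```python
import string
import unicodedata


def is_punctuation(ch: str) -> bool:
    """Approximation: ASCII punctuation or any Unicode 'P' category character."""
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def needs_space_after_closing(mark: str, next_char: str) -> bool:
    """Hypothetical helper: must a space follow the closing mark for it to close emphasis?

    Assumes the mark directly follows the emphasized text; next_char is the
    single character after the mark, or "" at the end of input.
    """
    if not mark.startswith("_"):
        # * and ** can close emphasis even when a letter follows.
        return False
    if next_char == "" or next_char.isspace():
        return False
    # An underscore run followed by punctuation can still close emphasis;
    # followed by a letter or digit it is treated as intraword and cannot.
    return not is_punctuation(next_char)


for mark, nxt in [("_", ","), ("_", '"'), ("_", ":"), ("_", "b"), ("*", "b"), ("**", "b")]:
    print(mark, repr(nxt), needs_space_after_closing(mark, nxt))
```

Under these assumptions, only an underscore mark followed by a letter reports that a space is needed. The punctuation cases from the original report, and every * case, do not.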


epic1219 commented Oct 23, 2024

html2text/__init__.py
418,427
< # if (
< #     start
< #     and self.preceding_data
< #     and self.preceding_data[-1] not in string.whitespace
< #     and self.preceding_data[-1] not in string.punctuation
< # ):
< #     emphasis = " " + self.emphasis_mark
< #     self.preceding_data += " "
< # else:
< emphasis = self.emphasis_mark
871
< and self.current_tag not in ["a", "code", "pre", "strong", "em"]
