Whitespace

Treating Whitespace in Turndown

Treatment of whitespace in HTML is determined by its rendering in browsers. This is called whitespace collapsing. When Turndown processes HTML, the rule of thumb is:

The whitespace SHOULD be collapsed the HTML way if the generated Markdown would render differently than the original HTML.
The whitespace MIGHT be collapsed the HTML way, as long as it does not cause rendering differences.

The second principle allows Turndown to simplify several things, and its correct operation actually depends on it.
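
As a rough illustration of what collapsing "the HTML way" means for whitespace inside text, the sketch below folds runs of ASCII whitespace into a single space and leaves the non-breaking space alone. The name `collapseAsciiWhitespace` is hypothetical and this is not Turndown's implementation; real collapsing also depends on element boundaries and CSS.

```javascript
// Hypothetical helper, not part of Turndown: collapses each run of ASCII
// whitespace (space, tab, LF, FF, CR) into one space, roughly as browsers
// do for normal inline text. U+00A0 is deliberately not in the class.
function collapseAsciiWhitespace(text) {
  return text.replace(/[ \t\n\f\r]+/g, " ");
}

// JSON.stringify makes the string boundaries visible in the console.
console.log(JSON.stringify(collapseAsciiWhitespace("foo \n\t bar")));
console.log(JSON.stringify(collapseAsciiWhitespace("foo \u00A0 bar")));
```

In the second call the two ASCII spaces are separate runs around U+00A0, so nothing is merged.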

There is not much special about whitespace inside text. The situation is trickier for whitespace at the edges of text nodes.

Flanking Delimiter Run Treatment

See the CommonMark spec for more information.

The situation in Turndown 6.0.0 is as follows. Consider an element containing a text node, followed by another text node: foo(End)(Start)Next. The non-breaking space (`&nbsp;` = \u00A0) is written as · to improve readability.

#	End	Next Start	HTML collapse	Turndown operation	Example of Processing
1	ASCII WS	ASCII WS	eaten	no op	`<i>foo </i> bar` →`<i>foo</i> bar` →`_foo_ bar`
2	nonWS	ASCII WS	no-op	no op	`<i>foo</i> bar` →`<i>foo</i> bar` →`_foo_ bar`
3	ASCII WS	nonWS	no-op	move End outside	`<i>foo </i>bar` →`<i>foo </i>bar` →`_foo_ bar`
4	nonASCII WS	nonWS	no-op	•change End to 0x20 •move End outside	`<i>foo·</i>bar` →`<i>foo·</i>bar` →`_foo_ bar`
5	nonASCII WS	nonASCII WS	no-op	•change End to 0x20 •move End outside	`<i>foo·</i>·bar` →`<i>foo·</i>·bar` →`_foo_ ·bar`
6	ASCII WS	nonASCII WS	no-op	move End outside	`<i>foo </i>·bar` →`<i>foo </i>·bar` →`_foo_ ·bar`
7	nonASCII WS	ASCII WS	no-op	•output End as is •change End to 0x20 move End outside	`<i>foo·</i> bar` →`<i>foo·</i> bar` →`_foo·_ bar`

Cases 1 and 2 exactly match the rule of thumb. Let's discuss the other ones.
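
The seven cases can be reproduced with a small model. This is a sketch reconstructed from the table, not Turndown's source: the function name `emphasis600` is hypothetical, and it assumes that trailing whitespace is detected with `\s` (which matches U+00A0) while the following text is only checked for an ASCII space.

```javascript
// Hypothetical model of the 6.0.0 behavior for <i>inner</i>next.
function emphasis600(inner, next) {
  const asciiWs = /[ \t\n\f\r]/;
  // HTML collapsing: two adjacent ASCII whitespace characters become one.
  if (asciiWs.test(inner.slice(-1)) && asciiWs.test(next.charAt(0))) {
    inner = inner.slice(0, -1);
  }
  // Assumption: \s detects trailing whitespace, so U+00A0 counts too.
  const hasTrailing = /\s$/.test(inner);
  // Assumption: only an ASCII space counts as "already flanked".
  const flankedRight = next.startsWith(" ");
  let trailing = "";
  if (hasTrailing && !flankedRight) {
    trailing = " ";       // End is replaced by 0x20 and moved outside
    inner = inner.trim(); // trim() also removes U+00A0
  }
  return "_" + inner + "_" + trailing + next;
}

console.log(JSON.stringify(emphasis600("foo\u00A0", "bar")));  // case 4
console.log(JSON.stringify(emphasis600("foo\u00A0", " bar"))); // case 7
```

In case 7 the model skips trimming because the next text already starts with a space, which leaves U+00A0 directly before the closing delimiter.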

Turndown 6.0 Behavior Evaluation

Case 3: Moving whitespace outside of elements

Case 3 is a small change to HTML behavior, but a justifiable one:

Text content still matches.
It is not really unexpected; normal whitespace should be treated as fragile.
Such whitespace likely results from unintended input artefacts, e.g. mouse-selecting text and pressing the I button.
A strictly matching encoding, `_foo _bar`, is just too ugly given the above reasons.

On the other hand, this is also applied to inline elements that don't need it, specifically to <code>. E.g. (` foo `) in Markdown renders as (<code> foo </code>), but such HTML now converts back to ( `foo` ). This might be unintended even in the current code, as rules.code in commonmark-rules.js actually invokes trim() when testing for emptiness, which either just follows the letter of the CommonMark spec or also suggests that untrimmed content is expected. [DO-NOT-COLLAPSE-CODE-WS]

Cases 4-6: Unexpected and Misformatting Vulnerability

The technical issue behind the current behavior lies in the CommonMark spec. CommonMark requires some delimiters not to be adjacent to Unicode whitespace, whereas HTML whitespace collapsing works only with ASCII whitespace.
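
The mismatch between the two whitespace definitions can be sketched as follows. The helper `passesWhitespaceClause` is hypothetical and covers only the whitespace clause of CommonMark's right-flanking definition; the punctuation clauses and line boundaries are left out for brevity.

```javascript
// CommonMark "Unicode whitespace": any code point in the Zs category,
// plus tab, line feed, form feed and carriage return.
const UNICODE_WS = /[\p{Zs}\t\n\f\r]/u;
// The whitespace that HTML collapsing acts on.
const ASCII_WS = /[ \t\n\f\r]/;

// Hypothetical helper: a closing delimiter run must not be preceded by
// Unicode whitespace. This is a necessary condition only.
function passesWhitespaceClause(charBeforeDelimiter) {
  return !UNICODE_WS.test(charBeforeDelimiter);
}

console.log(passesWhitespaceClause("o"));      // letter before the delimiter
console.log(passesWhitespaceClause("\u00A0")); // U+00A0 before the delimiter
console.log(ASCII_WS.test("\u00A0"));          // is U+00A0 collapsible in HTML?
```

U+00A0 fails the CommonMark clause but is not ASCII whitespace, so HTML collapsing never removes it from the edge of an element.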

Suppose ~ means a non-breaking space (HTML `&nbsp;`, Unicode \u00A0). The current behavior has three issues:

Replacing Unicode whitespace with ASCII whitespace is not expected by users, e.g. Law §~1782 should not break after §. [RESPECT-ORIGINAL-WS]
Without extra escaping, replacing Unicode whitespace can produce false formatting, e.g. `~1. foo` and `1.~foo` both produce ordered lists that were not in the input. [RESPECT-ORIGINAL-WS]
Users do not expect ASCII and nonASCII whitespace to be merged, e.g. always add ~km as the distance unit should not collapse into a single space after add. [DO-NOT-COLLAPSE-MIXED-WS]
Some users might expect Unicode whitespace to be kept within emphasis elements, e.g. Law §~1782, which is achievable by using HTML entities in Markdown. But this can be considered similar to the situation with normal whitespace, which is actually moved. So we prefer moving it over introducing conversion to HTML entities. ~~[DO-NOT-MOVE-UNICODE-WS]~~
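
The false-formatting risk can be checked with a simplified test for an ordered list marker. This is a sketch under stated assumptions: `ORDERED_LIST_START` approximates the CommonMark rule with a regular expression, and `replaceNbsp` is a hypothetical stand-in for the replacement described above, not a Turndown function.

```javascript
// Simplified CommonMark ordered list marker: up to three spaces of indent,
// 1-9 digits, "." or ")", then a space, a tab, or the end of the line.
const ORDERED_LIST_START = /^ {0,3}\d{1,9}[.)](?:[ \t]|$)/;

// Hypothetical stand-in for "change U+00A0 to 0x20".
function replaceNbsp(text) {
  return text.replace(/\u00A0/g, " ");
}

for (const input of ["\u00A01. foo", "1.\u00A0foo"]) {
  const output = replaceNbsp(input);
  // Logs whether the text looks like a list item before and after.
  console.log(ORDERED_LIST_START.test(input), ORDERED_LIST_START.test(output));
}
```

Neither input matches the marker pattern while it contains U+00A0, and both match once the non-breaking space has been replaced.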

Case 7: Broken Flanking Delimiter Run

Case 7 adds the extra issue of broken formatting on top of the previous issues. This is partially due to an implementation detail of how it is decided when the content should be trim()med. ~~[TRIM-REGARDLESS-OF-WS-DETECTION]~~

The issue would also occur if HTML whitespace was not collapsed. But it is actually collapsed, and the enabler of this issue is the aforementioned [DO-NOT-COLLAPSE-MIXED-WS].

Changes Made

Unicode Whitespace Treatment

Successful completion of [RESPECT-ORIGINAL-WS] and [DO-NOT-COLLAPSE-MIXED-WS] leads to the following results. Same as above, · represents \u00A0 and `&nbsp;`.

#	Name	Input	Output
4	element with trailing nonASCII WS followed by nonWS	`<i>foo·</i>bar`	`_foo_·bar`
5	element with trailing nonASCII WS followed by nonASCII WS	`<i>foo·</i>·bar`	`_foo_··bar`
6	element with trailing ASCII WS followed by nonASCII WS	`<i>foo </i>·bar`	`_foo_ ·bar`
7	element with trailing nonASCII WS followed by ASCII WS	`<i>foo·</i> bar`	`_foo_· bar`
4 mirrored	nonWS followed by element with leading nonASCII WS	`foo<i>·bar</i>`	`foo·_bar_`
5 mirrored	nonASCII WS followed by element with leading nonASCII WS	`foo·<i>·bar</i>`	`foo··_bar_`
6 mirrored	nonASCII WS followed by element with leading ASCII WS	`foo·<i> bar</i>`	`foo· _bar_`
7 mirrored	ASCII WS followed by element with leading nonASCII WS	`foo <i>·bar</i>`	`foo ·_bar_`
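
The rows above are consistent with a simple rule: whitespace at the edges of the element is moved outside the delimiters character for character, without replacing or merging it. A minimal sketch of that rule follows; `convertEmphasis` is a hypothetical name, this is not Turndown's source, and blank elements and ASCII collapsing are not handled.

```javascript
// Hypothetical model of the revised behavior for before<i>inner</i>after.
function convertEmphasis(before, inner, after) {
  // Split into leading whitespace, core text and trailing whitespace.
  // \s matches ASCII whitespace as well as U+00A0.
  const [, leading, core, trailing] = inner.match(/^(\s*)([\s\S]*?)(\s*)$/);
  return before + leading + "_" + core + "_" + trailing + after;
}

console.log(JSON.stringify(convertEmphasis("", "foo\u00A0", " bar"))); // row 7
console.log(JSON.stringify(convertEmphasis("foo ", "\u00A0bar", ""))); // row 7 mirrored
```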

Inline Code Whitespace

[DO-NOT-COLLAPSE-CODE-WS] is slightly more tricky to describe as it has a few preconditions:

It is only meaningful when the <code> element is treated as a preformatted inline element (like in GitLab). See this issue at the collapse-whitespace project and its fix.
Although it is harmless to assume that code is always inline-preformatted, there might still be users expecting the old behavior, so making this configurable makes sense.
flankingWhitespace() has to match this setting.
And rules.code in commonmark-rules.js contains a minor bug, which has to be fixed.

This might sound complicated, but the code is actually very small and leads to the following results when the preformattedCode setting is enabled:

Input	Output
`An <code> indented code line</code>`	An ` indented code line`
`(<code> foo </code>)`	(` foo `)
`(<i> <code> bar </code> </i>)`	( _` bar `_ )
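
The difference between the two modes can be modeled for a single <code> element as below. This is a hedged sketch, not Turndown's source: `convertInlineCode` is a hypothetical function, its options object only mirrors the preformattedCode setting named above, and code containing backticks is not handled.

```javascript
// Hypothetical model of converting the content of one <code> element.
function convertInlineCode(content, options = {}) {
  if (options.preformattedCode) {
    // Preformatted: edge whitespace stays inside the code span.
    return "`" + content + "`";
  }
  // Old behavior: edge whitespace is trimmed and moved outside.
  const leading = /^\s/.test(content) ? " " : "";
  const trailing = /\s$/.test(content) ? " " : "";
  return leading + "`" + content.trim() + "`" + trailing;
}

console.log("(" + convertInlineCode(" foo ", { preformattedCode: true }) + ")");
console.log("(" + convertInlineCode(" foo ") + ")");
```

The first call keeps the spaces inside the backticks as in the table, while the second reproduces the older output mentioned for case 3.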

See the behavior in GitLab:
