Attribute escaping difference between Gutenberg and WP #9915

johngodley · 2018-09-14T16:10:07Z

Describe the bug
There appears to be a difference between how Gutenberg (specifically React) encodes HTML attributes, and how WordPress does, and this can lead to some mangling.

React encodes attributes according to the HTML spec, and > is allowed. This means test > thing is stored as-is in an attribute.

WordPress has other ideas, though. Here's an image created by Gutenberg. Note the alt tag has a value of test > thing (I've added returns in examples to make them look better):

<div class="wp-block-image">
<figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing" class="wp-image-10"/>
</figure>
</div>

WordPress renders this as:

<div class="wp-block-image">
<figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing&#8221; class=&#8221;wp-image-10&#8243;/></figure></div>

Note how the terminating quote in the alt attribute has been smart-quoted, and then everything else falls to pieces.

The culprit is wptexturize, and specifically the _get_wptexturize_split_regex regex which 'parses' the HTML using regex. It sees the > in the attribute, thinks it's an HTML tag, and the fun begins.

To test:

wp shell
wptexturize('<div class="wp-block-image"><figure class="aligncenter"><img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing" class="wp-image-10"/></figure></div>’)

=>
string(197) "<div class="wp-block-image"><figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing&#8221; class=&#8221;wp-image-10&#8243;/></figure></div>"

Going further you can see the > causes the regex to split (array position 2 and 3):

wp shell
preg_split( _get_wptexturize_split_regex( '' ), '<div class="wp-block-image"><figure class="aligncenter"><img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing" class="wp-image-10"/></figure></div>', -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY )

=>
array(6) {
  [0] =>
  string(28) "<div class="wp-block-image">"
  [1] =>
  string(28) "<figure class="aligncenter">"
  [2] =>
  string(79) "<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test >"
  [3] =>
  string(29) " thing" class="wp-image-10"/>"
  [4] =>
  string(9) "</figure>"
  [5] =>
  string(6) "</div>"
}

Changing wptexturize seems a huge headache. Maybe the best action is for Gutenberg to encode all attributes in a way compatible with _get_wptexturize_split_regex, which probably means converting the > to > (or just disabling wptexturize on every blog...):

wp shell
wptexturize('<div class="wp-block-image"><figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test &gt; thing" class="wp-image-10"/></figure></div>')
=>
string(182) "<div class="wp-block-image"><figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test &gt; thing" class="wp-image-10"/></figure></div>"

Note the above problem affects custom class names, and probably other blocks where data is directly output to an attribute.

This matches the problem described in #8779

The text was updated successfully, but these errors were encountered:

johngodley · 2018-09-14T16:24:47Z

Adding to this, if you use the classic editor to insert an image with an alt tag it is added with the encoded >:

<img class="alignnone wp-image-175 size-medium" src="http://latest.local/wp-content/uploads/2018/09/murray-12-240x300.jpg" alt="test > this" width="240" height="300" />

notnownikki · 2018-09-14T16:35:04Z

Related: #9908 #5897

dmsnell · 2018-09-14T16:45:23Z

https://www.w3.org/TR/html5/syntax.html#elements-attributes

notnownikki · 2018-09-14T17:02:13Z

So reading through the spec, it seems to me like > cannot appear unescaped in attribute values.

https://www.w3.org/TR/html5/syntax.html#unquoted :

must not contain any literal space characters,
any U+0022 QUOTATION MARK characters ("),
U+0027 APOSTROPHE characters ('),
U+003D EQUALS SIGN characters (=),
U+003C LESS-THAN SIGN characters (<),
U+003E GREATER-THAN SIGN characters (>),
or U+0060 GRAVE ACCENT characters (`),
and must not be the empty string.

Then we have "Single-quoted attribute value syntax"

... in addition to the requirements given above for attribute values, must not contain
any literal U+0027 APOSTROPHE characters ('),
and finally followed by a second single U+0027 APOSTROPHE character (').

Because those rules are in addition to the above rules, we can't have a literal > in quoted attribute values. But that would also mean that we can't have spaces either, which is a load of old trousers because the examples show values with spaces in them. 🙃

johngodley · 2018-09-14T17:28:26Z

Confusing spec!

The base attribute value is:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.

The spec for double-quoted attribute values is:

... in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters (")

The in addition to the requirements given above I think refers to the base spec, not to the unquoted value spec:

notnownikki · 2018-09-14T17:39:59Z

This explains IE6.

dmsnell · 2018-09-14T18:56:02Z

to confirm what @johngodley wrote, the restriction on > is specific to the unquoted attribute value and does not imply from the spec that it holds in quoted attribute values

afercia · 2019-01-27T11:00:21Z

Just to note this was reported again on Trac, see https://core.trac.wordpress.org/ticket/46114. See also the previous Trac ticket https://core.trac.wordpress.org/ticket/45387

mdawaffe · 2019-03-14T01:34:02Z

I mentioned this at #9963 (comment), but I want to clarify it here as well:

The spec allows > in double-quoted attributes. It does not require them.

<span data-foo="1 < 2">OK</span>
<span data-foo="1 &lt; 2">OK</span>
<script>
console.log(
        Array.from( document.querySelectorAll( 'span' ) )
                .map( span => span.getAttribute( 'data-foo' ) )
); // ["1 < 2", "1 < 2"]
</script>

Both are valid HTML. Both have exactly the same meaning.

johngodley added [Type] Bug An existing feature does not function as intended [Package] HTML entities /packages/html-entities labels Sep 14, 2018

johngodley mentioned this issue Sep 17, 2018

Add greater than symbol to attribute escaping #9963

Merged

4 tasks

aduth mentioned this issue Dec 11, 2018

Generated aria labels for links in block elements can destroy the output #12683

Closed

ellatrix mentioned this issue Mar 7, 2019

When blank link text include to " >" that published page give raise to bug. #14300

Closed

youknowriad closed this as completed in #9963 Mar 18, 2019

westonruter mentioned this issue May 14, 2019

Characters for greater than and less than are not encoded as entities in attribute values #15636

Open

ShedFlu mentioned this issue Nov 28, 2023

Greater-than-sign in attribute values cause other special characters to be encoded as HTML entities inside block's render php #56597

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attribute escaping difference between Gutenberg and WP #9915

Attribute escaping difference between Gutenberg and WP #9915

johngodley commented Sep 14, 2018 •

edited

Loading

johngodley commented Sep 14, 2018

notnownikki commented Sep 14, 2018

dmsnell commented Sep 14, 2018

notnownikki commented Sep 14, 2018 •

edited

Loading

johngodley commented Sep 14, 2018

notnownikki commented Sep 14, 2018

dmsnell commented Sep 14, 2018

afercia commented Jan 27, 2019

mdawaffe commented Mar 14, 2019

Attribute escaping difference between Gutenberg and WP #9915

Attribute escaping difference between Gutenberg and WP #9915

Comments

johngodley commented Sep 14, 2018 • edited Loading

johngodley commented Sep 14, 2018

notnownikki commented Sep 14, 2018

dmsnell commented Sep 14, 2018

notnownikki commented Sep 14, 2018 • edited Loading

johngodley commented Sep 14, 2018

notnownikki commented Sep 14, 2018

dmsnell commented Sep 14, 2018

afercia commented Jan 27, 2019

mdawaffe commented Mar 14, 2019

johngodley commented Sep 14, 2018 •

edited

Loading

notnownikki commented Sep 14, 2018 •

edited

Loading