Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attribute escaping difference between Gutenberg and WP #9915

Closed
johngodley opened this issue Sep 14, 2018 · 9 comments · Fixed by #9963
Closed

Attribute escaping difference between Gutenberg and WP #9915

johngodley opened this issue Sep 14, 2018 · 9 comments · Fixed by #9963
Labels
[Package] HTML entities /packages/html-entities [Type] Bug An existing feature does not function as intended

Comments

@johngodley
Copy link
Contributor

johngodley commented Sep 14, 2018

Describe the bug
There appears to be a difference between how Gutenberg (specifically React) encodes HTML attributes, and how WordPress does, and this can lead to some mangling.

React encodes attributes according to the HTML spec, and > is allowed. This means test > thing is stored as-is in an attribute.

WordPress has other ideas, though. Here's an image created by Gutenberg. Note the alt tag has a value of test > thing (I've added returns in examples to make them look better):

<div class="wp-block-image">
<figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing" class="wp-image-10"/>
</figure>
</div>

WordPress renders this as:

<div class="wp-block-image">
<figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing&#8221; class=&#8221;wp-image-10&#8243;/></figure></div>

Note how the terminating quote in the alt attribute has been smart-quoted, and then everything else falls to pieces.

The culprit is wptexturize, and specifically the _get_wptexturize_split_regex regex which 'parses' the HTML using regex. It sees the > in the attribute, thinks it's an HTML tag, and the fun begins.

To test:

wp shell
wptexturize('<div class="wp-block-image"><figure class="aligncenter"><img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing" class="wp-image-10"/></figure></div>’)

=>
string(197) "<div class="wp-block-image"><figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing&#8221; class=&#8221;wp-image-10&#8243;/></figure></div>"

Going further you can see the > causes the regex to split (array position 2 and 3):

wp shell
preg_split( _get_wptexturize_split_regex( '' ), '<div class="wp-block-image"><figure class="aligncenter"><img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test > thing" class="wp-image-10"/></figure></div>', -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY )

=>
array(6) {
  [0] =>
  string(28) "<div class="wp-block-image">"
  [1] =>
  string(28) "<figure class="aligncenter">"
  [2] =>
  string(79) "<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test >"
  [3] =>
  string(29) " thing" class="wp-image-10"/>"
  [4] =>
  string(9) "</figure>"
  [5] =>
  string(6) "</div>"
}

Changing wptexturize seems a huge headache. Maybe the best action is for Gutenberg to encode all attributes in a way compatible with _get_wptexturize_split_regex, which probably means converting the > to &gt; (or just disabling wptexturize on every blog...):

wp shell
wptexturize('<div class="wp-block-image"><figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test &gt; thing" class="wp-image-10"/></figure></div>')
=>
string(182) "<div class="wp-block-image"><figure class="aligncenter">
<img src="http://latest.local/wp-content/uploads/2018/06/kirby.jpg" alt="test &gt; thing" class="wp-image-10"/></figure></div>"

Note the above problem affects custom class names, and probably other blocks where data is directly output to an attribute.

This matches the problem described in #8779

@johngodley johngodley added [Type] Bug An existing feature does not function as intended [Package] HTML entities /packages/html-entities labels Sep 14, 2018
@johngodley
Copy link
Contributor Author

Adding to this, if you use the classic editor to insert an image with an alt tag it is added with the encoded >:

<img class="alignnone wp-image-175 size-medium" src="http://latest.local/wp-content/uploads/2018/09/murray-12-240x300.jpg" alt="test &gt; this" width="240" height="300" />

@notnownikki
Copy link
Member

Related: #9908 #5897

@dmsnell
Copy link
Member

dmsnell commented Sep 14, 2018

@notnownikki
Copy link
Member

notnownikki commented Sep 14, 2018

So reading through the spec, it seems to me like > cannot appear unescaped in attribute values.

https://www.w3.org/TR/html5/syntax.html#unquoted :

must not contain any literal space characters,
any U+0022 QUOTATION MARK characters ("),
U+0027 APOSTROPHE characters ('),
U+003D EQUALS SIGN characters (=),
U+003C LESS-THAN SIGN characters (<),
U+003E GREATER-THAN SIGN characters (>),
or U+0060 GRAVE ACCENT characters (`),
and must not be the empty string.

Then we have "Single-quoted attribute value syntax"

... in addition to the requirements given above for attribute values, must not contain
any literal U+0027 APOSTROPHE characters ('),
and finally followed by a second single U+0027 APOSTROPHE character (').

Because those rules are in addition to the above rules, we can't have a literal > in quoted attribute values. But that would also mean that we can't have spaces either, which is a load of old trousers because the examples show values with spaces in them. 🙃

@johngodley
Copy link
Contributor Author

Confusing spec!

The base attribute value is:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.

The spec for double-quoted attribute values is:

... in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters (")

The in addition to the requirements given above I think refers to the base spec, not to the unquoted value spec:

@notnownikki
Copy link
Member

This explains IE6.

@dmsnell
Copy link
Member

dmsnell commented Sep 14, 2018

to confirm what @johngodley wrote, the restriction on > is specific to the unquoted attribute value and does not imply from the spec that it holds in quoted attribute values

@afercia
Copy link
Contributor

afercia commented Jan 27, 2019

Just to note this was reported again on Trac, see https://core.trac.wordpress.org/ticket/46114. See also the previous Trac ticket https://core.trac.wordpress.org/ticket/45387

@mdawaffe
Copy link
Contributor

I mentioned this at #9963 (comment), but I want to clarify it here as well:

The spec allows > in double-quoted attributes. It does not require them.

<span data-foo="1 < 2">OK</span>
<span data-foo="1 &lt; 2">OK</span>
<script>
console.log(
        Array.from( document.querySelectorAll( 'span' ) )
                .map( span => span.getAttribute( 'data-foo' ) )
); // ["1 < 2", "1 < 2"]
</script>

Both are valid HTML. Both have exactly the same meaning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Package] HTML entities /packages/html-entities [Type] Bug An existing feature does not function as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants