HTML API: Introduce WP_HTML::tag() for safely creating HTML. #5884

Draft
wants to merge 2 commits into
base: trunk
155 changes: 155 additions & 0 deletions src/wp-includes/html-api/class-wp-html-processor.php
@@ -1544,6 +1544,161 @@ private function insert_html_element( $token ) {
* HTML Specification Helpers
*/

/**
* Returns whether a given element is an HTML tag name.
*
* @todo Verify this list.
@sirreal (Member) commented on Feb 1, 2024:

I scraped the following list from MDN and w3.org. It looks like there may be a few more elements in this list, but I notice there are some missing as well. I can push a change to merge the lists.

HTML elements

A
ABBR
ACRONYM
ADDRESS
AREA
ARTICLE
ASIDE
AUDIO
B
BASE
BDI
BDO
BIG
BLOCKQUOTE
BODY
BR
BUTTON
CANVAS
CAPTION
CENTER
CITE
CODE
COL
COLGROUP
COMMAND
CONTENT
DATA
DATALIST
DD
DEL
DETAILS
DFN
DIALOG
DIR
DIV
DL
DT
EM
EMBED
FIELDSET
FIGCAPTION
FIGURE
FONT
FOOTER
FORM
FRAME
FRAMESET
H1
H2
H3
H4
H5
H6
HEAD
HEADER
HGROUP
HR
HTML
I
IFRAME
IMAGE
IMG
INPUT
INS
KBD
KEYGEN
LABEL
LEGEND
LI
LINK
MAIN
MAP
MARK
MARQUEE
MATH
MENU
MENUITEM
META
METER
NAV
NOBR
NOEMBED
NOFRAMES
NOSCRIPT
OBJECT
OL
OPTGROUP
OPTION
OUTPUT
P
PARAM
PICTURE
PLAINTEXT
PORTAL
PRE
PROGRESS
Q
RB
RP
RT
RTC
RUBY
S
SAMP
SCRIPT
SEARCH
SECTION
SELECT
SHADOW
SLOT
SMALL
SOURCE
SPAN
STRIKE
STRONG
STYLE
SUB
SUMMARY
SUP
SVG
TABLE
TBODY
TD
TEMPLATE
TEXTAREA
TFOOT
TH
THEAD
TIME
TITLE
TR
TRACK
TT
U
UL
VAR
VIDEO
WBR
XMP

Member Author commented:

On deeper inspection, I feel like this is the wrong approach. We should instead be able to determine whether we're in foreign content and apply the rules there accordingly. That way we don't have to keep a list of HTML elements updated, and we don't have to worry about conflating HTML elements with foreign elements of the same name, e.g. TITLE inside an SVG.

So I think more important now is getting foreign content detection in place. I've started working on this in #6006.

It may be the case that this relies on the HTML Processor because the rules get complicated with MathML and foreign content integration points.
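To illustrate the conflation, here is a minimal standalone sketch, not the PR's code. It assumes a hypothetical, shortened name list and a hypothetical namespace string that a parser would have tracked. A name-only check gives the same answer for TITLE in HEAD and TITLE inside SVG, because it never sees the context.

```php
<?php
// Hypothetical, shortened stand-in for a name-only lookup (not the PR's full list).
function sketch_is_html_tag_name( string $tag_name ): bool {
	static $names = array( 'DIV' => true, 'IMG' => true, 'TITLE' => true );
	return isset( $names[ strtoupper( $tag_name ) ] );
}

// Hypothetical context check: the caller supplies the namespace
// ('html', 'svg', or 'math') that a parser would have tracked.
function sketch_in_foreign_content( string $namespace ): bool {
	return 'html' !== $namespace;
}

// The name-only check cannot tell <head><title> from <svg><title>.
var_dump( sketch_is_html_tag_name( 'title' ) );   // bool(true)

// The context check needs no element list at all.
var_dump( sketch_in_foreign_content( 'html' ) );  // bool(false)
var_dump( sketch_in_foreign_content( 'svg' ) );   // bool(true)
```

The second function is only as good as the namespace it is given, which is why foreign content detection has to come from the parser.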

Member Author commented:

After exploring #6006, it's clear there's no easy way to infer the outside context when creating a tag in isolation. This leads me to think that we need a good way to communicate this to developers.

Maybe instead of passing 'self-closing' we could have people pass 'xml-empty-tag' or 'empty-tag-in-foreign-content'. I'd like to communicate that this isn't for IMG or any HTML tag.

The trickiest part I've come to realize is that we could have an SVG IMG tag as well, which could adopt the self-closing identity by nature of being inside foreign content, and that would be meaningful.

I don't see this being used often, so I feel comfortable making it a bit awkward. It seems incredibly unlikely that someone would intentionally create an empty XML element inside foreign content.
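A rough sketch of how the renamed argument might read at a call site. Everything here is hypothetical: the function, the constant, and the exact string are illustrations of the suggestion above, not part of the PR. The point is that the caller asserts the foreign-content context explicitly, since it can't be inferred from the tag name.

```php
<?php
// Hypothetical value taken from the naming suggestion above.
const SKETCH_FOREIGN_EMPTY = 'empty-tag-in-foreign-content';

// Hypothetical helper that only builds the opening tag.
function sketch_open_tag( string $tag_name, string $element_type = 'html' ): string {
	// Only an explicit opt-in produces the self-closing flag.
	return SKETCH_FOREIGN_EMPTY === $element_type
		? "<{$tag_name} />"
		: "<{$tag_name}>";
}

echo sketch_open_tag( 'img' ), "\n";                          // <img>
echo sketch_open_tag( 'circle', SKETCH_FOREIGN_EMPTY ), "\n"; // <circle />
```

The awkward name is deliberate in this sketch: it makes it hard to reach for the flag when generating ordinary HTML such as IMG.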

*
* @since 6.5.0
*
* @param string $tag_name Tag name to check.
* @return bool Whether the element is defined in the HTML specification.
*/
public static function is_html_tag( $tag_name ) {
$tag_name = strtoupper( $tag_name );

return (
'A' === $tag_name ||
'ABBR' === $tag_name ||
'ACRONYM' === $tag_name || // Neutralized.
'ADDRESS' === $tag_name ||
'APPLET' === $tag_name || // Deprecated.
'AREA' === $tag_name ||
'ARTICLE' === $tag_name ||
'ASIDE' === $tag_name ||
'AUDIO' === $tag_name ||
'B' === $tag_name ||
'BASE' === $tag_name ||
'BDI' === $tag_name ||
'BDO' === $tag_name ||
'BGSOUND' === $tag_name || // Deprecated; self-closing if self-closing flag provided, otherwise normal.
'BIG' === $tag_name ||
'BLINK' === $tag_name || // Deprecated.
'BODY' === $tag_name ||
'BR' === $tag_name ||
'BUTTON' === $tag_name ||
'CANVAS' === $tag_name ||
'CAPTION' === $tag_name ||
'CENTER' === $tag_name || // Neutralized.
'CITE' === $tag_name ||
'CODE' === $tag_name ||
'COL' === $tag_name ||
'COLGROUP' === $tag_name ||
'DATA' === $tag_name ||
'DATALIST' === $tag_name ||
'DD' === $tag_name ||
'DEL' === $tag_name ||
'DETAILS' === $tag_name ||
'DFN' === $tag_name ||
'DIALOG' === $tag_name ||
'DIR' === $tag_name ||
'DIV' === $tag_name ||
'DL' === $tag_name ||
'DT' === $tag_name ||
'EM' === $tag_name ||
'EMBED' === $tag_name ||
'FIELDSET' === $tag_name ||
'FIGCAPTION' === $tag_name ||
'FIGURE' === $tag_name ||
'FONT' === $tag_name ||
'FOOTER' === $tag_name ||
'FORM' === $tag_name ||
'FRAME' === $tag_name ||
'FRAMESET' === $tag_name ||
'H1' === $tag_name ||
'H2' === $tag_name ||
'H3' === $tag_name ||
'H4' === $tag_name ||
'H5' === $tag_name ||
'H6' === $tag_name ||
'HEAD' === $tag_name ||
'HEADER' === $tag_name ||
'HGROUP' === $tag_name ||
'HR' === $tag_name ||
'HTML' === $tag_name ||
'I' === $tag_name ||
'IFRAME' === $tag_name ||
'IMG' === $tag_name ||
'INPUT' === $tag_name ||
'INS' === $tag_name ||
'ISINDEX' === $tag_name || // Deprecated.
'KBD' === $tag_name ||
'KEYGEN' === $tag_name || // Deprecated; void.
'LABEL' === $tag_name ||
'LEGEND' === $tag_name ||
'LI' === $tag_name ||
'LINK' === $tag_name ||
'LISTING' === $tag_name || // Deprecated, use PRE instead.
'MAIN' === $tag_name ||
'MAP' === $tag_name ||
'MARK' === $tag_name ||
'MARQUEE' === $tag_name || // Deprecated.
'MATH' === $tag_name ||
'MENU' === $tag_name ||
'META' === $tag_name ||
'METER' === $tag_name ||
'MULTICOL' === $tag_name || // Deprecated.
'NAV' === $tag_name ||
'NEXTID' === $tag_name || // Deprecated.
'NOBR' === $tag_name || // Neutralized.
'NOEMBED' === $tag_name || // Neutralized.
'NOFRAMES' === $tag_name || // Neutralized.
'NOSCRIPT' === $tag_name ||
'OBJECT' === $tag_name ||
'OL' === $tag_name ||
'OPTGROUP' === $tag_name ||
'OPTION' === $tag_name ||
'OUTPUT' === $tag_name ||
'P' === $tag_name ||
'PICTURE' === $tag_name ||
'PLAINTEXT' === $tag_name || // Neutralized.
'PRE' === $tag_name ||
'PROGRESS' === $tag_name ||
'Q' === $tag_name ||
'RB' === $tag_name || // Neutralized.
'RP' === $tag_name ||
'RT' === $tag_name ||
'RTC' === $tag_name || // Neutralized.
'RUBY' === $tag_name ||
'SAMP' === $tag_name ||
'SCRIPT' === $tag_name ||
'SEARCH' === $tag_name ||
'SECTION' === $tag_name ||
'SELECT' === $tag_name ||
'SLOT' === $tag_name ||
'SMALL' === $tag_name ||
'SOURCE' === $tag_name ||
'SPACER' === $tag_name || // Deprecated.
'SPAN' === $tag_name ||
'STRIKE' === $tag_name ||
'STRONG' === $tag_name ||
'STYLE' === $tag_name ||
'SUB' === $tag_name ||
'SUMMARY' === $tag_name ||
'SUP' === $tag_name ||
'SVG' === $tag_name ||
'TABLE' === $tag_name ||
'TBODY' === $tag_name ||
'TD' === $tag_name ||
'TEMPLATE' === $tag_name ||
'TEXTAREA' === $tag_name ||
'TFOOT' === $tag_name ||
'TH' === $tag_name ||
'THEAD' === $tag_name ||
'TIME' === $tag_name ||
'TITLE' === $tag_name ||
'TR' === $tag_name ||
'TRACK' === $tag_name ||
'TT' === $tag_name ||
'U' === $tag_name ||
'UL' === $tag_name ||
'VAR' === $tag_name ||
'VIDEO' === $tag_name ||
'WBR' === $tag_name ||
'XMP' === $tag_name // Deprecated, use PRE instead.
);
}

/**
* Returns whether an element of a given name is in the HTML special category.
*
9 changes: 5 additions & 4 deletions src/wp-includes/html-api/class-wp-html-tag-processor.php
@@ -2286,15 +2286,16 @@ public function is_tag_closer() {
*
* For boolean attributes special handling is provided:
* - When `true` is passed as the value, then only the attribute name is added to the tag.
- * - When `false` is passed, the attribute gets removed if it existed before.
+ * - When `false` or `null` is passed, the attribute gets removed if it existed before.
Member commented:

This change seems worth landing on its own.
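For reference, the updated rule can be modeled in isolation. This is a simplified sketch under the assumption of a hypothetical array-based helper; the real method edits an HTML string, not an array.

```php
<?php
/**
 * Hypothetical array-based model of the attribute update rule:
 * `null` and `false` both remove the attribute, anything else sets it.
 */
function sketch_set_attribute( array $attributes, string $name, $value ): array {
	// Attribute names are compared case-insensitively in HTML.
	$name = strtolower( $name );

	if ( null === $value || false === $value ) {
		unset( $attributes[ $name ] );
		return $attributes;
	}

	$attributes[ $name ] = $value;
	return $attributes;
}

$attrs = array( 'disabled' => true, 'class' => 'is-safe' );
$attrs = sketch_set_attribute( $attrs, 'disabled', null );
$attrs = sketch_set_attribute( $attrs, 'CLASS', false );
var_dump( $attrs ); // Empty array: both attributes were removed.
```

Treating `null` like `false` lets callers pass through optional values without branching first.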

*
* For string attributes, the value is escaped using the `esc_attr` function.
*
* @since 6.2.0
* @since 6.2.1 Fix: Only create a single update for multiple calls with case-variant attribute names.
+ * @since 6.5.0 Allows passing `null` to remove attribute.
*
- * @param string $name The attribute name to target.
- * @param string|bool $value The new attribute value.
+ * @param string $name The attribute name to target.
+ * @param string|bool|null $value The new attribute value.
* @return bool Whether an attribute value was set.
*/
public function set_attribute( $name, $value ) {
@@ -2354,7 +2355,7 @@ public function set_attribute( $name, $value ) {
* > To represent a false value, the attribute has to be omitted altogether.
* - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes
*/
- if ( false === $value ) {
+ if ( null === $value || false === $value ) {
return $this->remove_attribute( $name );
}

154 changes: 154 additions & 0 deletions src/wp-includes/html-api/class-wp-html.php
@@ -0,0 +1,154 @@
<?php
/**
* HTML API: WP_HTML class
*
* Provides a public interface for HTML-related functionality in WordPress.
*
* @package WordPress
* @subpackage HTML-API
* @since 6.5.0
*/
Member commented on lines +2 to +10:

I've used WP_HTML_Processor::is_void_tag a few times and it feels like it should live on a utility class like this.

Member Author commented:

It started in a utility class like this! But then we didn't include the utility class for pragmatic reasons.


/**
* WP_HTML class.
*
* @since 6.5.0
*/
class WP_HTML {
/**
* Generates HTML for a given tag and attribute set.
*
* Although this doesn't currently support nesting HTML tags inside
* the generated tag, it may do so in the future. When that happens
* the `$inner_text` parameter will transform into `$inner_content`
* and allow passing an array of strings and other tags to nest.
*
* Example:
*
* echo WP_HTML::tag( 'div', array( 'class' => 'is-safe' ), 'Hello, world!' );
* // <div class="is-safe">Hello, world!</div>
*
* echo WP_HTML::tag( 'input', array( 'type' => '"></script>', 'disabled' => true ), 'Is this > that?' );
* // <input type="&quot;&gt;&lt;/script&gt;" disabled>
*
* echo WP_HTML::tag( 'p', null, 'Is this > that?' );
* // <p>Is this &gt; that?</p>
*
* echo WP_HTML::tag( 'wp-emoji', array( 'name' => ':smile:' ), null, 'self-closing' );
* // <wp-emoji name=":smile:" />
*
* @since 6.5.0
*
* @param string $tag_name Name of tag to create.
* @param ?array $attributes Key/value pairs of attribute names and their values.
* Values may be boolean, null, or a string.
* @param ?string $inner_text Will always be escaped to preserve the given string in the rendered page.
* @param ?string $element_type 'self-closing' to self-close the generated HTML for a custom-element.
* This only generates the self-closing flag for non-HTML tags, as HTML
* itself contains no self-closing tags.
* @return string|null Generated HTML for the tag if provided valid inputs, otherwise null.
*/
public static function tag( $tag_name, $attributes = null, $inner_text = null, $element_type = 'html' ) {
if (
! is_string( $tag_name ) ||
( null !== $attributes && ! is_array( $attributes ) ) ||
( null !== $inner_text && ! is_string( $inner_text ) )
) {
return null;
}

// Validate tag name.
if ( 0 === strlen( $tag_name ) ) {
return null;
}

// Compare the first byte against [a-zA-Z].
$tag_initial = ord( $tag_name[0] );
if (
// Before A or after Z.
( $tag_initial < 65 || $tag_initial > 90 ) &&

// Before a or after z.
( $tag_initial < 97 || $tag_initial > 122 )
) {
return null;
}
if ( strlen( $tag_name ) !== strcspn( $tag_name, " \t\f\r\n/>" ) ) {
return null;
}

$is_void = WP_HTML_Processor::is_void( $tag_name );
$self_closes = (
! $is_void &&
'self-closing' === $element_type &&
! WP_HTML_Processor::is_html_tag( $tag_name )
);

/*
* This is unexpected with the closing tag, but it's required
* for special tags with modifiable text, such as TEXTAREA.
*/
$source_html = $self_closes ? "<{$tag_name}/></{$tag_name}>" : "<{$tag_name}></{$tag_name}>";

$processor = new WP_HTML_Tag_Processor( $source_html );
$processor->next_tag();

if ( null !== $attributes ) {
foreach ( $attributes as $name => $value ) {
$processor->set_attribute( $name, $value );
}
}

/*
* Strip off expected closing tag; it will be appropriately
* re-added if necessary after appending the inner text.
*/
$html = substr( $processor->get_updated_html(), 0, -strlen( "</{$tag_name}>" ) );

if ( $is_void || $self_closes ) {
return $html;
}

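// Compare explicitly: a bare truthiness check would drop the valid inner text "0".
if ( null !== $inner_text && '' !== $inner_text ) {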
$big_tag_name = strtoupper( $tag_name );

/*
* Since HTML PRE and TEXTAREA elements strip a leading newline, if
* their inner content contains a leading newline, then they _need_
* to begin with a leading newline before the inner text so that it
* doesn't confuse the syntax for the content.
*/
if (
( 'PRE' === $big_tag_name || 'TEXTAREA' === $big_tag_name ) &&
"\n" === $inner_text[0]
) {
$html .= "\n";
}

switch ( $big_tag_name ) {
case 'SCRIPT':
case 'STYLE':
/*
* Over-zealously prevent escaping from SCRIPT and STYLE tags.
* It would be more complete to run the Tag Processor and look
* for the appropriate closers, but that requires parsing the
* contents which could add unexpected cost. This simplification
* will reject some rare and valid SCRIPT and STYLE text contents,
* but will never allow invalid ones.
*/
if ( false !== stripos( $inner_text, "</{$big_tag_name}" ) ) {
return null;
}
$html .= $inner_text;
break;

default:
$html .= esc_html( $inner_text );
}
}

$html .= "</{$tag_name}>";

return $html;
}
}
1 change: 1 addition & 0 deletions src/wp-settings.php
@@ -245,6 +245,7 @@
require ABSPATH . WPINC . '/html-api/class-wp-html-token.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-processor-state.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-processor.php';
require ABSPATH . WPINC . '/html-api/class-wp-html.php';
require ABSPATH . WPINC . '/class-wp-http.php';
require ABSPATH . WPINC . '/class-wp-http-streams.php';
require ABSPATH . WPINC . '/class-wp-http-curl.php';