WIP: HTML API: Extract previous text and HTML chunks while processing. #5208

dmsnell · 2023-09-14T06:51:48Z

Notes

The motivation for this was to create something like strip_tags() or wp_strip_all_tags() that can be limited to a given length, and that can also create an excerpt of a document which preserves the HTML up to a given text length.

Consider you want to create a post excerpt with up to 200 words.

$excerpt_text = '';
$excerpt      = '';
while ( $processor->next_tag() ) {
	$text = $processor->get_prev_text_chunk();
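	// Count words in the plain text gathered so far (not in the HTML), using string concatenation.
	if ( word_count( $excerpt_text . $text ) > $excerpt_length ) {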
		break;
	}

	$excerpt_text .= $text;
	// This will be better, but for now the array is necessary, ignore it.
	list( $html_stuff, $text_stuff ) = $processor->get_prev_html_chunk();
	$excerpt .= $html_stuff . $text_stuff;
}

return $excerpt;

Now this is neato because we've only parsed the input HTML up until the point where the .textContent/.innerText contains 200 words. In addition to that, we've preserved the HTML formatting that appeared up to and around those 200 words.
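The loop above calls word_count(), which is neither a PHP built-in nor defined in this description. A minimal sketch of such a helper, assuming "word" simply means a run of non-whitespace characters (the name and the definition are assumptions made for illustration, not part of the patch):

```php
<?php
/**
 * Hypothetical helper: counts whitespace-delimited words in plain text.
 * Assumes the input is valid UTF-8.
 */
function word_count( string $text ): int {
	// Split on runs of whitespace and drop empty pieces.
	$words = preg_split( '/\s+/u', trim( $text ), -1, PREG_SPLIT_NO_EMPTY );

	// preg_split() returns false on failure, e.g. for invalid UTF-8.
	return false === $words ? 0 : count( $words );
}

$count = word_count( 'this is a 😢 shame' ); // 5
```

Splitting on whitespace counts a standalone emoji as a word, which matches the five-word example in this description. PHP's built-in str_word_count() works from a letter-based definition of a word and would likely not count it.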

$excerpt = get_excerpt( '<div>this is a <em>&#x1F622; shame</em> to be missing</div>', 5, 'words' );
$excerpt === '<div>this is a <em>&#x1F622; shame</em>';

Furthermore, if all we want is a plaintext form of the document, as strip_tags() would give us, we get that too.

$processor = new WP_HTML_Tag_Processor( '<div>this is a <em>&#x1F622; shame</em> to be missing</div>' );

$text = '';
while ( $processor->next_tag( $visit_everything ) ) {
	$text .= $processor->get_prev_text_chunk();
}
$text .= $processor->get_prev_text_chunk();

$text === 'this is a 😢 shame to be missing';

This explores the low-level primitives necessary to make this possible.

Description

The HTML API should be able to provide the ability to generate excerpts from HTML documents given a specific maximum length.

In this patch we're exploring the addition of text and HTML chunks that can be extracted while processing in order to do just this. The text chunks are similar to .textContent on the DOM while the HTML chunks contain raw and unprocessed HTML.

These functions should likely remain low-level in the Tag Processor and be exposed from the HTML Processor to ensure that proper semantics are heeded when extracting this information, such as how PRE tags ignore a leading newline inside their content or how SCRIPT and STYLE content isn't part of what we want with something like strip_tags().
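To illustrate the SCRIPT and STYLE point: PHP's strip_tags() removes the tags but keeps the text between them, so script source leaks into the output. The sketch below removes those elements with a regular expression first, loosely modeled on the approach wp_strip_all_tags() takes. The function name is hypothetical, and this is a stand-in for comparison, not the code in this patch.

```php
<?php
/**
 * Hypothetical stand-in: drop SCRIPT and STYLE elements, then strip remaining tags.
 * Regex-based and therefore approximate; shown only to illustrate the semantics.
 */
function sketch_strip_all_tags( string $html ): string {
	// Remove the elements together with their contents.
	$without_raw_text = preg_replace( '@<(script|style)[^>]*?>.*?</\1>@si', '', $html );

	// preg_replace() returns null on error; fall back to the input in that case.
	return trim( strip_tags( $without_raw_text ?? $html ) );
}

$html = '<p>Hi</p><script>alert(1)</script>';

$naive   = strip_tags( $html );            // 'Hialert(1)'
$cleaned = sketch_strip_all_tags( $html ); // 'Hi'
```

This takes two passes over the document and relies on a regular expression to find element boundaries, which is the kind of work a token-aware processor could fold into a single pass.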

In the process of this work it's evident again that the Tag Processor ought to expose the ability to visit every token and non-tag tokens should be classified. This has already been explored in dmsnell#7.

cc: @westonruter @ockham

westonruter · 2023-09-14T21:12:48Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+	public function get_previous_text_chunk() {
+		if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
+			$chunk = substr( $this->html, $this->last_position->end === 0 ? 0 : $this->last_position->end + 1 );
+			$chunk = preg_replace( '/<[^a-z].*>/i', '', $chunk );


What is the replacement doing here?

dmsnell · 2023-09-14

one thing this raises is the need to stop at all tokens. currently, inside next_tag() we skip over all non-tag content, including comments and DOCTYPE and CDATA sections. the result is that we don't currently have a way to remove comments from the plain text. for now this was a cheap trick to eliminate comment matter, because at the point we have the text chunk, we have no HTML tags, meaning that almost anything that's still surrounded by < and > is a form of a comment.

this is not meant to stay.

I added code to visit comments, similar to what I did in dmsnell#7, but I think it was revealing that we need to reorient some of the internal code to think more about tokens than tags, something I believe we can do just fine without changing any external interface.

I'll remove these so that the failing tests reappear.
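For reference, a small standalone sketch of what the replacement discussed above does to a text chunk, and one reason it could only be a stopgap. The wrapper function name is hypothetical; the pattern is the one from the diff.

```php
<?php
/**
 * Hypothetical wrapper around the pattern from the diff: removes anything that
 * starts with "<" followed by a non-letter and runs to the last ">" on the line.
 */
function strip_comment_like_spans( string $chunk ): string {
	// Fall back to the input if preg_replace() reports an error (null).
	return preg_replace( '/<[^a-z].*>/i', '', $chunk ) ?? $chunk;
}

// A single comment is removed as intended.
$single = strip_comment_like_spans( 'before<!-- note -->after' ); // 'beforeafter'

// With two comments on one line, the greedy ".*" also swallows the text between them.
$double = strip_comment_like_spans( 'a <!-- x --> b <!-- y --> c' ); // 'a  c'
```

Because `.*` is greedy, the match extends to the last `>` on the line, so the ` b ` between the two comments is lost. Visiting comments as tokens avoids this kind of guesswork.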

dmsnell · 2023-09-18T23:20:15Z

@azaozz this is an interesting one because I think it could be one of those cases where we can go through a string once, replacing the multiple passes made by the existing strip_tags()/wp_strip_all_tags() and by functions that generate excerpts or truncate HTML content.

certainly on WordPress.com we have some truncation functions that end up making multiple passes through an entire document just to grab the first 50 words or 200 characters or such. in Core maybe not so much directly, but we have a number of filters for the post excerpt that look like they could be collapsed into one. upon deeper inspection, I bet we have a number of places where we strip tags in a first pass and then truncate in a second, whereas we could do this in a single straight pass.

it's not scheduled for merge yet, which is why it's still a draft. mostly I'm trying to explore existing places we can start to gather data that aren't as comprehensive and risky as replacing kses etc…
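
The strip-then-truncate pattern mentioned above can be sketched in plain PHP as follows. The function name is hypothetical and this is not code from Core or WordPress.com. It only shows that both steps walk the whole input even when only the first few words are kept.

```php
<?php
/**
 * Hypothetical two-pass excerpt: strip tags from the whole document,
 * then split the whole text into words and keep the first $max_words.
 */
function two_pass_text_excerpt( string $html, int $max_words ): string {
	// Pass 1: remove tags and decode character references across the entire input.
	$text = html_entity_decode( strip_tags( $html ), ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	// Pass 2: split the entire plain text on whitespace, then truncate.
	$words = preg_split( '/\s+/u', trim( $text ), -1, PREG_SPLIT_NO_EMPTY );
	if ( false === $words ) {
		return '';
	}

	return implode( ' ', array_slice( $words, 0, $max_words ) );
}

$excerpt = two_pass_text_excerpt(
	'<div>this is a <em>&#x1F622; shame</em> to be missing</div>',
	5
); // 'this is a 😢 shame'
```

A processor that yields text chunks as it goes could stop as soon as the word budget is reached, and could keep the surrounding HTML as well, which this plain-text version discards.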


westonruter reviewed Sep 14, 2023

dmsnell added 2 commits September 18, 2023 16:21

Primitive max-word-count HTML excerpt.

340418b

dmsnell force-pushed the html-api/explore-extracting-text-and-html-chunks branch from e3d35cf to 42bf6e6 Compare September 18, 2023 23:21

Remove quick workaround for removing comments.

e4d7d06

dmsnell force-pushed the html-api/explore-extracting-text-and-html-chunks branch from 42bf6e6 to e4d7d06 Compare September 18, 2023 23:22

dmsnell mentioned this pull request Apr 2, 2024

HTML API: Roadmap WordPress/gutenberg#60397

Open

10 tasks
