WIP: HTML API: Extract previous text and HTML chunks while processing. #5208

Draft
wants to merge 3 commits into
base: trunk

dmsnell
Member

@dmsnell dmsnell commented Sep 14, 2023

Notes

The motivation for this was to create something like strip_tags() or wp_strip_all_tags() that can be limited to a given length, and that can also be used to create an excerpt of a document which preserves the HTML up to a given text length.

Suppose you want to create a post excerpt of up to 200 words.

$excerpt_text = '';
$excerpt      = '';
while ( $processor->next_tag() ) {
	$text = $processor->get_prev_text_chunk();
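	// Count against the plaintext gathered so far; `.` concatenates strings in PHP, `+` would add them numerically.
	if ( word_count( $excerpt_text . $text ) > $excerpt_length ) {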
		break;
	}

	$excerpt_text .= $text;
	// This will be better, but for now the array is necessary, ignore it.
	list( $html_stuff, $text_stuff ) = $processor->get_prev_html_chunk();
	$excerpt .= $html_stuff . $text_stuff;
}

return $excerpt;

Now this is neato because we've only parsed the input HTML up until the point where the .textContent/.innerText contains 200 words. In addition to that, we've preserved the HTML formatting that appeared up to and around those 200 words.

$excerpt = get_excerpt( '<div>this  is a <em>&#x1F622; shame</em> to be missing</div>', 5, 'words' );
$excerpt === '<div>this is a <em>&#x1F622; shame</em>';
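
To make the early-exit idea concrete, here is a minimal, self-contained sketch in plain PHP. It does not use the Tag Processor or the chunk methods proposed in this patch. `sketch_get_excerpt()` and `sketch_count_words()` are hypothetical names, words are assumed to be whitespace-separated runs, and the tag scanning is deliberately naive (it assumes no `>` inside attribute values or comments).

```php
<?php
// Hypothetical helpers for illustration only; not part of WordPress or the HTML API.
function sketch_count_words( string $text ): int {
	// Assumption: a "word" is any run of non-whitespace characters.
	return count( preg_split( '/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY ) );
}

function sketch_get_excerpt( string $html, int $max_words ): string {
	$excerpt = '';
	$words   = 0;
	$at      = 0;
	$end     = strlen( $html );

	while ( $at < $end ) {
		// The text chunk runs from the current offset up to the next "<".
		$tag_at = strpos( $html, '<', $at );
		if ( false === $tag_at ) {
			$tag_at = $end;
		}

		$text   = substr( $html, $at, $tag_at - $at );
		$words += sketch_count_words( $text );
		if ( $words > $max_words ) {
			break; // Stop here: the rest of the document is never scanned.
		}
		$excerpt .= $text;

		if ( $tag_at === $end ) {
			break;
		}

		// Naive tag boundary: the next ">" ends the tag.
		$tag_end = strpos( $html, '>', $tag_at );
		if ( false === $tag_end ) {
			break;
		}
		$excerpt .= substr( $html, $tag_at, $tag_end - $tag_at + 1 );
		$at       = $tag_end + 1;
	}

	return $excerpt;
}

$html   = '<div>this  is a <em>&#x1F622; shame</em> to be missing</div>';
$result = sketch_get_excerpt( $html, 5 );
```

Like the loop above, this drops a whole text chunk once it would push the count past the limit, and it copies the raw HTML as-is, so whitespace is preserved and the opening `<div>` is left unclosed. Balancing tags is the kind of semantic concern that would belong in the HTML Processor.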

Furthermore if all we want to do is run strip_tags() and get a plaintext form of the document we get that too.

$processor = new WP_HTML_Tag_Processor( '<div>this is a <em>&#x1F622; shame</em> to be missing</div>' );

$text = '';
while ( $processor->next_tag( $visit_everything ) ) {
	$text .= $processor->get_prev_text_chunk();
}
$text .= $processor->get_prev_text_chunk();

$text === 'this is a 😢 shame to be missing';
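
For comparison, the same plaintext can be produced today with built-in PHP functions, at the cost of two full passes over the input: one to remove tags and one to decode character references. This is only a baseline sketch, not what the patch does, and the sample input assumes the character reference is written as `&#x1F622;`.

```php
<?php
$html = '<div>this is a <em>&#x1F622; shame</em> to be missing</div>';

// Pass 1: strip_tags() walks the whole string and removes the tags.
$without_tags = strip_tags( $html );

// Pass 2: html_entity_decode() walks the result again to decode character references.
$text = html_entity_decode( $without_tags, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
```

The text chunks explored here aim to do both steps while the processor is already scanning the document.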

This explores the low-level primitives necessary to make this possible.

Description

The HTML API should be able to provide the ability to generate excerpts from HTML documents given a specific maximum length.

In this patch we're exploring the addition of text and HTML chunks that can be extracted while processing in order to do just this. The text chunks are similar to .textContent on the DOM while the HTML chunks contain raw and unprocessed HTML.

These functions should likely remain low-level in the Tag Processor and be exposed from the HTML Processor to ensure that proper semantics are heeded when extracting this information, such as how PRE tags ignore a leading newline inside their content or how SCRIPT and STYLE content isn't part of what we want with something like strip_tags().
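
A short sketch of why those semantics matter, using only built-in PHP functions. The regular expression mirrors the general approach of dropping SCRIPT and STYLE elements before stripping tags; it is an illustration under that assumption, not the code in this patch.

```php
<?php
$html = '<style>p { color: red; }</style><p>Hello</p><script>alert( 1 );</script>';

// strip_tags() removes the tags but keeps whatever sat between them,
// so stylesheet and script source leak into the "text".
$naive = strip_tags( $html );

// Removing SCRIPT and STYLE elements first keeps only the readable content.
$clean = strip_tags( preg_replace( '@<(script|style)[^>]*?>.*?</\1>@si', '', $html ) );

// A browser's parser drops the newline right after <pre>; strip_tags() keeps it.
$pre = strip_tags( "<pre>\ncode</pre>" );
```

Here `$naive` still contains the CSS and JavaScript source, `$clean` is just `Hello`, and `$pre` keeps the leading newline that the DOM's `.textContent` would not include.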

In the process of this work it's evident again that the Tag Processor ought to expose the ability to visit every token and non-tag tokens should be classified. This has already been explored in dmsnell#7.

cc: @westonruter @ockham

public function get_previous_text_chunk() {
	if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
		$chunk = substr( $this->html, $this->last_position->end === 0 ? 0 : $this->last_position->end + 1 );
		$chunk = preg_replace( '/<[^a-z].*>/i', '', $chunk );
Member

What is the replacement doing here?

Member Author

one thing this raises is the need to stop at all tokens. currently, inside next_tag() we skip over all non-tag content, including comments and DOCTYPE and CDATA sections. the result is that we don't currently have a way to remove comments from the plain text. for now this was a cheap trick to eliminate comment matter, because at the point we have the text chunk, we have no HTML tags, meaning that almost anything that's still surrounded by < and > is a form of a comment.

this is not meant to stay.

I added code to visit comments, similar to what I did in dmsnell#7, but I think it was revealing that we need to reorient some of the internal code to think more about tokens than tags, something I believe we can do just fine without changing any external interface.

I'll remove these so that the failing tests reappear.
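
A quick sketch of how the pattern from the diff behaves on plain strings, which may help show why it only works as a stopgap. The sample inputs are made up for illustration, and the narrower pattern at the end is an assumption about what a comment-only match could look like; it still does not cover every comment form the HTML specification allows.

```php
<?php
$pattern = '/<[^a-z].*>/i';

// One comment on the line: removed as intended.
$one = preg_replace( $pattern, '', 'before <!-- note --> after' );

// Two comments on one line: the greedy .* runs to the last ">",
// so the text between the comments is removed as well.
$two = preg_replace( $pattern, '', 'a <!-- x --> kept? <!-- y --> b' );

// A non-greedy pattern limited to "<!--" ... "-->" keeps the text in between.
$narrow = preg_replace( '/<!--.*?-->/s', '', 'a <!-- x --> kept? <!-- y --> b' );
```

With these inputs `$one` is `'before  after'`, `$two` is `'a  b'`, and `$narrow` is `'a  kept?  b'`. Visiting comments as tokens avoids the guesswork entirely.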

Member Author

dmsnell commented Sep 18, 2023

@azaozz this is an interesting one because I think it could be one of those cases where we can go once through a string and bypass multiple passes with the existing strip_tags()/wp_strip_all_tags() and functions that generate excerpts or truncate HTML content.

certainly on WordPress.com we have some truncation functions that end up making multiple passes through an entire document just to grab the first 50 words or 200 characters or such, and in Core maybe not so much directly, but we have a number of filters for the post excerpt that look like they could be collapsed into one, and probably upon deeper inspection, I bet we have a number of places where we do things like strip tags in a first pass, then truncate in a second, whereas we could do this in a single straight pass.
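
As a rough picture of the multi-pass pattern described above, here is a hypothetical helper in plain PHP (the name `sketch_two_pass_excerpt()` is made up and this is not code from Core or WordPress.com). It strips tags from the entire document, then splits the entire result into words, and only then discards everything past the limit.

```php
<?php
// Hypothetical helper for illustration only.
function sketch_two_pass_excerpt( string $html, int $max_words ): string {
	// Pass 1: walk the whole document to remove the tags.
	$text = strip_tags( $html );

	// Pass 2: walk the whole text again to split it into words.
	$words = preg_split( '/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY );

	// Only now is everything beyond the limit thrown away.
	return implode( ' ', array_slice( $words, 0, $max_words ) );
}

$short = sketch_two_pass_excerpt( '<p>one two <b>three</b> four five</p>', 3 );
```

The work done here grows with the size of the document rather than with the size of the excerpt, and the HTML formatting is lost along the way.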

it's not scheduled for merge yet, which is why it's still a draft. mostly I'm trying to explore existing places we can start to gather data that aren't as comprehensive and risky as replacing kses etc…

@dmsnell dmsnell force-pushed the html-api/explore-extracting-text-and-html-chunks branch from e3d35cf to 42bf6e6 Compare September 18, 2023 23:21
@dmsnell dmsnell force-pushed the html-api/explore-extracting-text-and-html-chunks branch from 42bf6e6 to e4d7d06 Compare September 18, 2023 23:22