WIP: HTML API: Extract previous text and HTML chunks while processing. #5208
base: trunk
Conversation
```php
public function get_previous_text_chunk() {
	if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
		$chunk = substr( $this->html, $this->last_position->end === 0 ? 0 : $this->last_position->end + 1 );
		$chunk = preg_replace( '/<[^a-z].*>/i', '', $chunk );
```
What is the replacement doing here?
One thing this raises is the need to stop at all tokens. Currently, inside `next_tag()` we skip over all non-tag content, including comments, DOCTYPE, and CDATA sections. The result is that we don't currently have a way to remove comments from the plain text. For now this was a cheap trick to eliminate comment matter, because at the point we have the text chunk we have no HTML tags, meaning that almost anything still surrounded by `<` and `>` is a form of a comment.
This is not meant to stay.
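For instance, with an illustrative input (not taken from the patch), the cheap trick strips leftover comment-like spans this way:

```php
// Illustrative only: once tag content has been consumed, the leftover
// angle-bracket spans in the chunk are comment-like, so the same
// preg_replace() used in the patch removes them.
$chunk = 'Hello <!-- an HTML comment --> world';
echo preg_replace( '/<[^a-z].*>/i', '', $chunk ); // "Hello  world"
```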
I added code to visit comments, similar to what I did in dmsnell#7, but I think it was revealing that we need to reorient some of the internal code to think more about tokens than tags, something I believe we can do just fine without changing any external interface.
I'll remove these so that the failing tests reappear.
@azaozz this is an interesting one because I think it could be one of those cases where we can go once through a string and bypass multiple passes with the existing approach. Certainly on WordPress.com we have some truncation functions that end up making multiple passes through an entire document just to grab the first 50 words or 200 characters or such, and in Core maybe not so much directly, but we have a number of filters for the post excerpt that look like they could be collapsed into one. Upon deeper inspection, I bet we have a number of places where we do things like strip tags in a first pass, then truncate in a second, whereas we could do this in a single straight pass. It's not scheduled for merge yet, which is why it's still a draft. Mostly I'm trying to explore existing places where we can start to gather data that aren't as comprehensive and risky as replacing
Notes
The motivation for this was to create something like `strip_tags()` or `wp_strip_all_tags()` but which can be limited to a given length; it can also be used to create an excerpt of a document that preserves the HTML up to a given text length.

Consider that you want to create a post excerpt with up to 200 words.
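A rough sketch of how that might look with the primitive explored here (the `get_previous_text_chunk()` name comes from the diff above; the loop and word counting are illustrative assumptions, not code from the patch):

```php
// Illustrative sketch, not part of the patch: accumulate the text that
// appeared before each visited tag until we have roughly 200 words.
$processor  = new WP_HTML_Tag_Processor( $post_content );
$excerpt    = '';
$word_count = 0;

while ( $word_count < 200 && $processor->next_tag() ) {
	// Text between the previously-visited tag and this one, similar to
	// what .textContent would expose for that span.
	$text_chunk  = $processor->get_previous_text_chunk();
	$excerpt    .= $text_chunk;
	$word_count += str_word_count( $text_chunk );
}
```

Because the processor stops advancing once the word budget is met, the rest of the document never needs to be parsed.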
Now this is neato because we've only parsed the input HTML up until the point where the `.textContent`/`.innerText` contains 200 words. In addition to that, we've preserved the HTML formatting that appeared up to and around those 200 words.

Furthermore, if all we want to do is run `strip_tags()` and get a plaintext form of the document, we get that too.

This explores the low-level primitives necessary to make this possible.
Description
The HTML API should be able to provide the ability to generate excerpts from HTML documents given a specific maximum length.
In this patch we're exploring the addition of text and HTML chunks that can be extracted while processing in order to do just this. The text chunks are similar to `.textContent` on the DOM while the HTML chunks contain raw and unprocessed HTML.

These functions should likely remain low-level in the Tag Processor and be exposed from the HTML Processor to ensure that proper semantics are heeded when extracting this information, such as how `PRE` tags ignore a leading newline inside their content or how `SCRIPT` and `STYLE` content isn't part of what we want with something like `strip_tags()`.

In the process of this work it's evident again that the Tag Processor ought to expose the ability to visit every token, and non-tag tokens should be classified. This has already been explored in dmsnell#7.
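As a rough illustration of the `strip_tags()`-style use (again borrowing the accessor name from the diff; the final call after the loop is an assumption based on the end-of-document branch shown in the quoted code):

```php
// Hypothetical sketch: concatenate the text between tags to approximate
// a plain-text rendering of the document.
$processor = new WP_HTML_Tag_Processor( $html );
$plaintext = '';

while ( $processor->next_tag() ) {
	$plaintext .= $processor->get_previous_text_chunk();
}

// Assumption from the quoted diff: once the whole document has been
// parsed, a final call returns any text remaining after the last tag.
$plaintext .= $processor->get_previous_text_chunk();
```

Note that at the Tag Processor level this would still include `SCRIPT` and `STYLE` contents; the HTML Processor layer described above is where those semantics would be applied.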
cc: @westonruter @ockham