text() always decodes HTML entities #22

kiwijam · 2020-06-11T13:41:56Z

As far as I can tell, there is currently no easy way to extract text while preserving its HTML entity encoding.

Having that option would be handy!

from selectolax.parser import HTMLParser
from html import escape

html = HTMLParser('<div>&#x3C;test&#x3E;</div>')
print(html.text())          # entities are decoded: <test>
print(escape(html.text()))  # re-escaping gives &lt;test&gt;, not the original &#x3C;test&#x3E;
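To spell out why re-escaping is not a substitute, here is a minimal standard-library sketch (it does not use selectolax at all). It shows that decoding is lossy: the numeric and named forms collapse to the same string, so nothing downstream can tell which one the source used.

```python
from html import escape, unescape

# Two different source encodings of the same characters.
numeric = "&#x3C;test&#x3E;"
named = "&lt;test&gt;"

# Decoding maps both to the same string, so the original form is lost.
decoded_numeric = unescape(numeric)
decoded_named = unescape(named)
print(decoded_numeric == decoded_named)

# Re-escaping always yields the named form, whatever the source used.
print(escape(decoded_numeric))
```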

rushter · 2020-06-12T15:00:34Z

I don't think I can control this, since Modest performs some preprocessing, but I could be wrong.

lexborisov · 2020-06-13T15:07:45Z

@kiwijam @rushter

In Modest, tokens carry buffer positions for attributes.
You can use these to get the raw data.
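The general idea of keeping buffer positions can be sketched in plain Python. This is a toy illustration only, not Modest's tokenizer or API: the `TextToken` type and `text_tokens` helper are hypothetical names, and the regex-based tag matching is a simplification that would not survive real-world HTML.

```python
import re
from html import unescape
from typing import List, NamedTuple


class TextToken(NamedTuple):
    start: int  # offset of the first character of the text run
    end: int    # offset one past the last character


def text_tokens(source: str) -> List[TextToken]:
    """Record the source offsets of every text run between tags (toy tokenizer)."""
    tokens = []
    pos = 0
    for tag in re.finditer(r"<[^>]*>", source):
        if tag.start() > pos:
            tokens.append(TextToken(pos, tag.start()))
        pos = tag.end()
    if pos < len(source):
        tokens.append(TextToken(pos, len(source)))
    return tokens


source = "<div>&#x3C;test&#x3E;</div>"
for token in text_tokens(source):
    raw = source[token.start:token.end]  # untouched slice of the input buffer
    print(raw, "->", unescape(raw))
```

Because each token stores only offsets, the raw bytes stay available by slicing the original buffer, while the decoded value can still be computed on demand.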

rushter · 2020-08-15T17:16:03Z

Added limited support for this in 0.2.7.

>>> html_parser = HTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

For now, this works for text nodes only.
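Since `raw_value` comes back as `bytes`, a caller will usually want to decode it before comparing it with other text. The sketch below uses only the standard library and the literal value from the session above. The UTF-8 encoding and the `uses_entities` check are assumptions made for illustration, not something selectolax provides.

```python
from html import unescape

# Value copied from the session above; raw_value is a bytes object.
raw_value = b"&#x3C;test&#x3E;"

# Assumption: the source document was UTF-8 encoded.
raw_text = raw_value.decode("utf-8")

# If decoding entities changes the string, the source text contained entities.
decoded = unescape(raw_text)
uses_entities = raw_text != decoded

print(raw_text)
print(decoded)
print(uses_entities)
```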

ichux · 2020-08-15T17:23:06Z

Thanks for your work. How can I help maintain the library? I would like to contribute so that more features can be added.

rushter · 2020-08-15T17:49:01Z

Well, it's open source. You are welcome to propose new features or improve existing ones.

You could extend the new raw_value feature to support arbitrary nodes.
That's a fairly easy task, but you will need to be familiar with C and the Modest library.

rushter mentioned this issue Mar 18, 2021

Unescaping escaped text within html #34

Closed
