text() always decodes HTML entities #22

Open
kiwijam opened this issue Jun 11, 2020 · 5 comments

kiwijam commented Jun 11, 2020

As far as I can tell, there's no easy way to extract text but preserve HTML entity encoding at the moment.

Having that option would be handy!

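from html import escape

from selectolax.parser import HTMLParser

html = HTMLParser('<div>&#x3C;test&#x3E;</div>')
# text() decodes the entities, so the original encoding is lost
print(html.text())
# escape() re-encodes the text, but not in the original numeric form
print(escape(html.text()))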
rushter (Owner) commented Jun 12, 2020

I don't think I can control this, since Modest performs some preprocessing, but I could be wrong.

lexborisov commented

@kiwijam @rushter

In Modest, tokens store buffer positions for attributes.
You can use these to get the raw data.
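
To illustrate the idea of recovering raw data from buffer positions, here is a minimal sketch in plain Python. It does not use Modest or selectolax: `raw_slice` is a hypothetical helper, and the offsets are located with `bytes.find()` rather than read from a token, so treat it as an analogy only.

```python
# Hypothetical illustration: the offsets below are found with bytes.find(),
# not read from Modest's token structures.
source = b'<div>&#x3C;test&#x3E;</div>'


def raw_slice(buffer: bytes, begin: int, length: int) -> bytes:
    """Return the undecoded bytes that a (begin, length) pair points at."""
    return buffer[begin:begin + length]


begin = source.find(b'>') + 1            # first byte after '<div>'
length = source.find(b'</div>') - begin  # bytes up to the closing tag
raw_text = raw_slice(source, begin, length)
print(raw_text)  # b'&#x3C;test&#x3E;'
```

Because the slice is taken from the original input buffer, the entities come back exactly as they were written.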

rushter (Owner) commented Aug 15, 2020

Added limited support for this in 0.2.7.

>>> html_parser = HTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

For now, this is limited to text nodes.
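
The two values in the session above decode to the same text but are not interchangeable. A small standard-library sketch, assuming the values are copied verbatim from that session (it does not call selectolax itself):

```python
from html import escape, unescape

raw_value = b'&#x3C;test&#x3E;'  # bytes copied from the session above
serialized = '&lt;test&gt;'      # string copied from the session above

# Both forms decode to the same text.
decoded = unescape(raw_value.decode('utf-8'))
print(decoded)                          # <test>
print(unescape(serialized) == decoded)  # True

# Re-escaping yields named entities, not the original numeric ones.
print(escape(decoded) == serialized)                 # True
print(escape(decoded).encode('utf-8') == raw_value)  # False
```

This is why re-escaping the decoded text cannot stand in for raw_value when the original entity encoding matters.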

ichux commented Aug 15, 2020

Thanks for the work you've done. How can I help maintain the library? I would like to help so that more features can be added.

rushter (Owner) commented Aug 15, 2020

Well, it's open source, so you are welcome to propose new features or improve existing ones.

For example, you could extend the new raw_value feature to support arbitrary nodes.
That's a fairly easy task, but you will need to be familiar with C and the Modest library.
