Issue with etree.HTML related to 4.2 #71
Comments
Hi, I have no tests for structures generated by the HTML parser of the lxml library, and that is the main reason for the regression. I will add your tests and try to fix the problem. More comments will follow after a first analysis. Thank you |
I just found something. If doctype is html, the root is not set to the root of the tree but to an internal element, which v4.2.0 treats as a fragment.
> /home/brunato/Development/elementpath/tests/test_xpath31.py(1157)test_regression_ep415_ep420()
-> res = elementpath.select(document, query, parser=XPath3Parser)
(Pdb) query
'if (count(//hotel/branch/staff) = 5) then true() else false()'
(Pdb) document
<Element hotel at 0x7f4df48323c0>
(Pdb) document[0]
<Element branch at 0x7f4df34fbbc0>
(Pdb) doctype
'html'
(Pdb) document.parent
*** AttributeError: 'lxml.etree._Element' object has no attribute 'parent'
(Pdb) document.getparent()
<Element body at 0x7f4df35c2cc0>
(Pdb) document.getparent().getparent()
<Element html at 0x7f4df38e0340>
(Pdb) document.getroottree()
<lxml.etree._ElementTree object at 0x7f4df35204c0>
(Pdb) document.getroottree().getroot()
<Element html at 0x7f4df3522000>
as a workaround providing With fragments v4.1.5 doesn't recognize the fragment, but it also misses the effective root of the HTML data. A fix is probably needed in v4.2.1 to restore the old behavior unless |
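The debugger session above can be reproduced with a small sketch, assuming lxml is installed. The `MARKUP` string and the `ancestry` helper are illustrative names invented here, not part of lxml or elementpath. The sketch only shows that the element handed to the selector is not the root of the tree that the HTML parser built.

```python
try:
    from lxml import etree
except ImportError:  # lxml is a third-party package; without it the demo is skipped
    etree = None

# Hypothetical sample data, shaped like the hotel document in the issue.
MARKUP = "<hotel><branch><staff>Ann</staff></branch></hotel>"


def ancestry(element):
    """Return the tag names from the tree root down to `element`."""
    tags = [element.tag]
    parent = element.getparent()
    while parent is not None:
        tags.append(parent.tag)
        parent = parent.getparent()
    return list(reversed(tags))


if etree is not None:
    html_root = etree.HTML(MARKUP)        # the HTML parser wraps the data in html/body
    hotel = html_root.find("body/hotel")  # the inner element used as `document` above
    print(ancestry(hotel))                # chain of tags from the real root to <hotel>
    print(hotel.getroottree().getroot() is html_root)
```

Passing `hotel` to a selector therefore hands over a subtree whose ancestors still exist, which matches the fragment situation described in the comment.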
Thank you for your research! (FYI |
- Fix for issue #71: create a dummy document unless fragment=True is provided;
- Add uri and fragment arguments to selector API as kwargs;
- Uniformate type annotations for dynamic context root.
Hi, this setting cuts off the effective root of the HTML, including Check whether v4.2.1 also fixes the regression in the way your repo uses it and, if so, close this issue. Thank you |
Thank you so much! I can confirm version 4.2.1 passed all the tests and others! |
EDIT: BTW, I closed this issue first. I tried the suggestion with the test code above,
the error message is |
Sorry, the right code to get the root element of the HTML document is |
Thank you so much! I will post some thoughts about the function and your suggestion later. It is quite a weird thing that can happen to your users, but I believe it is worth showing you. It will be quite long, with some example code, so I will probably post it next week. Thank you for your help. |
Before you read this, I apologize for any sentences that are hard to read; I tried to compensate with code. I'm also not an expert in this area, so if I'm wrong, please feel free to tell me!

The reason I used this library in an unusual way is that I needed XPath 2.0-3.1 support for both HTML and XML while etree.XML was not allowed for XML security reasons (this was a requirement of the maintainer). Given those requirements, I tried my best to build a "document-neutral general XPath parsing tool" like Xidel. I will describe some details of XML, HTML, and rendering engines. The area is full of surprises and traps for novices. Basically, none of this is related to the quality of elementpath.

Pipeline: server -> (browser engine, e.g. using Playwright: optional) -> requests -> lxml -> elementpath

HTMLfied XML

The PR I made for the other repo is used with two types of source producers: 1. the rendering engine; 2. Requests (something like curl; it doesn't render).

XML distinguishes uppercase from lowercase in tag names, but HTML is case-insensitive (<HTML> and <html> are the same to HTML).

The rendering engine will modify the source. The simple Flask server code:
(code)
from flask import Flask, render_template_string, Response
app = Flask(__name__)
#without <?xml-stylesheet type="text/css" href="/style"?>
#http://127.0.0.1:5000/no_style
xml_wo_stylesheet="""<?xml version="1.0"?>
<!-- XML demonstration -->
<!DOCTYPE earth>
<earth>
<mountain>
<name>Everest</name>
<place>Nepal</place>
<height>8,848</height>
</mountain>
<mountain>
<name>K2</name>
<place>Pakistan</place>
<height>8,611</height>
</mountain>
<mountain>
<name>Kangchenjunga</name>
<place>Nepal</place>
<height>8,586</height>
</mountain>
</earth>
"""
#with <?xml-stylesheet type="text/css" href="/style"?>
#http://127.0.0.1:5000/
xml_w_stylesheet="""<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="/style"?>
<!-- XML demonstration -->
<!DOCTYPE earth>
<earth>
<mountain>
<name>Everest</name>
<place>Nepal</place>
<height>8,848</height>
</mountain>
<mountain>
<name>K2</name>
<place>Pakistan</place>
<height>8,611</height>
</mountain>
<mountain>
<name>Kangchenjunga</name>
<place>Nepal</place>
<height>8,586</height>
</mountain>
</earth>
"""
#uppercase
#without <?xml-stylesheet type="text/css" href="/style_upper"?>
#http://127.0.0.1:5000/uppercase_no_style
xml_wo_stylesheet_uppercase="""<?xml version="1.0"?>
<!-- XML demonstration -->
<!DOCTYPE earth>
<EARTH>
<MOUNTAIN>
<NAME>Everest</NAME>
<PLACE>Nepal</PLACE>
<HEIGHT>8,848</HEIGHT>
</MOUNTAIN>
<MOUNTAIN>
<NAME>K2</NAME>
<PLACE>Pakistan</PLACE>
<HEIGHT>8,611</HEIGHT>
</MOUNTAIN>
<MOUNTAIN>
<NAME>Kangchenjunga</NAME>
<PLACE>Nepal</PLACE>
<HEIGHT>8,586</HEIGHT>
</MOUNTAIN>
</EARTH>
"""
#uppercase
#with <?xml-stylesheet type="text/css" href="/style_upper"?>
#http://127.0.0.1:5000/uppercase
xml_w_stylesheet_uppercase="""<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="/style_upper"?>
<!-- XML demonstration -->
<!DOCTYPE earth>
<EARTH>
<MOUNTAIN>
<NAME>Everest</NAME>
<PLACE>Nepal</PLACE>
<HEIGHT>8,848</HEIGHT>
</MOUNTAIN>
<MOUNTAIN>
<NAME>K2</NAME>
<PLACE>Pakistan</PLACE>
<HEIGHT>8,611</HEIGHT>
</MOUNTAIN>
<MOUNTAIN>
<NAME>Kangchenjunga</NAME>
<PLACE>Nepal</PLACE>
<HEIGHT>8,586</HEIGHT>
</MOUNTAIN>
</EARTH>
"""
@app.route('/')
def index():
    # return Response(xml_w_stylesheet, mimetype='text/html')
    return Response(xml_w_stylesheet, mimetype='text/xml')

@app.route('/no_style')
def index2():
    # return Response(xml_wo_stylesheet, mimetype='text/html')
    return Response(xml_wo_stylesheet, mimetype='text/xml')

@app.route('/uppercase')
def index3():
    # return Response(xml_w_stylesheet_uppercase, mimetype='text/html')
    return Response(xml_w_stylesheet_uppercase, mimetype='text/xml')

@app.route('/uppercase_no_style')
def index4():
    # return Response(xml_wo_stylesheet_uppercase, mimetype='text/html')
    return Response(xml_wo_stylesheet_uppercase, mimetype='text/xml')

@app.route('/style', methods=['POST','GET'])
def css():
    stylesheet = """
:earth:before {
display: block;
font-weight: bold;
font-size: 300%;
content: "Mountains";
background-color: black;
}
earth {
display: block;
margin: 2em 1em;
border: 6px solid black;
padding: 0px 1em;
background-color: grey;
}
mountain {
display: block;
margin-bottom: 1em;
}
name {
display: block;
font-weight: bold;
font-size: 100%;
}
place {
display: block;
}
place:before {
content: "Place: ";
}
height {
display: block;
}
height:before {
content: "Height: ";
}
height:after {
content: " m";
}
"""
    return Response(stylesheet, mimetype='text/css')

@app.route('/style_upper', methods=['POST','GET'])
def css2():
    stylesheet = """
:EARTH:before {
display: block;
font-weight: bold;
font-size: 300%;
content: "MOUNTAINS";
background-color: black;
}
EARTH {
display: block;
margin: 2em 1em;
border: 6px solid black;
padding: 0px 1em;
background-color: grey;
}
MOUNTAIN {
display: block;
margin-bottom: 1em;
}
NAME {
display: block;
font-weight: bold;
font-size: 100%;
}
PLACE {
display: block;
}
PLACE:before {
content: "Place: ";
}
HEIGHT {
display: block;
}
HEIGHT:before {
content: "Height: ";
}
HEIGHT:after {
content: " m";
}
"""
    return Response(stylesheet, mimetype='text/css')

This is the code that collects the output of the browser engine and of requests.
(code)
import requests
from playwright.sync_api import Playwright, sync_playwright, expect
web_page_list = [
"http://127.0.0.1:5000/",
"http://127.0.0.1:5000/no_style",
"http://127.0.0.1:5000/uppercase",
"http://127.0.0.1:5000/uppercase_no_style",
]
def run(playwright: Playwright) -> None:
    # Fetch every page through a real browser engine and print the rendered source.
    browser = playwright.chromium.launch(headless=False)
    # browser = playwright.firefox.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    for web_page in web_page_list:
        print(f'{web_page!r}#####################################')
        page.goto(web_page)
        print(page.content())
    context.close()
    browser.close()

with sync_playwright() as playwright:
    print("playwright module")
    run(playwright)

# Fetch the same pages without rendering, for comparison.
print("requests module")
for web_page in web_page_list:
    print(f'{web_page!r}#####################################')
    res = requests.get(web_page)
    print(res.content)

lxml.etree.HTML and lxml.etree.XML also modify or recover the source.
(code)
from lxml import etree, html
doc = """<PERSON>
<NAME>Constantin Hong</NAME>
<HEAD>empty</HEAD>
<BODY>clothes</BODY>
</PERSON>"""
print(doc)
print(etree.HTML(doc))
print(html.document_fromstring(doc))
print(html.fromstring(doc, parser=etree.HTMLParser()))
print(etree.fromstring(doc))
print(etree.fromstring(doc, parser=etree.HTMLParser()))
print(f"{etree.tostring(etree.HTML(doc))=}")
print(f"{etree.tostring(html.document_fromstring(doc))=}")
print(f"{etree.tostring(html.fromstring(doc, parser=etree.HTMLParser()))=}")
print(f"{etree.tostring(etree.fromstring(doc))=}")
print(f"{etree.tostring(etree.fromstring(doc, parser=etree.HTMLParser()))=}")

The web inspectors of Safari and Chrome show the XML wrapped inside HTML if the source doesn't contain a stylesheet, whereas Firefox shows the XML without HTMLfying it. Therefore, with a rendering engine, your suggestion might not work in uncontrolled circumstances like mine.

The rendering engine and lxml.etree.HTML add html tags in some cases. This is inconvenient for the user, because for HTMLfied XML the XPath has to be lowercase too (e.g. /EARTH/MOUNTAIN/NAME -> /earth/mountain/name). (Again, this is not related to the quality of elementpath. I'm sharing my experience just in case someone sends you a weird issue like mine.)

So, if a user combines the wrong rendering engine and document, the result will be wrong. But that is absolutely not about elementpath.

The one thing I defend the function( But that( I will post my thoughts about the function as soon as possible. Thank you! |
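The letter-case point can be illustrated with the standard library alone. This is a minimal sketch: `xml.etree.ElementTree` and `html.parser` stand in for the XML and HTML parsers of lxml, and `XML_DOC` and `TagCollector` are names made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A reduced version of the uppercase sample document used above.
XML_DOC = "<EARTH><MOUNTAIN><NAME>Everest</NAME></MOUNTAIN></EARTH>"

# An XML parser keeps tag names exactly as written, so paths are case-sensitive.
root = ET.fromstring(XML_DOC)
upper_hits = root.findall("./MOUNTAIN/NAME")
lower_hits = root.findall("./mountain/name")


class TagCollector(HTMLParser):
    """Record start-tag names as an HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# An HTML parser normalizes tag names to lowercase before reporting them.
collector = TagCollector()
collector.feed(XML_DOC)
collector.close()

print(len(upper_hits), len(lower_hits))
print(collector.tags)
```

Once the same source has passed through an HTML parser, only the lowercase path can match, which is why the query has to follow the parser that produced the tree.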
Hi, maybe some small test cases of HTML data plus expected results would help to understand how elementpath can be used or changed to obtain the proper processing. If you want to contribute code, you can propose a PR that adds a new test module in the 'tests/' directory (e.g.
import unittest
try:
    import lxml.html as lxml_html
except ImportError:
    lxml_html = None

@unittest.skipIf(lxml_html is None, 'lxml is not installed ...')
class TestHtmlData(unittest.TestCase):
    ...

In this case HTML data could be provided using strings instead of files. About the argument For specific cases one can build the node tree using |
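One possible shape for such a test module is sketched below. It assumes both lxml and elementpath are installed and skips otherwise. The `HOTEL_HTML` string and the test names are hypothetical, and the expected values reflect the behavior reported for v4.1.5 and v4.2.1 in this thread rather than a guarantee for every release.

```python
import unittest

try:
    import lxml.html as lxml_html
except ImportError:
    lxml_html = None

try:
    import elementpath
except ImportError:
    elementpath = None

# Hypothetical HTML data provided as a string instead of a file.
HOTEL_HTML = "<hotel><branch><staff>Ann</staff><staff>Bob</staff></branch></hotel>"


@unittest.skipIf(lxml_html is None or elementpath is None,
                 "lxml or elementpath is not installed")
class TestHtmlData(unittest.TestCase):

    def setUp(self):
        # document_fromstring returns the <html> root that the parser adds.
        self.document = lxml_html.document_fromstring(HOTEL_HTML)

    def test_descendant_query_selects_staff(self):
        staff = elementpath.select(self.document, "//hotel/branch/staff")
        self.assertEqual([e.text for e in staff], ["Ann", "Bob"])

    def test_count_of_staff(self):
        result = elementpath.select(self.document, "count(//hotel/branch/staff)")
        self.assertEqual(result, 2)
```

Saved under the tests directory, a module like this can be run with `python -m unittest`.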
Annotation fixed. I may still misinterpret the specs or unintentionally hide some logic, so please feel free to tell me.

I initially thought of HTML as a subset of XML with additional syntactic allowances. That is quite unreliable, because the character ranges allowed by XML, DOM, and HTML can change. At the least, the supported Unicode range differs between XML editions, and XPath's capability relies on the XML parser's capability: if the XML parser supports the fourth edition but not the fifth, XPath cannot process a fifth-edition document.

> This document is a W3C Recommendation. This fifth edition is not a new version of XML. As a convenience to readers, it incorporates the changes dictated by the accumulated errata (available at http://www.w3.org/XML/xml-V10-4e-errata) to the Fourth Edition of XML 1.0, dated 16 August 2006. In particular, erratum [E09] relaxes the restrictions on element and attribute names, thereby providing in XML 1.0 the major end user benefit currently achievable only by using XML 1.1. [1]

Therefore, I'm ignoring Unicode, because it's not good to rely on it for an argument, and I chose another approach. My strategy is a procrustean bed for HTML syntax. An XML element has two element node syntaxes. [2]
One is the "start-tag and end-tag" pair; the second is the empty-element tag. According to various HTML specs [4] ([5]), there are several syntaxes for HTML element nodes.
Some element node syntax is valid in HTML but not valid in XML (a "VHNVX node": Valid Html element Node syntax, Not Valid Xml element syntax). [6] This concerns the kinds of element syntax and attribute syntax within the element node. The presence of a VHNVX node means that the HTML parser must recognize that specific element node and reduce it to a valid XML element node, or at least have that ability. For normal elements [7], with certain exceptions, the start and end tags must not be omitted, and a user-defined element node must follow the same rule. So my argument is that if lxml supports HTML elements in those various syntaxes together with the HTML element rules, lxml provides the HTML as an etree without losing meaningful information. (Of course, before lxml parses it, the document must be valid, and the document provider must recognize that the document is as intended.)
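A small standard-library sketch of one such node follows. The choice of `<br>` as the example is an assumption made here, and `is_wellformed_xml` and `EventLog` are names invented for the illustration. An unclosed `<br>` is rejected by an XML parser, while an HTML parser reports it without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<p>first line<br>second line</p>"    # <br> has no end tag
XML_SAFE = "<p>first line<br/>second line</p>"  # empty-element tag form


def is_wellformed_xml(text):
    """Return True if `text` parses as a well-formed XML element."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class EventLog(HTMLParser):
    """Record the start and end tags that the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


log = EventLog()
log.feed(SNIPPET)
log.close()

print(is_wellformed_xml(SNIPPET), is_wellformed_xml(XML_SAFE))
print(log.events)
```

A tree-building HTML parser such as the one in lxml goes one step further and turns such a tag into an ordinary empty element, which is the "reduction" the paragraph above describes.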
Any valid XML document can be parsed with XPath 2.0-3.1 (and 4.0). The XPath parser can traverse whatever nodes are in a node tree. Also, setting my logic aside, Saxonica supports XPath 3.1 for HTML. [16] Unless my logic is completely wrong, a test for the VHNVX element node is needed. So the XPath 2.0, 3.0, 3.1, and 4.0 tests for HTML are just about accessing VHNVX element nodes. (Personally, I also want to add a test for complex element nodes in HTML; XML has them too, and lxml supports them well as nodes.) EDIT: I will submit the HTML tests later.
|
Also for the "fragment", I meant that I need to create a dependency injection function for that repo. I said something wrong previously. That is my responsibility. You don't have to take any action for that. Thank you for your advice! |
EDIT: I'm sorry, I'm investigating again. Done!
EDIT: I reposted the code! Thank you!
Thank you for the great tool!
With the recent update (4.2), I noticed a regression(?) related to etree.HTML. The code below shows the difference with version 4.2 and prints the ideal result with version 4.1.5 (pip install elementpath==4.1.5).
I wrote the test code for another repo (https://github.com/dgtlmoon/changedetection.io/blob/master/changedetectionio/tests/test_xpath_selector_unit.py) against version 4.1.5, and it passed before.
The ability of Elementpath to query HTML documents with XPath 2.0-3.1 is mind-blowing and valuable to the related community. Non-Python tools such as xidel support parsing HTML files too, but the only good Python library with this ability is Elementpath, and if I hadn't known about this repo, I wouldn't have even tried to learn XPath in the first place.
The noticeable query is '//hotel/branch/staff'. Because the etree.HTML parser somehow fixed the HTML to be well-formed, the structure became /html/body/hotel/branch/staff (or something similar), so '//hotel/branch/staff' should work, and it actually worked before (4.1.5). The new version behaves strangely here. I'm sorry in advance; I'm still not good at XPath parser theory.