Again, I'm not an expert. I tried my best, but I may still have misinterpreted the specs or unintentionally skipped part of my reasoning. Please feel free to correct me.
I initially thought of HTML as a subset of XML with some additional syntactic allowances.
That view is unreliable, because the character ranges allowed by XML, the DOM, and HTML can change.[^1]
At the very least, the supported Unicode range differs between editions of XML 1.0, and what XPath can do depends on what the XML parser can do. If the XML parser accepts the Fourth Edition but not the Fifth, XPath cannot process a document that relies on the Fifth Edition.
> This document is a W3C Recommendation. This fifth edition is not a new version of XML. As a convenience to readers, it incorporates the changes dictated by the accumulated errata (available at http://www.w3.org/XML/xml-V10-4e-errata) to the Fourth Edition of XML 1.0, dated 16 August 2006. In particular, erratum [E09] relaxes the restrictions on element and attribute names, thereby providing in XML 1.0 the major end user benefit currently achievable only by using XML 1.1.
Therefore, I'm setting Unicode aside, because it is not a stable basis for an argument.
So I chose another approach. My strategy is **the Procrustean bed for HTML syntax**.
An XML element can be written with one of two syntaxes.[^4]
> Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag.
The first is the "start-tag and end-tag" pair. The second is the empty-element tag, like "`<br/>`".[^5]
So in XML there is no such thing as a "`<br>`" without a closing tag.
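The two XML forms can be checked with a strict XML parser. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `br` examples and the helper name `is_well_formed` are my own illustration, not taken from the specs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p><br></br></p>"))  # start-tag and end-tag
print(is_well_formed("<p><br/></p>"))      # empty-element tag
print(is_well_formed("<p><br></p>"))       # start-tag with no end-tag
```

Only the third call is rejected: the parser reaches `</p>` while `br` is still open.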
According to various HTML specs[^17][^2], there are several syntaxes for HTML element nodes.
> Tags are used to delimit the start and end of elements in the markup. Raw text, escapable raw text, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described below in the section on optional tags. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
Some syntax is Valid for an Html element Node but Not Valid for an Xml element node (a "**VHNVX node**").[^6] The difference lies in the kinds of element syntax and attribute syntax allowed in an element node. The presence of a VHNVX node means that the HTML parser must recognize that specific element node and reduce it to a valid XML element node, or at least be able to.
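To make "recognize" and "reduce" concrete, here is a rough sketch built on Python's standard `html.parser`. The class `XmlReducer` and the function `reduce_to_xml` are hypothetical names of mine, and this is not how lxml or libxml2 work internally. It handles only void elements and HTML attribute syntax; it makes no attempt to restore omitted optional tags.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Void elements listed in the WHATWG HTML syntax section.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class XmlReducer(HTMLParser):
    """Hypothetical helper: re-emit HTML tokens as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # An empty attribute such as `disabled` arrives with value None.
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        closing = "/>" if tag in VOID_ELEMENTS else ">"
        self.parts.append(f"<{tag}{attr_text}{closing}")

    def handle_endtag(self, tag):
        # Void elements were already closed in handle_starttag.
        if tag not in VOID_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))


def reduce_to_xml(markup: str) -> str:
    reducer = XmlReducer()
    reducer.feed(markup)
    reducer.close()
    return "".join(reducer.parts)


reduced = reduce_to_xml("<p>Hi<br><input disabled value=yes></p>")
print(reduced)  # <p>Hi<br/><input disabled="" value="yes"/></p>
root = ET.fromstring(reduced)  # now accepted by a strict XML parser
```

Note that `<br>` and `<br/>` reduce to the same output, so the original spelling cannot be recovered from the result.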
Except for certain normal elements, the start and end tags of normal elements must not be omitted.[^3] A user-defined element must follow this rule as well.
So an element node must have at least one tag, and only elements predefined in the HTML spec may omit tags.[^6]
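The rule can be written down as a small lookup. This is a simplified sketch: the function name `end_tag_rule` is mine, the two sets are transcribed from the WHATWG syntax section, and "optional" here ignores the context conditions the spec attaches to each omission. Raw text and foreign elements are not treated specially.

```python
# Void elements: a start tag only, the end tag is forbidden.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Elements whose end tag may be omitted, subject to the conditions
# in the spec's "optional tags" section (conditions not modelled here).
OPTIONAL_END_TAG = frozenset({
    "html", "head", "body", "li", "dt", "dd", "p", "rt", "rp",
    "optgroup", "option", "colgroup", "caption",
    "thead", "tbody", "tfoot", "tr", "td", "th",
})


def end_tag_rule(name: str) -> str:
    """Classify an HTML element name as 'forbidden', 'optional' or 'required'."""
    name = name.lower()
    if name in VOID_ELEMENTS:
        return "forbidden"
    if name in OPTIONAL_END_TAG:
        return "optional"
    return "required"


for tag in ("br", "li", "div", "my-widget"):
    print(tag, end_tag_rule(tag))
```

A user-defined name such as `my-widget` falls through to "required", which matches the point above: only predefined elements get to omit tags.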
So, apart from some HTML element nodes, every element node conforms to XML element syntax by default. If some devil picked one element out of an XHTML document, showed it to me, and asked me to decide without any context whether it is an HTML element or an XML element, I couldn't.
So my argument is that if lxml supports these various HTML element syntaxes and element rules, lxml provides the HTML as an etree without losing meaningful information. (Of course, before lxml parses it, the document must be valid, and the document provider must have confirmed that the document is as intended.)
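```python
from lxml import etree, html
import elementpath
from elementpath.xpath30 import XPath30Parser

parser = XPath30Parser

# This attribute syntax in an element node is allowed in HTML.
# https://html.spec.whatwg.org/multipage/syntax.html#attributes-2
# The fragment is an illustrative assumption: an empty attribute and an
# unquoted attribute value on a void element with no end tag.
fragments = """
<input disabled value=yes>
"""

# Both HTML entry points accept the fragment.
el_h = etree.HTML(fragments)
H_parser = etree.HTMLParser()
el_h_f = html.fromstring(fragments, parser=H_parser)

# The XML parser rejects the same fragment.
try:
    el_x = etree.XML(fragments)
except etree.XMLSyntaxError as exc:
    print("XML parser error:", exc)
```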
BTW, lxml already does this, which can be shown with XPath 1.0. But when a VHNVX node is parsed by `etree.HTMLParser` or `etree.HTML`, the node is modified to conform to XML element syntax. (This is not reversible. However, that is not important for querying right now.)
Any well-formed XML document can be queried with XPath 2.0 through 3.1 (and 4.0).
Also, if some devil shows me one element from an XHTML document and requires me to decide whether it is an HTML element or an XML element, I can't, because the element could sit inside an XML document that contains XHTML (for example, a sitemap XML in which each element is XHTML, `/sitemap/site[1]`). So a non-VHNVX HTML node can be queried with XPath 2.0 through 3.1.
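The sitemap case can be sketched with the standard library alone. The document below is a made-up example, and `xml.etree.ElementTree` implements only a subset of XPath 1.0, so `site[1]` is evaluated relative to the root element as a stand-in for `/sitemap/site[1]`.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sitemap-like document whose <site> entries hold XHTML content.
doc = f"""<sitemap>
  <site><p xmlns="{XHTML_NS}">First<br/>page</p></site>
  <site><p xmlns="{XHTML_NS}">Second page</p></site>
</sitemap>"""

root = ET.fromstring(doc)

# Relative to the root <sitemap>, this corresponds to /sitemap/site[1].
first_site = root.find("site[1]")
paragraph = first_site.find(f"{{{XHTML_NS}}}p")

print(paragraph.tag)                   # {http://www.w3.org/1999/xhtml}p
print("".join(paragraph.itertext()))   # Firstpage
```

Apart from its namespace, nothing marks the XHTML `p` as different from any other XML element in the tree, so the query engine has no reason to treat it differently.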
An XPath processor can traverse whatever nodes are in a node tree.
Every version of the XPath data model (XDM) for XPath 2.0, 3.0, 3.1, and 4.0 supports items. According to the XDM definitions[^11][^12][^13][^16], a node is a kind of item,[^7][^8][^9][^10] and an XML element node fits that definition. So, if the HTML parser can turn a VHNVX node into a normal XML element node, XPath 2.0, 3.0, 3.1, and 4.0 can query it.
My logic aside, Saxonica supports XPath 3.1 for HTML.[^14]
Unless my logic is completely wrong, what needs testing is the VHNVX element node. So the XPath 2.0, 3.0, 3.1, and 4.0 tests for HTML come down to accessing VHNVX element nodes. (Personally, I would also like to add a test for complex element nodes in HTML; XML has them too, and lxml handles them well as nodes.)
[^1]: https://www.w3.org/TR/xml/
[^2]: https://www.w3.org/TR/2012/WD-html-markup-20121025/syntax.html#tag-name
[^3]: https://html.spec.whatwg.org/multipage/syntax.html#optional-tags
[^4]: https://www.w3.org/TR/xml/ : 3 Logical Structures
[^5]: https://www.w3.org/TR/xml/ : 3 Logical Structures : Examples of empty elements
[^6]: https://html.spec.whatwg.org/multipage/syntax.html#elements-2
[^7]: https://qt4cg.org/specifications/xpath-datamodel-40/Overview.html#Node
[^8]: https://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/#Node
[^9]: https://www.w3.org/TR/xpath-datamodel-30/#Node
[^10]: https://www.w3.org/TR/xpath-datamodel-31/#Node
[^11]: https://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/#terminology
[^12]: https://www.w3.org/TR/xpath-datamodel-30/#terminology
[^13]: https://www.w3.org/TR/xpath-datamodel-31/#terminology
[^14]: https://www.saxonica.com/saxon-js/documentation2/index.html#!api/xpathEvaluate
[^15]: https://www.w3.org/TR/xml/
[^16]: https://qt4cg.org/specifications/xpath-datamodel-40/Overview.html#basic-concepts
[^17]: https://html.spec.whatwg.org/multipage/syntax.html#normal-elements