Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: add Node methods Ancestors, ChildNodes, Descendants #62113

Closed
earthboundkid opened this issue Aug 17, 2023 · 40 comments
Closed

x/net/html: add Node methods Ancestors, ChildNodes, Descendants #62113

earthboundkid opened this issue Aug 17, 2023 · 40 comments

Comments

@earthboundkid
Copy link
Contributor

earthboundkid commented Aug 17, 2023

This is assuming #61405 is accepted.

Iterating through all the children of an html.Node is tedious. The current code has this example of a recursive function printing out links:

var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				fmt.Println(a.Val)
				break
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		f(c)
	}
}
f(doc)

It would be much nicer with an iterator:

for n := range doc.All() {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				fmt.Println(a.Val)
				break
			}
		}
	}
}
@gopherbot gopherbot added this to the Proposal milestone Aug 17, 2023
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Aug 17, 2023
@ianlancetaylor
Copy link
Contributor

CC @neild

@earthboundkid
Copy link
Contributor Author

I've been using an external library with this, and it's nice. It's also handy to have Node.Children(), which only returns the immediate children of a Node.

@jimmyfrasche
Copy link
Member

We should be careful with names here and stick to the DOM as closely as possible (if only to avoid confusing anyone hopping back and forth between Go and JS).

Specifically, https://developer.mozilla.org/en-US/docs/Web/API/Element/children is all child nodes that are elements.

It sounds like this would be closer to https://developer.mozilla.org/en-US/docs/Web/API/Node/childNodes

I would be happy to have both ChildNodes and Children as the latter is often what you want in practice.

@earthboundkid
Copy link
Contributor Author

Reading the docs, it seems like the difference between childNodes and children is that children doesn't have text elements or comment nodes. That's not the distinction I'm making. n.All() would do a full tree iteration including grandchild nodes of whatever depth. ChildNodes is a fine name for the immediate children of a Node though.

Another iterator I use is Parents(). This is less necessary and easier to write on your own, but it can be handy.

@jimmyfrasche
Copy link
Member

Sorry, for being unclear. I have no issue with All for the entire (sub)tree. Just wanted to raise the issue of the existing Children/ChildNodes names/distinction for the direct child iterator(s).

@earthboundkid
Copy link
Contributor Author

I opened golang/net#215.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/594195 mentions this issue: html: add Node.All(), Node.ChildNodes(), and Node.Parents()

@earthboundkid earthboundkid changed the title proposal: x/net/html: add Node.All() iterator proposal: x/net/html: add Node.All(), Node.ChildNodes(), Nodes.Parents() iterators Jun 24, 2024
@rsc rsc moved this from Incoming to Active in Proposals Jul 25, 2024
@rsc
Copy link
Contributor

rsc commented Jul 25, 2024

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@neild
Copy link
Contributor

neild commented Jul 25, 2024

To bikeshed a little:

ChildNodes and Children match DOM attribute names, which seems fine. (I think ChildElements is clearer than Children, but perhaps aligning to the DOM naming is best.)

All does not say "recursively walk all descendants" to me. Maybe Walk? DescendantNodes?

Parents sounds to me like a node can have multiple parents. Perhaps Ancestors? Also, is this operation common enough to be worth an iterator when the manual traversal is pretty straightforward and the DOM doesn't contain an equivalent function or attribute (so far as I can see)?

for parent := node.Parent; parent != nil; parent = parent.Parent {
}

@earthboundkid
Copy link
Contributor Author

the DOM doesn't contain an equivalent function or attribute (so far as I can see)?

In JavaScript, this is called "closest" and takes a query selector.

Having it as an iterator lets you build other tools on top of it that take selectors and return the first matching node or all matching nodes. I don't think x/net/html needs to be that kind of opinionated library, but it's nice to have the obvious iterators already written in it, so you don't have to jump back and forth as much.

@rsc
Copy link
Contributor

rsc commented Aug 14, 2024

The APIs being suggested are:

func (n *Node) Parents() iter.Seq[*Node]
func (n *Node) ChildNodes() iter.Seq[*Node] 
func (n *Node) All() iter.Seq[*Node] 

I see the distinction in Mozilla docs about Children vs ChildNodes, but really, we already called the field 'FirstChild' not 'FirstChildNode', so we don't need to introduce 'Nodes' in the name. It should be Children.

But Children is immediate children (not grandchildren etc), while what is called Parents is all ancestors (not just the immediate parent), and that inconsistency is more confusing. So Parents should probably be Ancestors. And then All can be Descendants.

So that would be:

func (n *Node) Ancestors() iter.Seq[*Node]
func (n *Node) Children() iter.Seq[*Node] 
func (n *Node) Descendants() iter.Seq[*Node] 

Thoughts?

@jimmyfrasche
Copy link
Member

Modulo casing, FirstChild means the same thing in the browser as it does here. I don't have to remember or think about differences. That's good because when I'm using this package it's often in a context where I am bouncing back and forth between this and the browser. If the browser called it childNodes and didn't have children I wouldn't mind.

@ianlancetaylor
Copy link
Contributor

@jimmyfrasche To be clear, are you suggesting that Ancestors and Descendants seem fine, but that Children should be ChildNodes ?

@jimmyfrasche
Copy link
Member

Correct.

@earthboundkid
Copy link
Contributor Author

I think for some reason "child nodes" to me is less ambiguous and just obviously direct children, while "children" is ambiguous and could be just direct children or it could be all descendants. I don't know I feel this way since semantically it's not very different. So I am also in favor of Ancestors(), ChildNodes(), Descendants().

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/606595 mentions this issue: html: add iterators Node.{Ancestors,Children,Descendants}

@earthboundkid
Copy link
Contributor Author

@alandonovan started a separate CL, I think because he didn't notice that I had already opened golang/net#215.

I noticed one important implementation difference which is worth discussing. I have

// Parents returns a sequence yielding the node and its parents.

Alan has

// Ancestors returns an iterator over the ancestors of n, starting with n.Parent.

MDN says Element.closest returns "The closest ancestor Element or itself, which matches the selectors. If there are no such element, null."

In my experience working with node iterators in both JavaScript and Go, starting with the node itself is more consistent and useful than behavior which starts with the parent.

I also return empty sequences when doing nil.ChildNodes() or nil.Descendants(), whereas Alan panics. I think an empty sequence is more useful.

@adonovan
Copy link
Member

because he didn't notice that I had already opened golang/net#215.

Indeed; sorry about that. Though perhaps at least it raised some useful questions.

In my experience working with node iterators in both JavaScript and Go, starting with the node itself is more consistent and useful than behavior which starts with the parent.

"Ancestors" in its ordinary meaning is not a reflexive relation, so I think the reflexive behavior could potentially be confusing. Perhaps the behavior is right but the name is not; but I don't have a better naming suggestion. We could amend the doc to:

// Ancestors returns an iterator over n and all its ancestors, starting with n.

I also return empty sequences when doing nil.ChildNodes() or nil.Descendants(), whereas Alan panics. I think an empty sequence is more useful.

Is there precedent for treating a nil *Node in this way? None of the the existing three methods on *Node accept nil, but nor do they demonstrate a position on whether to quietly swallow nil. My inclination would be to panic.

@earthboundkid
Copy link
Contributor Author

My inclination would be to panic.

As I think about it more, I think panicking makes sense because it's like a deference and usually a sign of a programmatic error. It does make filtering code simpler because you don't have to do "if n != nil { filter(n.Whatever() }", but it also seems like it could cover up things where n is nil but you didn't realize it.

@rsc
Copy link
Contributor

rsc commented Aug 28, 2024

@earthboundkid re

In my experience working with node iterators in both JavaScript and Go, starting with the node itself is more consistent and useful than behavior which starts with the parent.

Can you give some examples of when it's useful, just to help us understand this intuition? Thanks!

It looks like Dom Children does not include the current node, so would you argue for Descendants not being reflexive but Ancestors being reflexive?

@earthboundkid
Copy link
Contributor Author

I've used el.closest to for example get labels on a node and parents to report to analytics. So you might click on a button and it sees <body data-label="sales-page"><div data-label="summer-promo"><button data-label="cta"> and reports "sales-page:summer-promo:cta" to the analytics. Doing that relies on closest being reflexive. If it weren't, you would lose "cta".

For Node.Ancestors() though, it's probably fine, and whoever builds a SelectClosest function in Go can just match against the Node before iterating through its Ancestors.

@earthboundkid
Copy link
Contributor Author

I changed my project to panic on iterating nil Nodes and the tests still passed with very minor changes (some ToString functions needed a check to return ""), so I think it's fine in the real world.

@earthboundkid
Copy link
Contributor Author

I think it's easier just to say none of the three are reflexive. Closest isn't that bad without it:

// Closest traverses the node and its parents until it finds a node that matches.
func Closest(n *html.Node, match func(*html.Node) bool) *html.Node {
	if match(n) {
		return n
	}
	for p := range n.Ancestors() {
		if match(p) {
			return p
		}
	}
	return nil
}

@rsc
Copy link
Contributor

rsc commented Sep 11, 2024

Great, making them non-reflexive seems fine. So it seems like we are back to #62113 (comment): Ancestors, Children (all kinds of nodes, not just element nodes), and Descendants.

Do I have that right?

@rsc rsc changed the title proposal: x/net/html: add Node.All(), Node.ChildNodes(), Nodes.Parents() iterators proposal: x/net/html: add Node methods Ancestors, Children, Descendants Sep 11, 2024
@jimmyfrasche
Copy link
Member

I strongly object to Children. This is called ChildNodes in the DOM and every HTML parsing library I've been able to find (that has such an iterator).

I do agree that it is a bad name—but it is simply the name of the thing being named here.

Giving this a name that exists in the same domain but means a related but different thing is going to cause confusion.

It would be better to drop the iterator entirely than to name it Children.

Ancestors and Descendants are fine because those names are not used in the domain so there's no collision.

@earthboundkid
Copy link
Contributor Author

I don't feel as strongly about it, but I do still feel that "ChildNodes" is more obviously just direct children to me, for some reason. Either is fine though.

@jimmyfrasche
Copy link
Member

At the very least, there should be something in the docs but even that's hard to articulate. You'd have to find some way to say that "Go(Children) = DOM(childNodes); if you want DOM(children), use Go(Children) then xiter.Filter out the nodes that aren't elements".

@adonovan
Copy link
Member

The symmetry of the Go names is nice, but the DOM is the spec here, and Children does correspond exactly to childNodes. So I would be ok with ChildNodes, Ancestors, Descendants, all non-reflexive.

@aclements
Copy link
Member

aclements commented Sep 25, 2024

In addition to what @adonovan said, given that the DOM has both Node.childNodes and Element.children, and this method corresponds to Node.childNodes and not to Element.children, I'm also inclined toward calling this Children ChildNodes (2024-10-02: fixed typo). Thus, that leads to this API:

func (n *Node) Ancestors() iter.Seq[*Node]
func (n *Node) ChildNodes() iter.Seq[*Node] 
func (n *Node) Descendants() iter.Seq[*Node] 

Where none of the functions return n itself and all panic if n == nil. Do I have that right?

@adonovan
Copy link
Member

I'm also inclined toward calling this Children.

I think you meant to say ChildNodes here. Otherwise yes.

@aclements
Copy link
Member

Whoops, yes. Edited my above post.

@aclements aclements changed the title proposal: x/net/html: add Node methods Ancestors, Children, Descendants proposal: x/net/html: add Node methods Ancestors, ChildNodes, Descendants Oct 2, 2024
@aclements
Copy link
Member

Have all remaining concerns about this proposal been addressed?

The proposal is to add the following methods to x/net/html:

func (n *Node) Ancestors() iter.Seq[*Node]
func (n *Node) ChildNodes() iter.Seq[*Node] 
func (n *Node) Descendants() iter.Seq[*Node] 

Where none of the methods return n itself and all panic if n == nil.

@earthboundkid
Copy link
Contributor Author

Do we need to put in a recursion limit to protect against malicious HTML or just assume it will crash on its own?

@adonovan
Copy link
Member

adonovan commented Oct 7, 2024

Do we need to put in a recursion limit to protect against malicious HTML or just assume it will crash on its own?

The recursion in descendants seems unlikely to use more stack than the recursion in the parser that produced the node tree in the first place.

@aclements
Copy link
Member

Based on the discussion above, this proposal seems like a likely accept.

The proposal is to add the following methods to x/net/html:

func (n *Node) Ancestors() iter.Seq[*Node]
func (n *Node) ChildNodes() iter.Seq[*Node] 
func (n *Node) Descendants() iter.Seq[*Node] 

Where none of the methods return n itself and all panic if n == nil.

@aclements aclements moved this from Active to Likely Accept in Proposals Oct 23, 2024
@earthboundkid
Copy link
Contributor Author

Seems like the CL got merged before the waiting period was over.

@ianlancetaylor
Copy link
Contributor

Sorry, that was probably my fault. I didn't double check to see whether the proposal was complete. If there are complications with the proposal we will revert.

@aclements aclements moved this from Likely Accept to Accepted in Proposals Oct 30, 2024
@aclements
Copy link
Member

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.

The proposal is to add the following methods to x/net/html:

func (n *Node) Ancestors() iter.Seq[*Node]
func (n *Node) ChildNodes() iter.Seq[*Node] 
func (n *Node) Descendants() iter.Seq[*Node] 

Where none of the methods return n itself and all panic if n == nil.

@aclements aclements changed the title proposal: x/net/html: add Node methods Ancestors, ChildNodes, Descendants x/net/html: add Node methods Ancestors, ChildNodes, Descendants Oct 30, 2024
@aclements aclements modified the milestones: Proposal, Backlog Oct 30, 2024
@ianlancetaylor
Copy link
Contributor

Proposal is accepted and was already implemented, so closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

Successfully merging a pull request may close this issue.

9 participants