Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: expand #shadow-root elements automatically in parseWithCheerio helper #2396

Merged
merged 4 commits into from
Apr 4, 2024

Conversation

B4nan
Copy link
Member

@B4nan B4nan commented Mar 28, 2024

Custom HTML elements have their content isolated under a #shadow-root property, this PR handles its expansion by traversing the DOM and looking for all custom components with a shadow root, inlining its contents. This way, we can traverse the inside of a custom component via cheerio, as well as get the text content later on.

The behavior is enabled by default and can be disabled via ignoreShadowRoots: true in the crawler options.

@github-actions github-actions bot added this to the 86th sprint - Tooling team milestone Mar 28, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Mar 28, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@B4nan B4nan added the adhoc Ad-hoc unplanned task added during the sprint. label Mar 28, 2024
@B4nan B4nan marked this pull request as ready for review April 2, 2024 13:44
@B4nan
Copy link
Member Author

B4nan commented Apr 2, 2024

I guess we can merge this one, or do you guys think we should make this configurable? I don't think it will add a perf cost, but it could produce some unwanted junk..

@janbuchar
Copy link
Contributor

I guess we can merge this one, or do you guys think we should make this configurable? I don't think it will add a perf cost, but it could produce some unwanted junk..

Configuration will make the feature hard to discover, but the potential for over-selecting is also large - I'd even call it breaking. We could enable it by default and allow disabling it...

@B4nan
Copy link
Member Author

B4nan commented Apr 3, 2024

Yes definitely, I meant configurable opt-out, not making it opt-in. I guess I will add it to be safer.

B4nan added 4 commits April 3, 2024 17:08
…heerio` helper

Custom HTML elements have their content isolated under a `#shadow-root` property, this PR handles its expansion by traversing the DOM and looking for all custom components with a shadow root, inlining its contents. This way, we can traverse the inside of a custom component via cheerio, as well as get the text content later on.
@B4nan B4nan requested review from barjin and janbuchar April 3, 2024 15:09
Copy link
Contributor

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm - good thinking with the options, I'm still a bit unsure about the * selector (performance-wise), but let's see.

I'd still add (a hidden?) opt-out option to WCC for this before we release new WCC with this Crawlee version.

@B4nan
Copy link
Member Author

B4nan commented Apr 4, 2024

Lgtm - good thinking with the options, I'm still a bit unsure about the * selector (performance-wise), but let's see.

You need to traverse the whole DOM to find them, so it feels fine to me, you would have to do it one way or the other.

I'd still add (a hidden?) opt-out option to WCC for this before we release new WCC with this Crawlee version.

Agreed, let's be safe, and let's make it hidden initially so we don't need to convince anyone that its a good idea to have it there :]

@B4nan B4nan changed the title feat(core): expand #shadow-root elements automatically in parseWithCheerio helper feat: expand #shadow-root elements automatically in parseWithCheerio helper Apr 4, 2024
@B4nan B4nan merged commit a05b3a9 into master Apr 4, 2024
9 checks passed
@B4nan B4nan deleted the shadow-root branch April 4, 2024 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants