feat: expand #shadow-root elements automatically in parseWithCheerio helper #2396
Pull request is neither linked to an issue or epic nor labeled as adhoc!
packages/playwright-crawler/src/internals/utils/playwright-utils.ts
I guess we can merge this one, or do you guys think we should make this configurable? I don't think it will add a perf cost, but it could produce some unwanted junk.
Making it opt-in would make the feature hard to discover, but the potential for over-selecting is also large; I'd even call it breaking. We could enable it by default and allow disabling it.
Yes, definitely. I meant a configurable opt-out, not making it opt-in. I guess I will add it to be safer.
…heerio` helper Custom HTML elements have their content isolated under a `#shadow-root` property. This PR handles its expansion by traversing the DOM, looking for all custom components with a shadow root, and inlining their contents. This way, we can traverse the inside of a custom component via cheerio, as well as get the text content later on.
Lgtm - good thinking with the options, I'm still a bit unsure about the `*` selector (performance-wise), but let's see.
I'd still add (a hidden?) opt-out option to WCC for this before we release new WCC with this Crawlee version.
You need to traverse the whole DOM to find them, so it feels fine to me; you would have to do it one way or the other.
Agreed, let's be safe, and let's make it hidden initially so we don't need to convince anyone that it's a good idea to have it there :]
Custom HTML elements have their content isolated under a `#shadow-root` property. This PR handles its expansion by traversing the DOM, looking for all custom components with a shadow root, and inlining their contents. This way, we can traverse the inside of a custom component via cheerio, as well as get the text content later on.

The behavior is enabled by default and can be disabled via `ignoreShadowRoots: true` in the crawler options.
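
To make the idea concrete, here is a minimal sketch of the expansion step. It is not the code from this PR: it runs on a hypothetical plain-object tree (`SketchNode`) instead of real browser `Element` and `ShadowRoot` objects, and `serializeWithShadowRoots` is an illustrative name. Only the `ignoreShadowRoots` flag name comes from the description above. Text is not HTML-escaped, to keep the sketch short.

```typescript
// Hypothetical node model standing in for DOM elements (an assumption for
// this sketch; the real helper works on the page's DOM).
interface SketchNode {
    tag: string;
    text?: string;
    children?: SketchNode[];
    // Content that would be isolated under #shadow-root in a real page.
    shadowRoot?: SketchNode[];
}

interface SerializeOptions {
    // Mirrors the opt-out flag from the PR description; defaults to false.
    ignoreShadowRoots?: boolean;
}

// Walks the whole tree once and emits HTML in which the contents of every
// shadow root are inlined into their host element.
function serializeWithShadowRoots(node: SketchNode, options: SerializeOptions = {}): string {
    const { ignoreShadowRoots = false } = options;
    const parts: string[] = [];

    // Inline the shadow root's contents first, unless the caller opted out.
    if (node.shadowRoot && !ignoreShadowRoots) {
        for (const child of node.shadowRoot) {
            parts.push(serializeWithShadowRoots(child, options));
        }
    }

    if (node.text) parts.push(node.text);

    // Regular (light DOM) children follow.
    for (const child of node.children ?? []) {
        parts.push(serializeWithShadowRoots(child, options));
    }

    return `<${node.tag}>${parts.join('')}</${node.tag}>`;
}

const page: SketchNode = {
    tag: 'body',
    children: [
        { tag: 'h1', text: 'Title' },
        { tag: 'my-card', shadowRoot: [{ tag: 'p', text: 'Hidden in shadow DOM' }] },
    ],
};

const expanded = serializeWithShadowRoots(page);
const plain = serializeWithShadowRoots(page, { ignoreShadowRoots: true });
```

With the default options, `expanded` contains the `<p>` from inside `my-card`, so a parser such as cheerio could select it and read its text. With `ignoreShadowRoots: true`, `my-card` serializes as an empty element, which is the opt-out discussed in the review.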