bug: shadow root expansion in parseWithCheerio removes content #2583

Closed
barjin opened this issue Jul 17, 2024 · 0 comments · Fixed by #2587
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

barjin commented Jul 17, 2024

On some pages (e.g. https://accessgroup.my.site.com/Support/s/article/Dimensions-AOI-Error-Category-number-cannot-be-changed), the call to the parseWithCheerio Playwright helper removes some of the content from the page.

This does not happen when the ignoreShadowRoots constructor option is set to true.

Repro:

import { PlaywrightCrawler } from 'crawlee';
import { setTimeout } from 'timers/promises';

const startUrls = ['https://accessgroup.my.site.com/Support/s/article/Dimensions-AOI-Error-Category-number-cannot-be-changed'];

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ parseWithCheerio, page }) => {
        // wait till the content loads
        await page.waitForSelector('.cKnowledge_Articles');

        // parse with Cheerio, expanding the shadow roots (and removing the content).
        await parseWithCheerio();

        await setTimeout(10e3);
    },
    headless: false,
    ignoreShadowRoots: false, // set to `true` to make it work correctly
});

await crawler.run(startUrls);

With ignoreShadowRoots: false:

bad.mp4

With ignoreShadowRoots: true:

good.mp4
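
The difference between the two recordings could also be checked programmatically. Below is a minimal sketch, not part of the original report: `visibleTextLength` and `detectContentLoss` are hypothetical helper names, and the only Playwright call assumed is `page.evaluate`. It compares the length of the rendered text before and after parsing.

```javascript
// Hypothetical helper: length of the text currently rendered on the page.
// `page` is assumed to be a Playwright Page; only `page.evaluate` is used.
async function visibleTextLength(page) {
    return page.evaluate(() => document.body.innerText.length);
}

// Hypothetical helper: runs `parse` (e.g. parseWithCheerio) and reports
// whether the rendered text shrank as a side effect of the call.
async function detectContentLoss(page, parse) {
    const before = await visibleTextLength(page);
    await parse();
    const after = await visibleTextLength(page);
    return { before, after, contentLost: after < before };
}

// Possible use inside the requestHandler from the repro:
//   const report = await detectContentLoss(page, parseWithCheerio);
//   if (report.contentLost) console.warn('parseWithCheerio removed content', report);
```

Comparing text length is a crude signal, but it should be enough to tell the two cases apart without watching the browser window.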

CC @B4nan as the author of the original PR adding the shadow root expansion.

barjin added the bug and t-tooling labels Jul 17, 2024
B4nan pushed a commit that referenced this issue Jul 24, 2024
`expandShadowRoots` now does not replace the original content but
creates a separate (non-shadow) subtree as a sibling of the original
one.

Note that this doesn't duplicate content $^{[1]}$, as the original
shadow roots are inaccessible from JS by design.
`document.documentElement.outerHTML` returns only the contents of (new)
regular DOM elements. The page also still looks the same, as a shadow
DOM tree [masks any "light" DOM
sibling](https://stackoverflow.com/questions/47500157/shadow-root-sibling-elements-disappear-on-attachshadow-call).

Closes #2583 

------

$^{[1]}$ With the notable exception of
[accessgroup.my.site.com](https://accessgroup.my.site.com/Support/s/article/Access-Capture-Error-Cant-reach-this-page-when-trying-to-access-the-Capture-web-page)
pages, where the use of [custom DOM
elements](https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_custom_elements)
somehow creates elements with a `shadowRoot` property, the content of
which is still accessible from JS(???). This is likely coming from the
[ancient web framework](https://github.com/aurajs/aura) they are using.

Either way, double the content is probably better than none (so far we
have only noticed this issue on this page).
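
The idea in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions and not the actual Crawlee implementation: it uses plain objects instead of real DOM nodes, has no `<slot>` support, and all function names (`el`, `serializeLight`, `renderedText`, `expandShadowRootsSketch`) are hypothetical.

```javascript
// Toy element: light-DOM children plus an optional shadow tree (array of nodes).
const el = (tag, children = [], shadow = null) => ({ tag, children, shadow });

// Like outerHTML in this model: serializes light-DOM children only.
function serializeLight(node) {
    if (typeof node === 'string') return node;
    return `<${node.tag}>${node.children.map(serializeLight).join('')}</${node.tag}>`;
}

// What gets painted: a shadow tree, when present, masks the light children.
function renderedText(node) {
    if (typeof node === 'string') return node;
    const visible = node.shadow ?? node.children;
    return visible.map(renderedText).join('');
}

// Copies each shadow tree into the light DOM of its host and leaves the
// shadow tree itself untouched, so nothing that was rendered is removed.
function expandShadowRootsSketch(node) {
    if (typeof node === 'string') return node;
    const children = node.children.map(expandShadowRootsSketch);
    if (node.shadow) {
        children.push(...node.shadow.map(expandShadowRootsSketch));
    }
    return { ...node, children };
}

const pageModel = el('body', [
    el('h1', ['Title']),
    el('article-view', [], [el('p', ['Article body'])]),
]);
const expanded = expandShadowRootsSketch(pageModel);

console.log(serializeLight(pageModel));
console.log(serializeLight(expanded));
console.log(renderedText(pageModel) === renderedText(expanded));
```

In this model the serialized HTML gains the shadow content after expansion, while the rendered text stays the same because the shadow tree still masks the copied light-DOM nodes.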