Use WebDriver BiDi for Chrome #17961
I have a WIP locally to use BiDi with Chrome.
Nice; I hope it fixes this issue too then. If not, puppeteer/puppeteer#12111 is pretty clever, because I hadn't thought of using the demo viewer to create a more isolated test case yet, so thanks for that idea! If needed we can then probably report the issue upstream as a reduced standalone script instead, which makes it easier to reproduce and debug.
I cannot really help with the Chrome implementation of WebDriver BiDi and what might be the culprit here. @OrKoN, could you maybe have a look at it? It looks like it's something around mouse clicks, but it's interesting that the referenced code works some lines above and then the same click doesn't work when run again.
@whimboo thanks for pinging, I can take a look.
I get the above error. Could you perhaps share the Puppeteer logs for that test run?
This test should be fixed by #17962.
Right, so the test passes for me on that branch. Is the OOM the only remaining issue?
I was able to run all integration tests locally on a Mac and I didn't see any extreme memory usage. I got two test failures, but those looked like wrong selectors. I would recommend setting protocolTimeout to a lower value like 10 seconds to catch any timeouts earlier if you do not expect to have long-running calls.
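For illustration, a minimal sketch of that suggestion, assuming a plain Puppeteer launch (`protocolTimeout` is a real Puppeteer launch option; the 10-second value just mirrors the advice above):

```js
// Minimal sketch: lower protocolTimeout so hanging protocol calls fail
// fast instead of stalling the run (the Puppeteer default is 180 seconds).
import puppeteer from "puppeteer";

const browser = await puppeteer.launch({
  protocolTimeout: 10_000, // milliseconds; fail fast on unanswered calls
});
```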
Yes, the last tries with the PR above gave us this OOM crash on the Linux bot only, and a crash in the middle of the unit tests on the Windows 10 bot.
Could you try it with the canary build (install …)?
Done: #17962 (comment) |
Same OOM crash with canary. |
Which node version do you use? |
18.16.0 |
Actually, after I tried to run it one more time and left it unattended without moving the mouse, all tests passed. |
I wanted to try it on a Windows 10 machine, but web.archive.org times out for me. Is there another way to fetch the test files?
Never mind, it was a network issue that I managed to solve.
So I ran it on Windows 10 and I do not see excessive memory use: Node peaks at 400 MB and drops to 200 MB during the test run.
@timvandermeij interesting, I am trying to run it now and it looks like it runs in headless mode. Is there a way to make it headful, or perhaps to double-check that tabs are getting closed properly? Perhaps some tabs remain open for longer than needed, maybe blocked by some alerts or beforeunload handlers or something like that?
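For reference, a minimal sketch of launching headful through Puppeteer (`headless` and `slowMo` are standard Puppeteer launch options; wiring this into the pdf.js test runner is left aside here):

```js
// Minimal sketch: run Chrome headful so it is possible to watch whether
// tabs actually get closed during the test run.
import puppeteer from "puppeteer";

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 50,      // slow protocol calls down a bit to make them observable
});
```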
The comment #17962 (comment) looked like the OOM happened because the Node process was not able to get more than 4 GB of memory.
Good catch; I'm not sure what to make of that yet, because Node.js itself shouldn't be doing much other than starting browsers through Puppeteer and kicking off the reference testing framework in said browsers. Locally I think it was in fact Chrome being killed by the OOM killer, and not the Node.js process, given the logs I captured.
However, I'm not sure if either one might be a red herring, because I don't think it's guaranteed that the OOM killer kills the process that is using the most memory: that can be configured, so it could be that on the bots Chrome also used the most memory but the OOM killer decided to kill another process first.
The reference tests run differently compared to the integration tests. The integration tests work like one would expect from a Puppeteer point of view: we start two browsers and open/close tabs for each test. For the reference tests, however, we use Puppeteer only to spawn the browsers for us, and they always have a single tab that points to our reference test framework. That framework in turn loads each PDF file, compares/takes snapshots and closes said PDF files again. It doesn't do that by opening new tabs to load the PDF, but instead by creating a canvas, rendering the PDF onto it, writing the canvas to a file and destroying the canvas. That explains why, unlike the integration tests, we don't have to open/close tabs and there is therefore pretty much no usage of e.g. Puppeteer APIs (see the sketch below).

Now that I'm writing this, it makes it even more interesting that the reference tests are where it fails, both on the bots and locally, so that could be an indication that it's somehow caused by the reference testing framework. We should be cleaning up immediately after snapshotting, and a quick look at the code here and there seems to confirm that, but we should probably look into it more to make sure there is no defect in there. Note that for the reference tests we already only run Chrome on Linux, because quite some time ago we had to disable Chrome on Windows for the reference tests only (in #14509) because of excessive run times. I have also seen locally that, before the OOM killer kicks in, Chrome slows down quite significantly, so maybe that slowdown was a sign of the issue we experience here. I can't link any of these events for sure and therefore can't rule anything out yet, but at least it looks like there may be a relation here.

I have some ideas already from looking at the code, and I'll try to think of a good way to debug this further so we can hopefully at least figure out where the memory leak is coming from (our own code, Node.js, Chrome, Puppeteer, etc.). You did give me some new ideas/pointers so far, so thank you for that!
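For readers unfamiliar with that flow, here is a rough sketch of what such a single-tab reference run looks like, assuming the standard pdf.js viewer API (an illustration, not the actual test framework code):

```js
// Rough sketch of the reference-test flow described above: one page loads
// each PDF, renders every page onto a canvas, snapshots it, and destroys
// the canvas again. No tabs are opened or closed, so Puppeteer stays idle.
async function snapshotPdf(pdfjsLib, url) {
  const pdf = await pdfjsLib.getDocument(url).promise;
  const snapshots = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 1 });
    const canvas = document.createElement("canvas");
    canvas.width = viewport.width;
    canvas.height = viewport.height;
    await page.render({ canvasContext: canvas.getContext("2d"), viewport })
      .promise;
    snapshots.push(canvas.toDataURL("image/png")); // take the snapshot
    canvas.width = canvas.height = 0; // destroy the canvas right away
  }
  await pdf.destroy(); // close the PDF file again
  return snapshots;
}
```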
I have created a reduced version of the test framework/manifest to reproduce the issue much more easily. Steps to reproduce:
I'll check if I can reduce it even further, but in the meantime could you check if you can reproduce the issue with these steps? The only variable here is the change from CDP to WebDriver BiDi, so that at least shows an unexpected behavior change (even though the root cause might be elsewhere, such as a leak in PDF.js or the test framework, but even then CDP is somehow able to cope with it...).
@timvandermeij is the high memory usage coming from the browser or from the Puppeteer process? For Chrome, the WebDriver BiDi implementation runs in the Puppeteer process.
I think it's the Puppeteer process: the …
I seem to be able to reproduce the increases in the node process. |
It peaked at around 2.5 GB for me (Intel Mac).
Right, those are most likely the snapshots. The browser loads each page of the PDF file, draws it on a canvas, takes a snapshot of the canvas and sends it from the browser to the test runner process (which can then e.g. save it to disk). This is done using network requests at https://github.com/mozilla/pdf.js/blob/master/test/driver.js#L1033-L1034. I did try setting …

In the reproducer this method is called 2500 times, and if each image is around 600 KB as indicated in the screenshot, that would amount to around 1.5 GB (however, some of it eventually does seem to be freed, which probably makes it a bit less in practice, and that's what we have seen).
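To make the arithmetic concrete, a hedged sketch of how such a snapshot could be shipped from the browser to the test runner over HTTP (the `/submit_snapshot` endpoint and field names are hypothetical, not the actual driver.js route):

```js
// Hypothetical sketch: POST the canvas snapshot back to the test runner.
// If the protocol layer retains every request body, 2500 calls at roughly
// 600 KB each add up to about 1.5 GB, matching the observed growth.
async function sendSnapshot(taskId, pageNumber, canvas) {
  const snapshot = canvas.toDataURL("image/png"); // ~600 KB per page here
  await fetch("/submit_snapshot", {               // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ taskId, pageNumber, snapshot }),
  });
}
```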
Version 22.10.1 contains the fix for the network request storage. Could you please trigger the tests using that version? |
Wow, that's fast; thank you for the quick action! I'll try it out locally and prepare a patch to update our version of Puppeteer, and if that's successful we can retrigger the tests for the Chrome BiDi PR. |
I have confirmed locally that the memory leak indeed no longer happens. I have created #18231 to update Puppeteer, and once that lands we can rebase the Chrome BiDi PR and retrigger the tests.
Good news: in #17962 the Linux bot is happy now, so no more OOM, and the runtime is pretty much the same as with CDP. However, Windows is a different story: while the OOM problems are solved, it looks like WebDriver BiDi in combination with Chrome on Windows is quite a bit slower, causing the tests to eventually hit the 1-hour timeout mark. While they do eventually complete, we have also seen some strange logs in the output. We'll look into that further, and it might be a general Windows bot issue because the Windows bot has always been slower than the Linux bot, but in any case: thanks so far @OrKoN for fixing the OOM issue!
@timvandermeij do you also experience the slowness locally? |
Not on Linux, but I don't have a Windows machine to try it out on at the moment. We're currently testing it on the Windows bot to see if it can be reproduced and to exclude this being an issue with the bot itself somehow, and if that doesn't lead to insights then we can probably spin up a Windows VM or so to try to reproduce it. |
Can you see which tests are slow? As I understand it, the bot runs multiple kinds of tests in Chrome and Firefox.
I have a Windows machine; I'll test it locally.
Yes, that sounds likely, and looking at #17962 (comment) the tests now worked, within the expected runtime, on Windows too. It might be that the left-over process was from one of the earlier runs. In any case, let's retry it a couple of times and check locally indeed, and hopefully it will keep working :-)
I recall there is an existing issue where the crashpad handler is launched unconditionally on Windows. I wonder if there is a crash during the test run? Also, I think the latest available version is 126, but it looks like the bot runs with 125.
Yep, probably something went wrong (see the end of http://54.193.163.58:8877/a53e51e62057105/output.txt), but I think in general Puppeteer shouldn't leave those processes (especially if they're using some CPU) running.
Crash reporting is disabled for Chrome for Testing builds. You could enable crash reporting in regular Chrome (it requires manually ticking the checkbox when the profile is created, or via the browser settings). It looks like in this case the Node process is killed on an uncaught exception and so it does not close the browser properly; it might be another bug in chromium-bidi. Apart from crash reporting, setting dumpio: true should make any crashes visible in the stdout (see the sketch below).
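A minimal sketch of that dumpio suggestion (`dumpio` is a real Puppeteer launch option; everything else here is a generic example):

```js
// Minimal sketch: forward the browser's stdout/stderr into the Node
// process output so browser crashes show up in the test logs.
import puppeteer from "puppeteer";

const browser = await puppeteer.launch({
  dumpio: true, // pipe browser stdio to process.stdout/process.stderr
});
```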
I don't recall having seen this particular issue with a left-over (active) browser process for CDP. |
I'm reopening this because we had to revert to CDP for Chrome in #18506 because of sudden timeout/OOM issues during the browser tests on the Linux bot that we didn't see anymore after disabling WebDriver BiDi. I have reconstructed the following series of events that led to this:
Note that using WebDriver BiDi for Firefox and CDP for Chrome also matches the upcoming default behavior in Puppeteer 23; see puppeteer/puppeteer#12728 and https://github.com/puppeteer/puppeteer/pull/12732/files#diff-8b40d1cbf56e2b8f3e21919bcd5f815060634e9b604a5f026f452633ac2e2f3bR96. Given that upstream doesn't enable WebDriver BiDi by default for Chrome yet, but does for Firefox, it could perhaps be an indication that it's not fully ready yet.

Now that the Linux bot is stable again and the dependency updates are done, we should find out why WebDriver BiDi suddenly caused issues for Chrome after two weeks without problems. It's important to note that we did try running the browser tests with the Puppeteer version we were already using before the issues occurred, which should use the same Chrome version because it's pinned by Puppeteer (in contrast to Firefox, for which the latest version is fetched), but that didn't change anything. Given that reverting the Puppeteer update didn't change anything, we're reasonably sure it's not a regression in Puppeteer, but it could be a regression in a browser binary because we had to purge the cache.
Note that we do not turn on WebDriver BiDi by default for Chrome because CDP currently offers more Chrome-specific features that many users rely on; WebDriver BiDi for Chrome is considered ready within the scope of the functionality available via WebDriver BiDi. It looks like the main problem is not a memory leak in chromium-bidi or Puppeteer but leaking browser processes. We do a very similar thing for WebDriver BiDi as for CDP when we close the browser, so I think we would need to narrow the issue down. Is there a way for us to reproduce the issue on a smaller scale? As a workaround, we could try to identify browser processes after the test run and report/kill them to make sure the next runs do not fail with OOM.
I looked at the code, and I wonder: if an error happens here (https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L962), it does not look like the browser would be closed (probably it is not the cause, though).
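For illustration, the usual pattern to avoid that kind of leak is a try/finally around the run (a generic sketch with hypothetical helper names, not the actual test.mjs code):

```js
// Generic sketch: close the browser even when the test run throws.
async function runWithBrowser(startBrowser, runAllTests) {
  const browser = await startBrowser(); // hypothetical helper
  try {
    await runAllTests(browser);         // hypothetical helper
  } finally {
    await browser.close(); // runs even if runAllTests throws
  }
}
```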
Also, the use of …
@timvandermeij would it be possible to run a script before and after the tests that prints out all still-running Chrome processes? That way we could probably see if a given test job leaves processes behind. Maybe a cleanup step could also run after the tests to kill all remaining browser processes. Given that you re-use the same machine and don't reset state in between runs, piling up Chrome processes might explain the OOM. A sketch of such a script follows below.
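A possible sketch of such a script, assuming plain Node and the usual platform tools (the exact commands are assumptions, not pdf.js bot code):

```js
// Sketch: print all Chrome processes that are still alive. Run once before
// and once after the test job to spot leftover browser processes.
import { execSync } from "node:child_process";

function listChromeProcesses() {
  const cmd =
    process.platform === "win32"
      ? 'tasklist /FI "IMAGENAME eq chrome.exe"'
      : "ps -eo pid,rss,comm | grep -i chrome || true";
  console.log(execSync(cmd, { encoding: "utf8" }));
}

listChromeProcesses();
```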
Thank you both for the feedback! I think we should try this again with the most recent Puppeteer version in place, and if it still happens we should indeed gather more debug information, since I realize it's not a lot of information to go by right now. I'm planning to update #18590 soon so we can upgrade to Puppeteer 23 first.
Hi @timvandermeij. Do you have an update for this issue? |
We finished the Puppeteer 23 upgrade, so we should be able to try this again now. We waited a bit due to holidays, during which we didn't want to potentially break the bots. I currently don't have a lot of time to work on this, so if anyone wants to make a PR to enable BiDi for Chrome again and, if possible, add some useful debugging tools, please do (otherwise I'll see what I can do later).
I have been looking at the integration test code and noticed the following code: https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L908-L910
Some time ago I tried using WebDriver BiDi for Chrome too, and I recall it failed pretty quickly back then. However, I tried again today with the current Puppeteer version and the results are much better now: only one test failure remains! Hence I figured now might be a good time to re-investigate WebDriver BiDi compatibility for Chrome, since it would be great if we could move away from CDP for all browsers we support.
I tried debugging this test, `FreeText Editor Freetext must stay focused after having been moved must keep the focus`, but I can't figure out why it works with CDP and fails with WebDriver BiDi. The traceback of the failing test is: …

Interestingly, I found that the test passes with WebDriver BiDi if I comment out this code: `pdf.js/test/integration/freetext_editor_spec.mjs`, lines 2362 to 2366 in 7290faf.
I have enabled verbose logging using the Puppeteer debugging instructions from https://github.com/puppeteer/puppeteer/blob/ef8c4c808aa9dddd4d2501c8fc160cfcd0c4b9d1/docs/guides/debugging.md?plain=1#L131, but I can't really interpret the output because I don't know much about the WebDriver BiDi protocol and Puppeteer internals.
To reproduce the issue:

1. Run `npm install`.
2. Change `cdp` to `webDriverBiDi` in `pdf.js/test/test.mjs` (line 881 in 7290faf).
3. Change `it` to `fit` in `pdf.js/test/integration/freetext_editor_spec.mjs` (line 2315 in 7290faf).
4. Run `npx gulp integrationtest` and notice that the test hangs. If the protocol is changed back to CDP, the test passes.

The verbose debugging log can be obtained with `env DEBUG="puppeteer:*" npx gulp integrationtest`, for which it helps to change https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L961 to only run Chrome (otherwise Firefox also spams the logs).

@whimboo Is this perhaps something you could help out with? It could have surfaced a bug in the WebDriver BiDi implementation, but because I'm stuck on debugging this I also can't create a reduced test case to report upstream...