-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX failing sphinx-gallery CI #1145
FIX failing sphinx-gallery CI #1145
Conversation
So, it succeeds when I cut example 07 just before importing the Joiner. Let's now run the 2 next cells. |
Let's now run everything but the final CV |
It worked (the tick is red because some unrelated tests failed due to server issues). |
... which it did. So, there might be some weird memory usage due to the text-encoder example, and sentence-transformer or pytorch caching shenanigans. I didn't properly run all examples on the first commit, let's do it now and check that everything runs smoothly. |
So, it succeeds when I cut example 07 just before importing the Joiner
OK, so it's an interaction...
|
It failed, that's weird. It means that this is not provoked by the TextEncoder example: https://app.circleci.com/pipelines/github/skrub-data/skrub/4761/workflows/c392d89f-b5dd-4ab2-8d72-3438dd3e4911/jobs/9249 (example 2 is not run there, although example 7 fails). Let's try this one more time, replacing the CV with a simple |
The simple fit worked. So we're likely having a memory error due to CV on the (Joiner, TV, HGBT) pipeline. But why now? Let's remove the transformers module from the doc requirements and see what happens if I set the CV back. |
It could be triggered by an import...
|
But why now?
Memory that is not cleared? Maybe something that you can investigate on your computer.
Are we holding a reference on a big object?
|
thanks a lot for investigating this @Vincent-Maladiere I know it's not fun 😅 . it's good to have in mind that sphinx gallery runs all the examples in the same process so I guess it is possible that some of the memory from the textencoder example has not been freed when the example 7 runs and that the 2 together pass circleci's memory limit |
So, when we remove the "transformers" environment (i.e., pytorch, transformers and sentence-transformers dependencies) from the pixi doc environment, the CI runs fine.
That was my hypothesis too, but we have observed that without example 02 (which uses TextEncoder):
We only import sentence-transformers or pytorch within the TextEncoder, not outside. However, since I disabled example 02, the CI never imports or uses the TextEncoder. So I don't know how to conclude here. I will try to reproduce it locally. |
thanks a lot for investigating this @Vincent-Maladiere I know it's not fun 😅
Indeed. But it is an amazing stress test that reveals how the code lives (and dies) in challenging settings. It makes the software more robust, which the users then love.
. it's good to have in mind that sphinx gallery runs all the examples in the same process so I guess it is possible that some of the memory from the textencoder example has not been freed when the example 7 runs and that the 2 together pass circleci's memory limit
Quite possible. If that's the case, hopefully we can find how to free memory: it would make the library more lean.
|
That was my hypothesis too, but we have observed that without example 02 (which uses TextEncoder):
- Running the CV on example 07, with "transformers" in the doc environment fails ❌
- Running **a simple fit** on example 07, with "transformers" in the doc environment works ✅
- Running the CV **without "transformers"** in the doc environment works ✅
We only import sentence-transformers or pytorch within the TextEncoder, not outside. However, since I disabled example 02, the CI never imports or uses the TextEncoder. So I don't know how to conclude here.
if both installing transformers and running the CV take a lot of time,
could it be a time limit?
|
This is not what's happening. Cf my message above.
Interesting, it could be. |
I can't reproduce that locally. Let's rerun the CI with example 02 (TextEncoder), but without example 07 (Joiner) as a sanity check. Note that example 07 takes 8min to run: https://skrub-data.org/stable/sg_execution_times |
Great! So, the example 02 runs fine. The matter is with example 07. It looks like a timeout issue as @jeromedockes guessed. Let's have another run with all examples and |
When example 02 runs, the CI crashes before reaching the cross-validation part of example 07. It looks like there is a timeout at 10min (11 actually) in circle-ci, but the run is not idle for long, and Could it be RAM consumption? But then, why would the cross-validation in example 07 crash now, whereas we never import TextEncoder, or torch. I'm starting to run out of ideas. |
I think the duration of example 7 is a problem in itself anyways -- it takes 8 minutes on circleci. So I would suggest reducing it by reducing the sizes of the subsamples we use (say 5,000 flights and 10,000 weather points), and replacing the cross-validation with a train/test split. Hopefully that will get the doc build passing again. In a second time, we can work on improving that example and maybe using a different dataset. Indeed, at the moment
we may never pin down the exact reason why the circleci job getting killed but I think I can live with that as long as we get it running again :D |
and I think we owe @Vincent-Maladiere a beer for this long session of blind debugging with a 20-min feedback delay which can be quite frustrating 😅 |
Haha, I couldn't agree more! You're right, let's revamp this example slightly to make the CI run.
I only did my duty 🫡 |
cool, thanks! and if that doesn't work: I remember that for nilearn at some point I had reached out to the circleci support and had a good experience (and nilearn uses the free plan) so that option exists as well if all else fails |
It worked! 🎉🎉 The computation times on Circle-ci are now:
We could also simplify/speed up example 08 by removing the grid search and hardcoding the best hyper-parameters, in another PR. |
nice!! thanks again! I'll merge it before circleci gets a chance to change its mind 😆 |
the example still looks pretty much the same after reducing the subsample sizes so that's good |
Wohoo!
Thanks a lot Vincent!
|
Fix #1143