Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX failing sphinx-gallery CI #1145

Merged
merged 19 commits into from
Nov 20, 2024

Conversation

Vincent-Maladiere
Copy link
Member

Fix #1143

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

So, it succeeds when I cut example 07 just before importing the Joiner. Let's now run the 2 next cells.

@Vincent-Maladiere
Copy link
Member Author

Let's now run everything but the final CV

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

It worked (the tick is red because some unrelated tests failed due to server issues).
Then, the last cell with the CV should crash

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

... which it did. So, there might be some weird memory usage due to the text-encoder example, and sentence-transformer or pytorch caching shenanigans.

I didn't properly run all examples on the first commit, let's do it now and check that everything runs smoothly.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 19, 2024 via email

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

It failed, that's weird. It means that this is not provoked by the TextEncoder example: https://app.circleci.com/pipelines/github/skrub-data/skrub/4761/workflows/c392d89f-b5dd-4ab2-8d72-3438dd3e4911/jobs/9249 (example 2 is not run there, although example 7 fails).

Let's try this one more time, replacing the CV with a simple fit.

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

The simple fit worked. So we're likely having a memory error due to CV on the (Joiner, TV, HGBT) pipeline. But why now? Let's remove the transformers module from the doc requirements and see what happens if I set the CV back.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 19, 2024 via email

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 19, 2024 via email

@jeromedockes
Copy link
Member

thanks a lot for investigating this @Vincent-Maladiere I know it's not fun 😅 . it's good to have in mind that sphinx gallery runs all the examples in the same process so I guess it is possible that some of the memory from the textencoder example has not been freed when the example 7 runs and that the 2 together pass circleci's memory limit

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

So, when we remove the "transformers" environment (i.e., pytorch, transformers and sentence-transformers dependencies) from the pixi doc environment, the CI runs fine.

some of the memory from the textencoder example has not been freed when the example 7 runs and that the 2 together pass circleci's memory limit

That was my hypothesis too, but we have observed that without example 02 (which uses TextEncoder):

  • Running the CV on example 07, with "transformers" in the doc environment fails
  • Running a simple fit on example 07, with "transformers" in the doc environment works
  • Running the CV without "transformers" in the doc environment works

We only import sentence-transformers or pytorch within the TextEncoder, not outside. However, since I disabled example 02, the CI never imports or uses the TextEncoder. So I don't know how to conclude here. I will try to reproduce it locally.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 19, 2024 via email

@jeromedockes
Copy link
Member

jeromedockes commented Nov 19, 2024 via email

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

Quite possible. If that's the case, hopefully we can find how to free memory: it would make the library more lean.

This is not what's happening. Cf my message above.

could it be a time limit?

Interesting, it could be.

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

I can't reproduce that locally. Let's rerun the CI with example 02 (TextEncoder), but without example 07 (Joiner) as a sanity check. Note that example 07 takes 8min to run: https://skrub-data.org/stable/sg_execution_times

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

Great! So, the example 02 runs fine. The matter is with example 07. It looks like a timeout issue as @jeromedockes guessed.

Let's have another run with all examples and transformers, with maximum verbose on example 07 CV, to see if this is related to sphinx-gallery/sphinx-gallery#301

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 19, 2024

When example 02 runs, the CI crashes before reaching the cross-validation part of example 07.

It looks like there is a timeout at 10min (11 actually) in circle-ci, but the run is not idle for long, and no_output_timeout is set to 30m anyway. With circle-ci, the max run time is 1h for freemium accounts, so we're not hitting this limit either. I can't find any other kind of timeout limit.

Could it be RAM consumption? But then, why would the cross-validation in example 07 crash now, whereas we never import TextEncoder, or torch. I'm starting to run out of ideas.

@jeromedockes
Copy link
Member

I think the duration of example 7 is a problem in itself anyways -- it takes 8 minutes on circleci. So I would suggest reducing it by reducing the sizes of the subsamples we use (say 5,000 flights and 10,000 weather points), and replacing the cross-validation with a train/test split. Hopefully that will get the doc build passing again.

In a second time, we can work on improving that example and maybe using a different dataset. Indeed, at the moment

  • it downloads a very large dataset
  • the join with weather data does not improve predictions significantly IIRC
  • the predictions are not super far from chance level which makes the example less compelling IMO

we may never pin down the exact reason why the circleci job getting killed but I think I can live with that as long as we get it running again :D

@jeromedockes
Copy link
Member

and I think we owe @Vincent-Maladiere a beer for this long session of blind debugging with a 20-min feedback delay which can be quite frustrating 😅

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 20, 2024

we may never pin down the exact reason why the circleci job getting killed but I think I can live with that as long as we get it running again :D

Haha, I couldn't agree more! You're right, let's revamp this example slightly to make the CI run.
I'm also creating an issue to improve this example in the longer term.

and I think we owe @Vincent-Maladiere a beer for this long session of blind debugging with a 20-min feedback delay which can be quite frustrating 😅

I only did my duty 🫡

@jeromedockes
Copy link
Member

cool, thanks! and if that doesn't work: I remember that for nilearn at some point I had reached out to the circleci support and had a good experience (and nilearn uses the free plan) so that option exists as well if all else fails

@Vincent-Maladiere
Copy link
Member Author

It worked! 🎉🎉

The computation times on Circle-ci are now:

computation time summary:
    - ../examples/08_join_aggregation.py:            216.80 sec   0.0 MB
    - ../examples/02_text_with_string_encoders.py:   180.11 sec   0.0 MB
    - ../examples/06_ken_embeddings.py:               69.42 sec   0.0 MB
    - ../examples/01_encodings.py:                    67.53 sec   0.0 MB
    - ../examples/07_multiple_key_join.py:            28.30 sec   0.0 MB
    - ../examples/04_fuzzy_joining.py:                11.46 sec   0.0 MB
    - ../examples/00_getting_started.py:               6.69 sec   0.0 MB
    - ../examples/03_datetime_encoder.py:              5.45 sec   0.0 MB
    - ../examples/09_interpolation_join.py:            5.31 sec   0.0 MB
    - ../examples/05_deduplication.py:                 4.65 sec   0.0 MB

We could also simplify/speed up example 08 by removing the grid search and hardcoding the best hyper-parameters, in another PR.

@jeromedockes
Copy link
Member

nice!! thanks again! I'll merge it before circleci gets a chance to change its mind 😆

@jeromedockes
Copy link
Member

the example still looks pretty much the same after reducing the subsample sizes so that's good

@jeromedockes jeromedockes merged commit 2b8d68b into skrub-data:main Nov 20, 2024
25 checks passed
@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 21, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CircleCI no longer runs
3 participants